Qwen2.5-7B-Instruct

Updated: 27.06.2026
Tags: Reasoning, Code, Multilingual, Tools

A 7.61B-parameter instruction-tuned LLM from Alibaba's Qwen2.5 series with strong coding, math, and multilingual ability across 29+ languages.

Install

pip install -U transformers
huggingface-cli download Qwen/Qwen2.5-7B-Instruct
# or run quantized via Ollama:
ollama run qwen2.5:7b-instruct

cURL

# assumes an OpenAI-compatible server (for example vLLM) is already listening on localhost:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# build the prompt with the model's chat template
messages = [{"role": "user", "content": "Give me a short intro to LLMs."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)

# keep only the newly generated tokens, then decode them
new_tokens = out[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

JavaScript

// same local OpenAI-compatible server as the cURL example; the API key is a placeholder
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "sk-local" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-7B-Instruct",
  messages: [{ role: "user", content: "Hello!" }]
});
console.log(res.choices[0].message.content);

More models

Name | Size / Usage | Context | Input
Qwen3.6-35B-A3B | - | 256K | Text, Image
Qwen3.6-27B | - | 256K | Text, Image
Qwen3-8B | Reasoning, coding, agentic chat | 128K | Text
Qwen2.5-32B-Instruct | General chat, coding, reasoning | 128K | Text
Qwen2.5-14B-Instruct | General chat, coding, reasoning | 128K | Text
Qwen2.5-Coder-7B-Instruct | Code generation, code reasoning | 128K | Text
Qwen2.5-Coder-32B-Instruct | Code generation, fixing, agents | 128K | Text
Qwen3-235B-A22B | Reasoning, coding, agentic tasks | 128K | Text
Qwen3-30B-A3B-Instruct-2507 | General chat, reasoning, agents | 256K | Text
Qwen2.5-72B-Instruct | General chat, coding, reasoning | 128K | Text
Qwen3-4B-Thinking-2507 | Reasoning, math, coding | 256K | Text
WebWorld-8B | Web agents, multimodal reasoning | 40K | Text, Image
MiniCPM-V 4.6 | 5213 GB421K | - | Text, Image
anima | 421 GB31K | - | Text
Qwen3-Coder-30B-A3B-Instruct | - | 256K | Text
Qwen3-30B-A3B | - | 128K | Text
Qwen3-14B | - | 128K | Text
Qwen3-32B | - | 128K | Text

At a glance

  • License: Apache 2.0
  • Context length: 128K tokens (32K default)
  • Parameters: 7.61B (28 layers, GQA)
  • Languages: 29+ including English, Chinese, French, Spanish
  • Minimum hardware: ~8 GB VRAM (4-bit quantized), ~16 GB (16-bit weights)
  • Strengths: coding, math, instruction following, structured output
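
The hardware figures above follow from simple arithmetic on the parameter count. Here is a minimal sketch, assuming weight memory is just parameters times bytes per parameter. The helper name `weight_memory_gb` is hypothetical, and the estimate ignores the KV cache, activations, and quantization metadata, all of which add to the real footprint.

```python
PARAMS = 7.61e9  # parameter count quoted in the list above


def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Return the approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for label, bits in [("16-bit (BF16/FP16)", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label}: ~{weight_memory_gb(PARAMS, bits):.1f} GB of weights")
```

Under these assumptions the weights come to about 15.2 GB at 16 bits and about 3.8 GB at 4 bits, which is why the practical VRAM floors sit somewhat above those numbers.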

Overview

Qwen2.5-7B-Instruct is the instruction-tuned 7-billion-parameter model in Alibaba Cloud's Qwen2.5 series, released in September 2024. It has 7.61B parameters across 28 transformer layers and uses RoPE, SwiGLU, RMSNorm, and grouped-query attention (28 query heads, 4 key/value heads). The Qwen2.5 family spans 0.5B to 72B, and this 7B variant is the mid-size workhorse meant for general assistants that run on a single consumer GPU. It was post-trained on top of the Qwen2.5-7B base, which Qwen trained on roughly 18 trillion tokens.
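
Grouped-query attention matters mostly for memory, because only the key/value heads are cached during generation. The sketch below estimates KV-cache size from the layer and head counts given above. The head dimension of 128 and the 2-byte (BF16/FP16) cache entries are assumptions not stated on this page, and `kv_cache_bytes` is a hypothetical helper.

```python
LAYERS = 28          # transformer layers (from the overview)
QUERY_HEADS = 28     # query heads (from the overview)
KV_HEADS = 4         # key/value heads (from the overview)
HEAD_DIM = 128       # assumption: per-head dimension, not stated on this page
BYTES_PER_VALUE = 2  # assumption: cache stored in BF16/FP16


def kv_cache_bytes(tokens: int, kv_heads: int = KV_HEADS) -> int:
    """Bytes needed to cache keys and values for `tokens` positions of one sequence."""
    # the leading 2 accounts for one key vector plus one value vector per head
    per_token = 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return tokens * per_token


if __name__ == "__main__":
    for tokens in (32_768, 131_072):
        gqa = kv_cache_bytes(tokens) / 2**30
        mha = kv_cache_bytes(tokens, kv_heads=QUERY_HEADS) / 2**30
        print(f"{tokens} tokens: {gqa:.2f} GiB with 4 KV heads, {mha:.2f} GiB with 28")
```

Under these assumptions the cache is 7 times smaller than it would be with one key/value head per query head: 1.75 GiB at 32,768 tokens and 7 GiB at 131,072 tokens for a single sequence.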

What it's good at

Compared with Qwen2, this release adds noticeably more knowledge and stronger coding and mathematics, drawing on Qwen's domain-specialist expert models. It follows instructions more reliably, generates long outputs of up to 8,192 tokens, reads structured data such as tables, and produces clean JSON. Multilingual coverage extends past 29 languages, including English, Chinese, French, Spanish, Portuguese, German, Russian, Japanese, Korean, and Arabic. The model is also more resilient to varied system prompts, which helps with role-play and chatbot conditioning.
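
Even with a model tuned for structured output, it is worth validating replies before passing them downstream. The following is a minimal sketch, assuming the reply is a single JSON object that may or may not be wrapped in a Markdown code fence. `parse_json_reply` and the sample reply are hypothetical illustrations, not part of any Qwen API.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker
FENCED_BLOCK = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n{FENCE}\s*$", re.DOTALL)


def parse_json_reply(reply: str, required_keys: set[str]) -> dict:
    """Parse a model reply as a JSON object and check that required keys are present."""
    text = reply.strip()
    match = FENCED_BLOCK.match(text)
    if match:  # unwrap a fenced block if the model added one
        text = match.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


if __name__ == "__main__":
    sample = FENCE + 'json\n{"name": "example", "score": 3}\n' + FENCE  # made-up reply
    print(parse_json_reply(sample, {"name", "score"}))
```

Raising on malformed or incomplete replies keeps the failure visible, so the caller can retry the request or fall back instead of silently using bad data.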

Running locally

In 16-bit precision (BF16 or FP16) the weights alone take about 15 GB, so a 16 GB GPU can load the model but leaves little headroom for the KV cache at long context. A 4-bit GGUF or AWQ quant brings the requirement down to roughly 8 GB, and quantized GGUF builds run on CPU or Apple Silicon through llama.cpp and Ollama. For serving, vLLM is the recommended high-throughput option. Context defaults to 32,768 tokens in the shipped config; reaching the full 131,072-token window requires enabling YaRN rope scaling, and Qwen notes static YaRN can slightly hurt shorter prompts, so enable it only when you actually process long inputs.
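
Enabling YaRN usually means adding a `rope_scaling` entry to the model's `config.json`. The sketch below applies that change to a config dictionary. The values used (a factor of 4.0 over an original 32,768-position window, giving 131,072) match the numbers in this section, but the exact key names are an assumption here and should be checked against the Qwen model card and your framework version. `enable_yarn` is a hypothetical helper.

```python
import json


def enable_yarn(config: dict, factor: float = 4.0, original_max: int = 32_768) -> dict:
    """Return a copy of a model config with static YaRN rope scaling added."""
    patched = dict(config)  # shallow copy so the caller's dict is left untouched
    patched["rope_scaling"] = {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max,
    }
    return patched


if __name__ == "__main__":
    base = {"max_position_embeddings": 32_768}  # stand-in for the relevant part of config.json
    patched = enable_yarn(base)
    scaling = patched["rope_scaling"]
    print(json.dumps(patched, indent=2))
    print("effective window:", int(scaling["factor"] * scaling["original_max_position_embeddings"]))
```

Keeping the change in a separate copy of the config makes it easy to serve the default 32,768-token setup for short prompts and switch to the patched one only for long-input workloads.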

License

Qwen2.5-7B-Instruct is distributed under Apache 2.0. That permits commercial use, modification, and redistribution without royalties, and only asks that you preserve the license and attribution notices. The weights are openly available on Hugging Face, so you can self-host the model with no API key.

Frequently asked questions

What is Qwen2.5-7B-Instruct?
Qwen2.5-7B-Instruct is an instruction-tuned large language model with 7.61 billion parameters, built by Alibaba Cloud's Qwen team and released in September 2024. It is post-trained for chat, instruction following, coding, and mathematics, and supports a context window of up to 128K tokens.

What hardware do I need to run Qwen2.5-7B-Instruct?
In 16-bit precision the 7.61B model needs roughly 16 GB of VRAM, so a single 16 GB GPU is the practical minimum. With 4-bit quantization (GGUF or AWQ) it runs comfortably on an 8 GB GPU, and quantized GGUF builds can also run on CPU or Apple Silicon through llama.cpp and Ollama.

Is Qwen2.5-7B-Instruct free for commercial use?
Yes. Qwen2.5-7B-Instruct is released under the Apache 2.0 license, which permits free use, modification, redistribution, and commercial deployment with no royalty. The weights are openly downloadable from Hugging Face, so you can self-host it without an API.

What is the context length of Qwen2.5-7B-Instruct?
Qwen2.5-7B-Instruct supports a maximum context length of 131,072 tokens (128K) and can generate up to 8,192 tokens in a single response. The shipped config defaults to 32,768 tokens; handling the full 128K window requires enabling YaRN rope scaling, which frameworks like vLLM support.

Which languages does Qwen2.5-7B-Instruct support?
The model supports more than 29 languages, including English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic. Qwen2.5 also improved structured-data understanding and JSON output, which carries across these languages.