Qwen2.5-7B-Instruct

A 7.61B instruction-tuned LLM from Alibaba's Qwen2.5 series with strong coding, math, and multilingual ability across 29+ languages.

pip install -U transformers accelerate torch
huggingface-cli download Qwen/Qwen2.5-7B-Instruct
# or run quantized via Ollama:
ollama run qwen2.5:7b-instruct

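# Assumes an OpenAI-compatible server (for example, vllm serve Qwen/Qwen2.5-7B-Instruct) is already listening on port 8000
curl http://localhost:8000/v1/chat/completions \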
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

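from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
messages = [{"role": "user", "content": "Give me a short intro to LLMs."}]
# Render the chat template as a string and append the assistant turn marker
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# generate() returns prompt + completion; slice off the prompt before decoding
reply_ids = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(reply_ids, skip_special_tokens=True))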

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "sk-local" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-7B-Instruct",
  messages: [{ role: "user", content: "Hello!" }]
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen3-8B	Reasoning, coding, agentic chat	128K	Text
Qwen2.5-32B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-14B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-7B-Instruct	Code generation, code reasoning	128K	Text
Qwen2.5-Coder-32B-Instruct	Code generation, fixing, agents	128K	Text
Qwen3-235B-A22B	Reasoning, coding, agentic tasks	128K	Text
Qwen3-30B-A3B-Instruct-2507	General chat, reasoning, agents	256K	Text
Qwen2.5-72B-Instruct	General chat, coding, reasoning	128K	Text
Qwen3-4B-Thinking-2507	Reasoning, math, coding	256K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Apache 2.0
Context length: 128K tokens (32K default)
Parameters: 7.61B (28 layers, GQA)
Languages: 29+ including English, Chinese, French, Spanish
Minimum hardware: ~8 GB VRAM (4-bit), ~16 GB (16-bit BF16/FP16)
Strengths: coding, math, instruction following, structured output
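
The hardware figures follow from simple arithmetic on the parameter count. The sketch below is a rough estimate only: it assumes weights dominate memory and ignores the KV cache, activations, and quantization metadata. The function name and the list of precisions are illustrative, not taken from the model card.

```python
PARAMS = 7.61e9  # parameter count stated in the list above

def weight_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal GB at a given precision."""
    return params * bits_per_param / 8 / 1e9

for label, bits in [("FP32", 32), ("BF16/FP16", 16), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_gb(PARAMS, bits):.1f} GB")
```

The ~16 GB figure corresponds to 16-bit weights (about 15.2 GB). Real 4-bit files are larger than the raw 3.8 GB estimate because quantization formats store scales and usually keep some tensors at higher precision, and inference needs extra room for the KV cache.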

Overview

Qwen2.5-7B-Instruct is the instruction-tuned 7-billion-parameter model in Alibaba Cloud's Qwen2.5 series, released in September 2024. It has 7.61B parameters across 28 transformer layers and uses RoPE, SwiGLU, RMSNorm, and grouped-query attention (28 query heads, 4 key/value heads). The Qwen2.5 family spans 0.5B to 72B, and this 7B variant is the mid-size workhorse meant for general assistants that run on a single consumer GPU. It was post-trained on top of the Qwen2.5-7B base, which Qwen trained on roughly 18 trillion tokens.
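
Grouped-query attention matters mostly for memory: only the 4 key/value heads are cached during generation, not all 28 query heads. The back-of-the-envelope sketch below assumes a head dimension of 128 (hidden size 3584 divided by 28 heads, a value from the published config rather than from this page) and a 16-bit cache; both are labeled as assumptions in the code.

```python
LAYERS, Q_HEADS, KV_HEADS = 28, 28, 4   # from the overview above
HEAD_DIM = 128        # assumption: hidden size 3584 / 28 query heads
BYTES_PER_VALUE = 2   # assumption: BF16/FP16 cache entries

def kv_cache_bytes(tokens: int, kv_heads: int = KV_HEADS) -> int:
    """Bytes for keys plus values across all layers for `tokens` positions."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * tokens

per_token = kv_cache_bytes(1)
print(per_token // 1024, "KiB per token")                    # 56
print(kv_cache_bytes(32_768) / 2**30, "GiB at 32K tokens")    # 1.75
print(kv_cache_bytes(131_072) / 2**30, "GiB at 128K tokens")  # 7.0
print(kv_cache_bytes(1, kv_heads=Q_HEADS) // per_token, "x larger without GQA")  # 7
```

Under these assumptions, caching all 28 heads would take seven times as much memory per token, which is a large part of why the 7B model stays practical on a single consumer GPU.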

What it's good at

Compared with Qwen2, this release adds noticeably more knowledge and stronger coding and mathematics, drawing on Qwen's domain-specialist expert models. It follows instructions more reliably, generates long outputs beyond 8K tokens, reads structured data such as tables, and produces clean JSON. Multilingual coverage extends past 29 languages, including English, Chinese, French, Spanish, Portuguese, German, Russian, Japanese, Korean, and Arabic. The model is also more resilient to varied system prompts, which helps with role-play and chatbot conditioning.
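
Even with a model that produces clean JSON, it is worth validating replies before passing them downstream. The following is a minimal sketch, assuming the reply arrives as a string that may be wrapped in a Markdown code fence. `parse_json_reply` and the sample reply are hypothetical and not part of any Qwen tooling.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this listing readable
# Matches an optional fenced wrapper (with or without a "json" tag) around the payload.
_WRAPPED = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)

def parse_json_reply(reply: str, required_keys: set[str]) -> dict:
    """Parse a model reply as a JSON object and check that expected keys exist."""
    text = reply.strip()
    match = _WRAPPED.match(text)
    if match:
        text = match.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Hypothetical reply, as a model might return it when asked for JSON only.
sample = f'{FENCE}json\n{{"language": "fr", "sentiment": "positive"}}\n{FENCE}'
print(parse_json_reply(sample, {"language", "sentiment"}))
```

Failing loudly on malformed or incomplete output makes it straightforward to retry the request rather than propagate a bad record.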

Running locally

At 16-bit precision (BF16 or FP16) the weights need about 16 GB of VRAM, so a 16 GB GPU can hold the model, with limited headroom left for long prompts. A 4-bit GGUF or AWQ quant brings the requirement down to roughly 8 GB, and quantized GGUF builds also run on CPU or Apple Silicon through llama.cpp and Ollama. For serving, vLLM is the recommended high-throughput option. The shipped config defaults to a 32,768-token context; reaching the full 131,072-token window requires enabling YaRN rope scaling. Qwen notes that static YaRN can slightly degrade results on shorter prompts, so enable it only when you actually process long inputs.
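
Enabling the long window comes down to adding a `rope_scaling` entry to the model's `config.json`. The sketch below applies the values Qwen documents for this model (factor 4.0 over the original 32,768 positions) to a config dictionary. The helper name `enable_yarn` is hypothetical, and how a given serving framework picks up the setting should be checked against that framework's documentation.

```python
import json

def enable_yarn(config: dict, factor: float = 4.0, original_max: int = 32_768) -> dict:
    """Return a copy of a model config with static YaRN rope scaling added."""
    patched = dict(config)
    patched["rope_scaling"] = {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max,
    }
    return patched

# Stand-in for the relevant part of config.json; the real file has many more keys.
config = {"max_position_embeddings": 32_768}
patched = enable_yarn(config)
window = int(patched["rope_scaling"]["original_max_position_embeddings"]
             * patched["rope_scaling"]["factor"])

print(json.dumps(patched, indent=2))
print("effective window:", window)  # 131072
```

Because the scaling is static, it applies to every request regardless of length, which is why it is best kept in a separate config used only for long-input workloads.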

License

Qwen2.5-7B-Instruct is distributed under Apache 2.0. That permits commercial use, modification, and redistribution without royalties, and only asks that you preserve the license and attribution notices. The weights are openly available on Hugging Face, so you can self-host the model with no API key.

Frequently asked questions

What is Qwen2.5-7B-Instruct?

Qwen2.5-7B-Instruct is an instruction-tuned large language model with 7.61 billion parameters, built by Alibaba Cloud's Qwen team and released in September 2024. It is post-trained for chat, instruction following, coding, and mathematics, and supports a context window of up to 128K tokens.

What hardware do I need to run Qwen2.5-7B-Instruct?

At 16-bit precision (BF16 or FP16) the 7.61B model needs roughly 16 GB of VRAM, so a single 16 GB GPU is enough to hold it. With 4-bit quantization (GGUF or AWQ) it runs comfortably on an 8 GB GPU, and quantized GGUF builds can also run on CPU or Apple Silicon through llama.cpp and Ollama.

Is Qwen2.5-7B-Instruct free for commercial use?

Yes. Qwen2.5-7B-Instruct is released under the Apache 2.0 license, which permits free use, modification, redistribution, and commercial deployment with no royalty. The weights are openly downloadable from Hugging Face, so you can self-host it without an API key.

What is the context length of Qwen2.5-7B-Instruct?

Qwen2.5-7B-Instruct supports a maximum context length of 131,072 tokens (128K) and can generate up to 8,192 tokens in a single response. The shipped config defaults to 32,768 tokens; handling the full 128K window requires enabling YaRN rope scaling, which frameworks like vLLM support.
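
These limits translate into a simple pre-flight check before each request. The sketch below assumes you already know the prompt's token count (for example, from the tokenizer); `plan_request` and its return shape are hypothetical.

```python
DEFAULT_WINDOW = 32_768   # shipped config
YARN_WINDOW = 131_072     # with YaRN rope scaling enabled
MAX_GENERATION = 8_192    # per-response generation limit stated above

def plan_request(prompt_tokens: int, max_new_tokens: int) -> dict:
    """Clamp the generation budget and report whether YaRN is needed."""
    new_tokens = min(max_new_tokens, MAX_GENERATION)
    total = prompt_tokens + new_tokens
    if total > YARN_WINDOW:
        raise ValueError(f"{total} tokens exceeds the {YARN_WINDOW}-token window")
    return {"max_new_tokens": new_tokens, "needs_yarn": total > DEFAULT_WINDOW}

print(plan_request(2_000, 512))      # fits in the default window
print(plan_request(60_000, 10_000))  # generation clamped to 8,192; needs YaRN
```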

Which languages does Qwen2.5-7B-Instruct support?

The model supports more than 29 languages, including English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic. Qwen2.5 also improved structured-data understanding and JSON output, which carries across these languages.