Qwen2.5-Coder-7B-Instruct

Code

Reasoning

Tools

Run

A 7.6B code-specialized LLM from Alibaba's Qwen2.5-Coder series, tuned for code generation, reasoning, and fixing.

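# Python dependencies: device_map="auto" needs accelerate, and transformers needs a PyTorch install
pip install -U transformers accelerate torch
# Fetch the weights into the local Hugging Face cache
huggingface-cli download Qwen/Qwen2.5-Coder-7B-Instruct
# or run quantized via Ollama:
ollama run qwen2.5-coder:7b-instruct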

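# Assumes an OpenAI-compatible server is already listening on port 8000,
# e.g. one started with: vllm serve Qwen/Qwen2.5-Coder-7B-Instruct
curl http://localhost:8000/v1/chat/completions \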
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "messages": [{"role": "user", "content": "Write a quicksort in Python."}]
  }'

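from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" places weights on available devices
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
# Render the chat template into a prompt string that ends at the start of the assistant turn
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# generate() returns prompt + completion; slice off the prompt so only the reply is printed
reply_ids = out[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(reply_ids, skip_special_tokens=True))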

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "sk-local" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-Coder-7B-Instruct",
  messages: [{ role: "user", content: "Write a quicksort in Python." }],
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen3-8B	Reasoning, coding, agentic chat	128K	Text
Qwen2.5-7B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-32B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-14B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-32B-Instruct	Code generation, fixing, agents	128K	Text
Qwen3-235B-A22B	Reasoning, coding, agentic tasks	128K	Text
Qwen3-30B-A3B-Instruct-2507	General chat, reasoning, agents	256K	Text
Qwen2.5-72B-Instruct	General chat, coding, reasoning	128K	Text
Qwen3-4B-Thinking-2507	Reasoning, math, coding	256K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Apache 2.0
Context length: 128K tokens (with YaRN)
Parameters: 7.61B (28 layers, GQA)
Minimum hardware: ~8 GB VRAM (4-bit)
Strengths: code generation, code reasoning, bug fixing

Overview

Qwen2.5-Coder-7B-Instruct is the instruction-tuned 7.61B-parameter member of Alibaba's Qwen2.5-Coder family, released in September 2024 as the successor to CodeQwen1.5. The series spans six sizes from 0.5B to 32B; the 7B version targets developers who want a capable coding assistant that still fits on a single consumer GPU. It is built on the Qwen2.5 base and uses a Qwen2 causal architecture with RoPE, SwiGLU, RMSNorm, QKV bias, and grouped-query attention (28 query heads, 4 key/value heads) across 28 layers.
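The practical effect of grouped-query attention is a smaller key/value cache, which matters on a single consumer GPU. The sketch below estimates that cache from the layer and head counts given above. The head dimension of 128 and the 2-byte cache entries are assumptions, not figures stated on this page, and `kv_cache_bytes` is a hypothetical helper.

```python
# Rough KV-cache sizing for grouped-query attention (GQA).
# Layer and head counts come from the overview; the two values marked
# "assumption" are not stated on this page.
NUM_LAYERS = 28
NUM_QUERY_HEADS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128        # assumption: per-head dimension
BYTES_PER_VALUE = 2   # assumption: BF16 cache entries


def kv_cache_bytes(num_tokens: int, kv_heads: int = NUM_KV_HEADS) -> int:
    """Bytes needed to cache keys and values for one sequence of num_tokens."""
    # Factor 2 covers keys plus values, stored for every layer and KV head.
    per_token = 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


gqa = kv_cache_bytes(32_768)
mha = kv_cache_bytes(32_768, kv_heads=NUM_QUERY_HEADS)
print(f"GQA: {gqa / 2**30:.2f} GiB, full multi-head: {mha / 2**30:.2f} GiB")
# prints "GQA: 1.75 GiB, full multi-head: 12.25 GiB"
```

Under these assumptions, sharing 4 key/value heads across 28 query heads cuts the cache to one seventh of what one key/value head per query head would need.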

What it's good at

The model is specialized for code generation, code reasoning, and bug fixing. The Qwen2.5-Coder series was trained on 5.5 trillion tokens of source code, text-code grounding, and synthetic data, and supports a wide range of programming languages. While the 32B flagship is the one the Qwen team compares to GPT-4o on coding, the 7B variant keeps strong code completion and editing quality while staying fast enough for IDE integration and local agents. It also retains general math and reasoning ability inherited from Qwen2.5.

Running locally

At BF16 the model needs about 16 GB of VRAM; 4-bit GGUF or GPTQ quantizations bring that down to roughly 8 GB, so it runs on cards like an RTX 3060 or on Apple Silicon. You can load it through Hugging Face transformers (4.37 or newer), serve it with vLLM, or run quantized builds via Ollama and llama.cpp. The default config sets context to 32,768 tokens; enabling YaRN rope scaling extends it to the full 131,072-token (128K) window, which the team recommends only when long inputs are actually needed.
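Going from 32,768 to 131,072 tokens is a scaling factor of 4. A minimal sketch of the config change follows, assuming the `rope_scaling` key names from Qwen's published model card (`type`, `factor`, `original_max_position_embeddings`). Other library versions may expect different key names, so check the documentation for the runtime you use. `with_yarn` is a hypothetical helper, and the example dict is a stand-in for the real config.json.

```python
import json

DEFAULT_CONTEXT = 32_768   # default window stated above
TARGET_CONTEXT = 131_072   # full 128K window stated above


def with_yarn(config: dict, target: int = TARGET_CONTEXT,
              original: int = DEFAULT_CONTEXT) -> dict:
    """Return a copy of a config dict with a YaRN rope_scaling entry added."""
    updated = dict(config)  # shallow copy so the caller's dict is untouched
    updated["rope_scaling"] = {
        "type": "yarn",                                # assumed key name
        "factor": target / original,                   # 131072 / 32768 = 4.0
        "original_max_position_embeddings": original,  # assumed key name
    }
    return updated


# Stand-in for the model's real config.json, which has many more fields.
example = {"max_position_embeddings": DEFAULT_CONTEXT}
print(json.dumps(with_yarn(example), indent=2))
```

Keeping the change in a copy makes it easy to maintain a default config for short inputs and a YaRN-enabled one for long inputs.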

License

Qwen2.5-Coder-7B-Instruct is released under the Apache 2.0 license. That permits commercial use, modification, and redistribution, requiring only that you preserve the license and attribution notices. The open weights are downloadable from Hugging Face for self-hosting.

Frequently asked questions

What is Qwen2.5-Coder-7B-Instruct?

Qwen2.5-Coder-7B-Instruct is a 7.61B-parameter, code-specialized large language model from Alibaba's Qwen team, released in September 2024. It is the instruction-tuned variant of Qwen2.5-Coder-7B, fine-tuned for code generation, code reasoning, and bug fixing. The Qwen2.5-Coder series was trained on 5.5 trillion tokens including source code and text-code grounding data.

How much VRAM does Qwen2.5-Coder-7B-Instruct need?

At full BF16 precision the 7.61B-parameter model needs roughly 16 GB of VRAM. With 4-bit quantization (GGUF or GPTQ) it fits comfortably in about 8 GB, making it runnable on a single consumer GPU such as an RTX 3060/4060 or even on Apple Silicon via llama.cpp. Quantized GGUF builds are available for use with Ollama and llama.cpp.
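These figures can be sanity-checked with simple arithmetic on the parameter count. The sketch below estimates the size of the weights alone. It assumes exactly 16, 8, or 4 bits per parameter, which real quantization formats only approximate because they also store scaling metadata. `weight_gb` is a hypothetical helper.

```python
PARAMS = 7.61e9  # parameter count stated above


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gb(bits):.1f} GB")
```

The weights come to about 15.2 GB at BF16 and about 3.8 GB at 4 bits. The remaining room within the quoted 16 GB and 8 GB budgets goes to the KV cache, activations, and runtime overhead.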

Is Qwen2.5-Coder-7B-Instruct free for commercial use?

Yes. Qwen2.5-Coder-7B-Instruct is released under the Apache 2.0 license, which permits free commercial use, modification, and redistribution with attribution. The weights are openly available on Hugging Face, so you can download and self-host the model at no cost.

What is the context length of Qwen2.5-Coder-7B-Instruct?

The model supports a context window of up to 131,072 tokens (128K). The default config.json is set to 32,768 tokens; to handle inputs beyond that, you enable YaRN rope scaling in the config. For long-context deployment the Qwen team recommends vLLM, noting that static YaRN can slightly affect performance on shorter inputs.

How does it compare to Qwen2.5-Coder-32B-Instruct?

Qwen2.5-Coder-32B-Instruct is the flagship of the series and scores higher on coding benchmarks, with the Qwen team describing its coding ability as matching GPT-4o. The 7B model trades some accuracy for far lower hardware demands: it runs on a single consumer GPU and is much faster, which makes it a practical choice for local coding assistants and IDE integration when a 32B model is too heavy.