Qwen2.5-Coder-32B-Instruct

Updated: 27.06.2026
Tags: Code, Reasoning, Tools, Multilingual

A 32.5B-parameter, code-specialized LLM from Alibaba's Qwen2.5-Coder series, with state-of-the-art coding ability among open models at release and a 128K-token context window.

Install and download:

pip install -U transformers
huggingface-cli download Qwen/Qwen2.5-Coder-32B-Instruct
# or run quantized locally:
ollama run qwen2.5-coder:32b

curl (against an OpenAI-compatible server on localhost:8000):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "messages": [{"role": "user", "content": "Refactor this function for readability."}]
  }'

Python (transformers):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-32B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

JavaScript (openai client, same local server):

import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "not-needed" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-Coder-32B-Instruct",
  messages: [{ role: "user", content: "Write a binary search in TypeScript." }],
});
console.log(res.choices[0].message.content);

More models

Name                          Size / Usage                      Context  Input
Qwen3.6-35B-A3B               -                                 256K     Text, Image
Qwen3.6-27B                   -                                 256K     Text, Image
Qwen3-8B                      Reasoning, coding, agentic chat   128K     Text
Qwen2.5-7B-Instruct           General chat, coding, reasoning   128K     Text
Qwen2.5-32B-Instruct          General chat, coding, reasoning   128K     Text
Qwen2.5-14B-Instruct          General chat, coding, reasoning   128K     Text
Qwen2.5-Coder-7B-Instruct     Code generation, code reasoning   128K     Text
Qwen3-235B-A22B               Reasoning, coding, agentic tasks  128K     Text
Qwen3-30B-A3B-Instruct-2507   General chat, reasoning, agents   256K     Text
Qwen2.5-72B-Instruct          General chat, coding, reasoning   128K     Text
Qwen3-4B-Thinking-2507        Reasoning, math, coding           256K     Text
WebWorld-8B                   Web agents, multimodal reasoning  40K      Text, Image
MiniCPM-V 4.6                 5213 GB                           421K     Text, Image
anima                         421 GB                            31K      Text
Qwen3-Coder-30B-A3B-Instruct  -                                 256K     Text
Qwen3-30B-A3B                 -                                 128K     Text
Qwen3-14B                     -                                 128K     Text
Qwen3-32B                     -                                 128K     Text

At a glance

  • License: Apache 2.0
  • Parameters: 32.5B (dense)
  • Context length: 128K tokens (131,072)
  • Training data: 5.5T tokens of code, text-code, and synthetic data
  • Minimum hardware: ~24 GB VRAM at 4-bit
  • Strengths: code generation, code fixing, code reasoning, 40+ programming languages

Overview

Qwen2.5-Coder-32B-Instruct is the largest model in Alibaba's Qwen2.5-Coder series, a line of code-specialized LLMs formerly known as CodeQwen. The series spans six sizes from 0.5B to 32B; this 32.5B instruction-tuned model is the flagship. It is built on the Qwen2.5 base and trained on 5.5 trillion tokens of source code, text-code grounding data, and synthetic data. The architecture is a 64-layer causal transformer with RoPE, SwiGLU, RMSNorm, and grouped-query attention (GQA) using 40 query heads and 8 key/value heads.
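
The layer and head counts in the overview largely determine how much memory the key/value cache takes at long context. The sketch below is a rough estimate rather than a measurement: the head dimension of 128 is an assumption not stated on this page, and BF16 cache storage (2 bytes per value) is assumed as well. It compares the 8 key/value heads used by GQA with a hypothetical layout that keeps one key/value head per query head.

```python
# Rough KV-cache sizing for grouped-query attention (GQA).
# Layer and head counts come from the overview; HEAD_DIM is an assumption
# (not stated on this page) and the cache is assumed to be stored in BF16.
NUM_LAYERS = 64
NUM_QUERY_HEADS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128          # assumed
BYTES_PER_VALUE = 2     # BF16


def kv_cache_bytes(num_tokens: int, kv_heads: int = NUM_KV_HEADS) -> int:
    """Bytes needed to cache keys and values for one sequence of num_tokens."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * num_tokens


def gib(num_bytes: int) -> float:
    """Convert bytes to binary gigabytes (GiB)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS
    print(f"query heads sharing each KV head: {group_size}")
    for tokens in (32_768, 131_072):
        gqa = gib(kv_cache_bytes(tokens))
        full = gib(kv_cache_bytes(tokens, kv_heads=NUM_QUERY_HEADS))
        print(f"{tokens:>7} tokens: {gqa:.1f} GiB with 8 KV heads, "
              f"{full:.1f} GiB with 40 KV heads")
```

Under these assumptions the cache is five times smaller than it would be with one key/value head per query head, which matters most near the 128K end of the context window.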

What it is good at

At release, Qwen reported it as the strongest open-source code model, with coding ability comparable to GPT-4o. It leads open models on EvalPlus, LiveCodeBench, and BigCodeBench, scores 73.7 on the Aider code-repair benchmark, and reaches 65.9 on McEval across more than 40 programming languages. Beyond raw generation it handles code reasoning and code fixing, and it keeps the math and general reasoning strengths of the Qwen2.5 base, which makes it a practical backbone for code agents.

Running locally

The dense 32B model runs on a single 24 GB GPU such as an RTX 3090 or 4090 at 4-bit quantization. Full BF16 inference needs roughly 65 GB and usually two cards. There are over a hundred community quantizations in GGUF, AWQ, and GPTQ formats, so it also runs on Apple Silicon Macs with 32 GB or more of unified memory through llama.cpp or Ollama. For long-context work past 32K tokens you enable YaRN rope scaling in config.json; Qwen recommends turning it on only when needed, since static YaRN can reduce quality on short inputs. vLLM is the recommended high-throughput serving framework.
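
The memory figures above follow from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate only: it counts weights and nothing else, so KV cache, activations, and runtime overhead come on top, and real quantized files differ by format because some tensors are kept at higher precision. The helper name `weight_gb` is made up for this example.

```python
# Back-of-the-envelope weight memory for a dense model at different precisions.
# Weights only: KV cache, activations, and runtime overhead are not included.
PARAMS = 32.5e9  # parameter count stated on this page


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for label, bits in (("BF16", 16), ("8-bit", 8), ("4-bit", 4)):
        print(f"{label:>5}: ~{weight_gb(bits):.1f} GB of weights")
```

BF16 works out to about 65 GB, matching the figure above, while 4-bit weights come to roughly 16 GB, which leaves some headroom on a 24 GB card for the cache and runtime.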

License

The model is released under Apache 2.0. That permits commercial use, modification, and redistribution, and only requires preserving the license and attribution notices. There is no separate acceptable-use addendum, which makes it straightforward to embed in commercial products.

Frequently asked questions

What is Qwen2.5-Coder-32B-Instruct?
Qwen2.5-Coder-32B-Instruct is a 32.5B-parameter, code-specialized large language model from Alibaba's Qwen team. It is the instruction-tuned variant of Qwen2.5-Coder-32B, trained on 5.5 trillion tokens of source code, text-code grounding data, and synthetic data. It targets code generation, code reasoning, and code fixing, and supports a 128K-token context window.

Is Qwen2.5-Coder-32B-Instruct free to use?
Yes. Qwen2.5-Coder-32B-Instruct is released under the Apache 2.0 license, which permits free commercial and private use, modification, and redistribution with attribution. The weights are published on Hugging Face and can be downloaded and run locally at no cost.

What hardware do I need to run it?
At 4-bit quantization the 32B model fits in roughly 24 GB of VRAM, so a single 24 GB GPU such as an RTX 3090 or 4090 can run it. Running in full BF16 precision needs about 65 GB and typically two high-memory GPUs. The model also runs on Apple Silicon Macs with 32 GB or more of unified memory using llama.cpp or Ollama with quantized GGUF builds.

How good is it at coding?
The Qwen team reported it as the strongest open-source code model at release, with coding ability comparable to GPT-4o. It leads open models on EvalPlus, LiveCodeBench, and BigCodeBench, scores 73.7 on the Aider code-repair benchmark, and reaches 65.9 on McEval across more than 40 programming languages.

What is the context length?
The model supports a full context length of 131,072 tokens (128K). The default config.json caps context at 32,768 tokens; to process longer inputs you enable YaRN rope scaling in the config. Qwen recommends doing so only when long-context handling is actually needed, since static YaRN can affect performance on shorter inputs.
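
Enabling YaRN comes down to adding a `rope_scaling` entry to config.json. The sketch below shows one way to do that programmatically. The key names and values follow the snippet in Qwen's published model card as best recalled, so treat them as an assumption and check the current card and your serving framework's documentation, since some frameworks expect different keys. The helper names `with_yarn` and `patch_config_file` are made up for this example.

```python
import json
from pathlib import Path

NATIVE_CONTEXT = 32_768  # default cap in config.json, per this page


def with_yarn(config: dict, target_context: int = 131_072) -> dict:
    """Return a copy of a config dict with YaRN rope scaling enabled.

    The rope_scaling keys are assumed from Qwen's model card; verify them
    before relying on this.
    """
    updated = dict(config)
    updated["rope_scaling"] = {
        "type": "yarn",
        # Scaling factor: target window divided by the native window.
        "factor": target_context / NATIVE_CONTEXT,
        "original_max_position_embeddings": NATIVE_CONTEXT,
    }
    return updated


def patch_config_file(path: str) -> None:
    """Rewrite a local config.json in place with YaRN enabled."""
    config_path = Path(path)
    config = json.loads(config_path.read_text())
    config_path.write_text(json.dumps(with_yarn(config), indent=2) + "\n")


if __name__ == "__main__":
    example = {"max_position_embeddings": NATIVE_CONTEXT}
    print(json.dumps(with_yarn(example), indent=2))
```

Keeping an unpatched copy of config.json for short-input workloads is a simple way to follow the advice to turn YaRN on only when needed.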