Qwen2.5-Coder-32B-Instruct

Code

Reasoning

Tools

Multilingual

A 32.5B-parameter, code-specialized LLM from Alibaba's Qwen2.5-Coder series, with state-of-the-art coding ability among open models at release and a 128K-token context window.

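# Install the libraries used by the Python example (device_map="auto" requires accelerate)
pip install -U transformers accelerate torch
# Download the full-precision weights from Hugging Face
huggingface-cli download Qwen/Qwen2.5-Coder-32B-Instruct
# or run quantized locally:
ollama run qwen2.5-coder:32b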

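# Assumes an OpenAI-compatible server (for example vLLM) is already serving the model on localhost:8000
curl http://localhost:8000/v1/chat/completions \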
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "messages": [{"role": "user", "content": "Refactor this function for readability."}]
  }'

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen2.5-Coder-32B-Instruct"
# Load the weights in their stored dtype and place them across the available devices
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Render the conversation with the model's chat template, ending with the assistant prompt
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Generate, then decode only the tokens produced after the prompt
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

// Assumes an OpenAI-compatible server on localhost:8000; the apiKey value is a placeholder
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "not-needed" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-Coder-32B-Instruct",
  messages: [{ role: "user", content: "Write a binary search in TypeScript." }],
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen3-8B	Reasoning, coding, agentic chat	128K	Text
Qwen2.5-7B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-32B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-14B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-7B-Instruct	Code generation, code reasoning	128K	Text
Qwen3-235B-A22B	Reasoning, coding, agentic tasks	128K	Text
Qwen3-30B-A3B-Instruct-2507	General chat, reasoning, agents	256K	Text
Qwen2.5-72B-Instruct	General chat, coding, reasoning	128K	Text
Qwen3-4B-Thinking-2507	Reasoning, math, coding	256K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Apache 2.0
Parameters: 32.5B (dense)
Context length: 128K tokens (131,072)
Training data: 5.5T tokens of code, text-code, and synthetic data
Minimum hardware: ~24 GB VRAM at 4-bit
Strengths: code generation, code fixing, code reasoning, 40+ programming languages

Overview

Qwen2.5-Coder-32B-Instruct is the largest model in Alibaba's Qwen2.5-Coder series, a line of code-specialized LLMs formerly known as CodeQwen. The series spans six sizes from 0.5B to 32B, and this 32.5B-parameter instruction-tuned model is the flagship. It is built on the Qwen2.5 base and trained on 5.5 trillion tokens of source code, text-code grounding data, and synthetic data. The architecture is a 64-layer causal transformer with RoPE, SwiGLU, RMSNorm, and grouped-query attention (GQA) using 40 query heads and 8 key/value heads.
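
To see what the 8 key/value heads mean for memory, the sketch below estimates the KV-cache size for one sequence from the layer and head counts above. The head dimension of 128 and the 2-byte (BF16) cache entries are assumptions not stated on this page, and `kv_cache_bytes` is a hypothetical helper name.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers=64, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a single sequence.

    layers and kv_heads come from the architecture description above;
    head_dim=128 and bytes_per_value=2 (BF16) are assumptions.
    """
    # Each layer stores one key vector and one value vector per KV head per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for tokens in (32_768, 131_072):
    gqa = kv_cache_bytes(tokens) / GIB
    # Hypothetical comparison: one KV head per query head (no grouping).
    ungrouped = kv_cache_bytes(tokens, kv_heads=40) / GIB
    print(f"{tokens:>7} tokens: 8 KV heads {gqa:.0f} GiB, 40 KV heads {ungrouped:.0f} GiB")
```

Under these assumptions the cache costs 256 KiB per token, and sharing each key/value head across five query heads makes it five times smaller than an ungrouped layout would be.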

What it is good at

Qwen reports it as the strongest open-source code model at its release, with coding ability comparable to GPT-4o. In Qwen's evaluations it leads open models on EvalPlus, LiveCodeBench, and BigCodeBench, scores 73.7 on the Aider code-repair benchmark, and reaches 65.9 on McEval, which covers more than 40 programming languages. Beyond raw generation it handles code reasoning and code fixing, and it retains the math and general reasoning strengths of the Qwen2.5 base, which makes it a practical backbone for code agents.

Running locally

At 4-bit quantization the dense 32B model runs on a single 24 GB GPU such as an RTX 3090 or 4090. Full BF16 inference needs roughly 65 GB for the weights alone (32.5B parameters at 2 bytes each), so it usually takes two high-memory cards. More than a hundred community quantizations are available in GGUF, AWQ, and GPTQ formats, and the GGUF builds run on Apple Silicon Macs with 32 GB or more of unified memory through llama.cpp or Ollama. For long-context work past 32K tokens, enable YaRN rope scaling in config.json; Qwen recommends turning it on only when needed, since static YaRN can reduce quality on short inputs. For high-throughput serving, Qwen recommends vLLM.
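
The memory figures above follow from simple arithmetic on the parameter count. The sketch below reproduces them; `weight_gb` is a hypothetical helper, and it counts weight storage only, so KV cache, activations, and runtime overhead come on top.

```python
PARAMS = 32.5e9  # parameter count quoted on this page


def weight_gb(bits_per_param, params=PARAMS):
    """Weight storage in decimal GB, excluding KV cache and runtime overhead."""
    return params * bits_per_param / 8 / 1e9


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_gb(bits):.2f} GB")
```

This gives 65 GB at BF16 and 16.25 GB at an idealized 4 bits per weight. Real 4-bit formats also store scaling metadata, so actual files tend to be somewhat larger, which is part of why 24 GB is a practical floor and not a comfortable margin.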

License

The model is released under Apache 2.0, which permits commercial use, modification, and redistribution. Its main conditions are preserving the license and attribution notices and marking any files you modify. There is no separate acceptable-use addendum, which makes it straightforward to embed in commercial products.

Frequently asked questions

What is Qwen2.5-Coder-32B-Instruct?

Qwen2.5-Coder-32B-Instruct is a 32.5B-parameter, code-specialized large language model from Alibaba's Qwen team. It is the instruction-tuned variant of Qwen2.5-Coder-32B, trained on 5.5 trillion tokens of source code, text-code grounding data, and synthetic data. It targets code generation, code reasoning, and code fixing, and supports a 128K-token context window.

Is Qwen2.5-Coder-32B-Instruct free to use?

Yes. Qwen2.5-Coder-32B-Instruct is released under the Apache 2.0 license, which permits free commercial and private use, modification, and redistribution with attribution. The weights are published on Hugging Face and can be downloaded and run locally at no cost.

What hardware do I need to run it?

At 4-bit quantization the 32B model fits in roughly 24 GB of VRAM, so a single 24 GB GPU such as an RTX 3090 or 4090 can run it. Running in full BF16 precision needs about 65 GB and typically two high-memory GPUs. The model also runs on Apple Silicon Macs with 32 GB or more of unified memory using llama.cpp or Ollama with quantized GGUF builds.

How good is it at coding?

It was the strongest open-source code model at release, with coding ability the Qwen team reports as matching GPT-4o. It leads open models on EvalPlus, LiveCodeBench, and BigCodeBench, scores 73.7 on the Aider code-repair benchmark, and reaches 65.9 on McEval across more than 40 programming languages.

What is the context length?

The model supports a full context length of 131,072 tokens (128K). The default config.json caps context at 32,768 tokens. To process longer inputs, enable YaRN rope scaling in the config; Qwen recommends doing so only when long-context handling is actually needed, since static YaRN can affect performance on shorter inputs.
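
The YaRN setting described above can be sketched as a small patch to a local copy of config.json. The `rope_scaling` values below follow Qwen's model card as of this writing and should be checked against the current card. `enable_yarn` and `needs_yarn` are hypothetical helper names, not part of any library.

```python
import json
from pathlib import Path

# Values as given in Qwen's model card (an assumption to verify):
# factor 4.0 x 32,768 native positions = 131,072 tokens.
YARN_ROPE_SCALING = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def needs_yarn(prompt_tokens, max_new_tokens, native_limit=32768):
    """True when prompt plus generation would exceed the default 32,768-token cap."""
    return prompt_tokens + max_new_tokens > native_limit


def enable_yarn(config_path):
    """Add the YaRN rope_scaling entry to a local config.json and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["rope_scaling"] = dict(YARN_ROPE_SCALING)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Checking `needs_yarn` first keeps short-input workloads on the default config, in line with Qwen's advice to enable static YaRN only when it is needed.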