Overview
Qwen2.5-Coder-32B-Instruct is the largest model in Alibaba's Qwen2.5-Coder series, a line of code-specialized LLMs formerly known as CodeQwen. The series spans six sizes from 0.5B to 32B; this instruction-tuned model, at 32.5B parameters, is the flagship. It is built on the Qwen2.5 base and trained on 5.5 trillion tokens of source code, text-code grounding data, and synthetic data. The architecture is a 64-layer causal transformer with RoPE, SwiGLU, RMSNorm, and grouped-query attention (GQA) using 40 query heads and 8 key/value heads.
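The parameter count and attention layout above translate into a rough memory budget. The sketch below is a back-of-the-envelope estimate, not a measurement: it counts only weights and KV cache, ignores activations and quantization metadata, and assumes a per-head dimension of 128, which the text does not state. The function names are illustrative.

```python
# Figures stated in the text.
PARAMS = 32.5e9   # total parameters
LAYERS = 64
Q_HEADS = 40
KV_HEADS = 8

# Assumption: per-head dimension, not given in the text.
HEAD_DIM = 128


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes, ignoring quantization metadata."""
    return params * bits_per_weight / 8 / 1e9


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """KV cache bytes per token: one key and one value vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


gqa = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
mha = kv_cache_bytes_per_token(LAYERS, Q_HEADS, HEAD_DIM)  # hypothetical: no KV sharing

print(f"BF16 weights:  {weight_gb(PARAMS, 16):.2f} GB")
print(f"4-bit weights: {weight_gb(PARAMS, 4):.2f} GB")
print(f"KV cache per token (8 KV heads):  {gqa / 1024:.0f} KiB")
print(f"KV cache per token (40 KV heads): {mha / 1024:.0f} KiB")
```

Under these assumptions, sharing 8 key/value heads across 40 query heads stores one fifth of the cache that 40 separate key/value heads would need.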
What it is good at
At release Qwen presented it as the strongest open-source code model, with coding ability comparable to GPT-4o. By Qwen's reported results it leads open models on EvalPlus, LiveCodeBench, and BigCodeBench, scores 73.7 on the Aider code-repair benchmark, and reaches 65.9 on McEval, which covers more than 40 programming languages. Beyond raw generation it handles code reasoning and code fixing, and it keeps the math and general reasoning strengths of the Qwen2.5 base, which makes it a practical backbone for code agents.
Running locally
At 4-bit quantization the dense 32B model fits on a single 24 GB GPU such as an RTX 3090 or 4090. Full BF16 inference needs roughly 65 GB for the weights alone (32.5B parameters at 2 bytes each), which usually means two cards. More than a hundred community quantizations exist in GGUF, AWQ, and GPTQ formats, and the GGUF builds run on Apple Silicon Macs with 32 GB or more of unified memory through llama.cpp or Ollama. For long-context work past 32K tokens, enable YaRN rope scaling in config.json. Qwen recommends turning it on only when needed, because static YaRN applies the same scaling factor regardless of input length and can reduce quality on short inputs. For high-throughput serving, Qwen recommends vLLM.
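As a minimal sketch of the YaRN toggle described above, the helpers below add or remove a rope_scaling entry in a loaded config.json. The key names and values (type "yarn", factor 4.0, original_max_position_embeddings 32768) are assumptions based on Qwen's published instructions and should be checked against the current model card. The function names with_yarn and patch_config_file are hypothetical.

```python
import copy
import json
from pathlib import Path

# Assumed values; verify against the model card before relying on them.
YARN_ROPE_SCALING = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}


def with_yarn(config: dict, enable: bool = True) -> dict:
    """Return a copy of the config with YaRN rope scaling added or removed."""
    updated = copy.deepcopy(config)
    if enable:
        updated["rope_scaling"] = dict(YARN_ROPE_SCALING)
    else:
        # Remove the entry again for short-context workloads.
        updated.pop("rope_scaling", None)
    return updated


def patch_config_file(path: str, enable: bool = True) -> None:
    """Rewrite a config.json on disk with YaRN switched on or off."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    patched = with_yarn(config, enable)
    config_path.write_text(json.dumps(patched, indent=2) + "\n", encoding="utf-8")
```

Keeping the toggle reversible matches the advice to enable YaRN only for long inputs: calling patch_config_file(path, enable=False) restores the short-context configuration.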
License
The model is released under Apache 2.0. That permits commercial use, modification, and redistribution, provided you include a copy of the license, preserve attribution notices, and mark any files you modify as changed. There is no separate acceptable-use addendum, which makes it straightforward to embed in commercial products.


