Qwen2.5-32B-Instruct

A 32.5B instruction-tuned LLM from Alibaba's Qwen2.5 series with strong coding, math, and 29+ language support.

pip install -U transformers
huggingface-cli download Qwen/Qwen2.5-32B-Instruct
# or via Ollama
ollama run qwen2.5:32b-instruct

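# Assumes an OpenAI-compatible server (for example, vLLM) is already serving the model on localhost:8000
curl http://localhost:8000/v1/chat/completions \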
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-32B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

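from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-32B-Instruct"
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" spreads layers across available devices
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [{"role": "user", "content": "Give me a short intro to LLMs."}]
# Render the conversation with the model's chat template and append the assistant prompt
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated reply is decoded
reply_ids = out[:, inputs.input_ids.shape[1]:]
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])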

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "sk-noauth" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-32B-Instruct",
  messages: [{ role: "user", content: "Hello!" }]
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen3-8B	Reasoning, coding, agentic chat	128K	Text
Qwen2.5-7B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-14B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-7B-Instruct	Code generation, code reasoning	128K	Text
Qwen2.5-Coder-32B-Instruct	Code generation, fixing, agents	128K	Text
Qwen3-235B-A22B	Reasoning, coding, agentic tasks	128K	Text
Qwen3-30B-A3B-Instruct-2507	General chat, reasoning, agents	256K	Text
Qwen2.5-72B-Instruct	General chat, coding, reasoning	128K	Text
Qwen3-4B-Thinking-2507	Reasoning, math, coding	256K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Apache 2.0
Parameters: 32.5B (31.0B non-embedding)
Context length: 128K tokens (up to 8K output)
Languages: 29+ languages
Minimum hardware: ~24 GB VRAM at 4-bit
Strengths: coding, math, instruction following, structured output

Overview

Qwen2.5-32B-Instruct is a 32.5-billion-parameter instruction-tuned language model from Alibaba Cloud's Qwen team, released in September 2024 as part of the Qwen2.5 family. It is post-trained from the Qwen2.5-32B base checkpoint and sits between the 14B and 72B models in the lineup. The architecture is a dense causal transformer with 64 layers, grouped-query attention (40 query heads, 8 key/value heads), RoPE positional encoding, SwiGLU activations, RMSNorm, and QKV bias.
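The grouped-query attention layout described above largely determines how much memory the KV cache takes per token of context. The following is a rough sketch of that arithmetic, not a measurement: the head dimension of 128 and the 2-byte (FP16/BF16) cache entries are assumptions not stated on this page, and the function name is purely illustrative.

```python
def kv_cache_bytes(tokens: int, layers: int = 64, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: one key and one value vector per layer, KV head, and token."""
    # head_dim=128 and bytes_per_value=2 are assumptions; layers and kv_heads come from the text.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


GIB = 1024 ** 3
for tokens in (8_192, 32_768, 131_072):
    print(f"{tokens:>7} tokens -> {kv_cache_bytes(tokens) / GIB:.1f} GiB")
```

Under the same assumptions, caching all 40 heads instead of 8 key/value heads would make the estimate five times larger, which is the main practical benefit of grouped-query attention for long contexts.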

What it's good at

Compared to Qwen2, this generation adds noticeably more knowledge and stronger coding and mathematics, helped by specialized expert models used during training. It follows instructions more reliably, generates long texts past 8K tokens, understands structured data such as tables, and produces clean JSON output. It handles over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Russian, Japanese, Korean, and Arabic. The built-in chat template supports tool calling via <tool_call> tags, so it works as an agent backbone. In Qwen's own evaluations the 32B model surpasses the older Qwen2-72B across many tasks while using far less compute.
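When the model is used as an agent backbone, the calling code has to pull tool calls out of the generated text. The sketch below shows one way to do that, assuming each call arrives as a JSON object with "name" and "arguments" keys wrapped in <tool_call>...</tool_call> tags. The helper name, the sample reply, and the get_weather tool are hypothetical and only serve as illustration.

```python
import json
import re

# Assumed output shape: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed tool call found in the model's reply text."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of failing the whole reply
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


# Hypothetical reply; get_weather is an illustrative tool name, not a real API.
reply = (
    "Let me check.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
print(extract_tool_calls(reply))
```

Skipping malformed blocks rather than raising is a design choice for this sketch; a production agent loop might instead feed the parse error back to the model and ask it to retry.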

Running locally

At 4-bit quantization (GGUF Q4_K_M, AWQ, or GPTQ) the model needs roughly 20 GB and fits on a single 24 GB GPU such as an RTX 3090 or 4090, or a 32 GB Apple Silicon Mac. Full FP16 weights need about 64 GB, so unquantized use means multiple GPUs. You can run it through transformers, vLLM, llama.cpp, LM Studio, or Ollama. The context window is 128K tokens, but the shipped config defaults to 32,768. To use the full length, enable YaRN rope scaling, and do so only when you need it: static YaRN applies one scaling factor regardless of input length, which can slightly hurt quality on short prompts.
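The memory figures above follow from simple arithmetic on the parameter count. As a back-of-the-envelope sketch (the function name is illustrative, and the estimate deliberately ignores quantization metadata, KV cache, and runtime buffers):

```python
PARAMS = 32.5e9  # parameter count quoted on this page


def weight_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring runtime overhead."""
    return params * bits_per_weight / 8 / 1e9


for label, bits in (("FP16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label:>5}: ~{weight_gb(bits):.1f} GB of weights")
```

The pure 4-bit figure comes out at about 16 GB, below the roughly 20 GB quoted above. The gap is plausibly explained by quantized formats storing scales and keeping some tensors at higher precision, plus the KV cache and buffers the runtime allocates on top of the weights.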

License

Qwen2.5-32B-Instruct is released under the Apache 2.0 license. That allows commercial use, modification, redistribution, and fine-tuning without fees, subject to the standard attribution and notice requirements of Apache 2.0.

Frequently asked questions

What is Qwen2.5-32B-Instruct?

Qwen2.5-32B-Instruct is a 32.5-billion-parameter instruction-tuned large language model from Alibaba Cloud's Qwen team, released in September 2024. It is part of the Qwen2.5 series and is fine-tuned from the Qwen2.5-32B base model for chat, instruction following, coding, and mathematics. It supports a 128K-token context window and over 29 languages.

What hardware do I need to run Qwen2.5-32B-Instruct?

At 4-bit quantization (GGUF Q4_K_M or AWQ), Qwen2.5-32B-Instruct needs roughly 20 GB of memory and runs on a single 24 GB GPU such as an RTX 3090 or RTX 4090, or an Apple Silicon Mac with 32 GB of unified memory. Full FP16 precision requires about 64 GB, which means multiple GPUs or a 64 GB+ Mac. Reducing the context length lowers memory use further.

Can I use Qwen2.5-32B-Instruct commercially?

Yes. Qwen2.5-32B-Instruct is released under the Apache 2.0 license, which permits both research and commercial use with no licensing fees. You can download the weights from Hugging Face, run them locally, fine-tune them, and deploy them in commercial products as long as you comply with the standard Apache 2.0 terms.

What is the context length of Qwen2.5-32B-Instruct?

Qwen2.5-32B-Instruct supports a full context length of 131,072 tokens (128K) and can generate up to 8,192 tokens in a single response. The shipped config.json defaults to 32,768 tokens. To use the full 128K window, enable YaRN rope scaling, which the Qwen team recommends turning on only when you actually need long-context processing.
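Enabling YaRN amounts to adding a rope_scaling entry to the model's config.json. The sketch below shows one way to patch a local copy; the values are taken to be those given in the Qwen2.5 model card (a factor of 4.0 over the 32,768-token default) and should be checked against the card for your checkpoint. The helper names are hypothetical.

```python
import json
from pathlib import Path

# Assumed values; verify against the model card before relying on them.
YARN_ROPE_SCALING = {
    "factor": 4.0,  # 32,768 * 4 = 131,072 tokens
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def enable_yarn(config: dict) -> dict:
    """Return a copy of a config.json dict with YaRN rope scaling added."""
    patched = dict(config)
    patched["rope_scaling"] = dict(YARN_ROPE_SCALING)
    return patched


def patch_config_file(path: str) -> None:
    """Rewrite a local config.json in place; the path is supplied by the caller."""
    p = Path(path)
    p.write_text(json.dumps(enable_yarn(json.loads(p.read_text())), indent=2))


# Minimal illustration on an in-memory stand-in for the real config
print(enable_yarn({"max_position_embeddings": 32768})["rope_scaling"])
```

Keeping an unpatched copy of config.json alongside the patched one makes it easy to switch back for short-prompt workloads.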

How does Qwen2.5-32B-Instruct compare to Qwen2.5-72B-Instruct?

The 72B model scores higher across most benchmarks, including MMLU, GSM8K, and HumanEval. However, Qwen2.5-32B-Instruct gives a much better performance-per-GPU ratio: it runs on a single 24 GB card at 4-bit, while the 72B model needs far more memory. For many chat, reasoning, and coding tasks the 32B version is the practical choice for local deployment, and it outperforms the older Qwen2-72B.