Qwen2.5-72B-Instruct

Updated: 27.06.2026
Reasoning, Code, Multilingual, Tools

A 72.7B instruction-tuned LLM from Alibaba's Qwen2.5 series with strong coding, math and multilingual ability across 29+ languages.

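Shell (download the weights, or serve them with vLLM):

# device_map="auto" in the Python example below also needs PyTorch and accelerate
pip install -U transformers accelerate
huggingface-cli download Qwen/Qwen2.5-72B-Instruct
# Or serve with vLLM (requires `pip install vllm`; match --tensor-parallel-size to your GPU count):
vllm serve Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 2
# Once the server is up, query its OpenAI-compatible endpoint from a second terminal:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-72B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Python (Hugging Face transformers):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in their native dtype and spread them across the available GPUs
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-72B-Instruct", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")
messages = [{"role": "user", "content": "Give me a short intro to LLMs."}]
# Render the chat template as text and append the assistant turn marker
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

JavaScript (OpenAI client pointed at the local vLLM server):

import OpenAI from "openai";

// vLLM does not check the API key by default, so any placeholder works
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "EMPTY" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-72B-Instruct",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(res.choices[0].message.content);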

More models

Name                          Size / Usage                      Context  Input
Qwen3.6-35B-A3B               -                                 256K     Text, Image
Qwen3.6-27B                   -                                 256K     Text, Image
Qwen3-8B                      Reasoning, coding, agentic chat   128K     Text
Qwen2.5-7B-Instruct           General chat, coding, reasoning   128K     Text
Qwen2.5-32B-Instruct          General chat, coding, reasoning   128K     Text
Qwen2.5-14B-Instruct          General chat, coding, reasoning   128K     Text
Qwen2.5-Coder-7B-Instruct     Code generation, code reasoning   128K     Text
Qwen2.5-Coder-32B-Instruct    Code generation, fixing, agents   128K     Text
Qwen3-235B-A22B               Reasoning, coding, agentic tasks  128K     Text
Qwen3-30B-A3B-Instruct-2507   General chat, reasoning, agents   256K     Text
Qwen3-4B-Thinking-2507        Reasoning, math, coding           256K     Text
WebWorld-8B                   Web agents, multimodal reasoning  40K      Text, Image
MiniCPM-V 4.6                 5213 GB                           421K     Text, Image
anima                         421 GB                            31K      Text
Qwen3-Coder-30B-A3B-Instruct  -                                 256K     Text
Qwen3-30B-A3B                 -                                 128K     Text
Qwen3-14B                     -                                 128K     Text
Qwen3-32B                     -                                 128K     Text

At a glance

  • License: Qwen License (commercial use permitted; products with more than 100M monthly active users need a separate license)
  • Context length: 128K tokens (131,072), up to 8K output
  • Parameters: 72.7B dense (70.0B non-embedding), 80 layers
  • Languages: 29+
  • Minimum hardware: ~45-48 GB VRAM at 4-bit; multi-GPU at full precision
  • Strengths: coding, mathematics, instruction following, structured output

Overview

Qwen2.5-72B-Instruct is the largest instruction-tuned model in Alibaba Cloud's Qwen2.5 series, released in September 2024 by the Qwen team. It has 72.7 billion parameters (70.0B excluding embeddings) spread across 80 transformer layers, and uses a Qwen2 architecture with RoPE position embeddings, SwiGLU activations, RMSNorm, grouped-query attention (64 query heads, 8 key/value heads), and QKV bias. The model is the chat-tuned variant of the Qwen2.5-72B base model and is meant for direct deployment as an assistant.

What it's good at

Compared with Qwen2, this release adds noticeably more knowledge and sharper coding and mathematics, which the Qwen team credits to specialized expert models used during training. It follows instructions more reliably, writes long outputs past 8K tokens, reads structured data like tables, and produces clean structured output such as JSON. It handles function calling through a tool-call template, which suits agentic and API-driven workflows. Multilingual coverage spans more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Russian, Japanese, Korean, and Arabic.
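As a rough illustration of the function-calling flow, the sketch below pulls tool calls out of raw model text. It assumes the model wraps each call in `<tool_call>...</tool_call>` tags around a JSON object with `name` and `arguments` keys, which is how the Qwen2.5 chat template is understood to format them; verify this against the template shipped with your copy of the tokenizer. The helper `parse_tool_calls` and the `get_weather` tool are hypothetical names used only for this example.

```python
import json
import re

# Assumption: each call is emitted as <tool_call>{...JSON...}</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Return the well-formed tool calls found in raw model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing the agent loop
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


# Hypothetical model output containing one call to a hypothetical tool.
sample = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
print(parse_tool_calls(sample))
```

When the model is served through an OpenAI-compatible server with tool-call parsing enabled, this extraction is normally done server-side and the calls arrive as structured fields, so a parser like this is mainly useful when working with raw generations.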

Running locally

The weights run with Hugging Face transformers (4.37.0 or newer). At full BF16 precision the 72B weights need around 145 GB, so unquantized inference typically uses two or more high-memory GPUs; vLLM with tensor parallelism is the recommended serving path. With 4-bit quantization (GPTQ, AWQ, or GGUF via llama.cpp and Ollama) memory drops to roughly 45-48 GB, which fits a single 48 GB card. The default context is 32,768 tokens; reaching the full 131,072-token window requires enabling YaRN rope scaling, which Qwen suggests turning on only for genuinely long inputs.
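The memory figures above follow from simple arithmetic on the parameter count. The sketch below computes the footprint of the weights alone, using decimal gigabytes (1 GB = 1e9 bytes); the function name `weight_memory_gb` is illustrative, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 72.7e9  # total parameter count quoted for Qwen2.5-72B-Instruct

bf16 = weight_memory_gb(N_PARAMS, 16)  # 2 bytes per parameter
int4 = weight_memory_gb(N_PARAMS, 4)   # half a byte per parameter

print(f"BF16 weights:  about {bf16:.0f} GB")
print(f"4-bit weights: about {int4:.0f} GB")
```

The BF16 result lands at roughly 145 GB. The raw 4-bit weights come to roughly 36 GB, so the 45-48 GB figure presumably leaves headroom for quantization metadata, activations, and the KV cache; treat the exact overhead as dependent on the quantization format and runtime.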

License

Qwen2.5-72B-Instruct is distributed under the Qwen License, not Apache 2.0. The weights are free to download and the license permits commercial use, but products with more than 100 million monthly active users must obtain a separate license from Alibaba Cloud. Review the license text before shipping the model in a large-scale product.

Frequently asked questions

What is Qwen2.5-72B-Instruct?
Qwen2.5-72B-Instruct is a 72.7-billion-parameter instruction-tuned large language model from the Qwen team at Alibaba Cloud, released in September 2024. It is a causal (decoder-only) transformer post-trained for chat, coding, math, and structured-output tasks, and it supports a context window of up to 128K tokens.

What hardware do I need to run Qwen2.5-72B-Instruct?
At full BF16/FP16 precision the 72B weights need roughly 145 GB of memory, so running unquantized usually means multiple GPUs (for example 2x A100 80GB). With 4-bit quantization the model fits in about 45-48 GB of VRAM, which makes a single 48 GB card or two 24 GB cards workable. Long contexts approaching 128K tokens also need extra memory for the KV cache, on top of the weights.
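To get a feel for the KV cache cost, the sketch below estimates its size from the architecture figures given in the overview (80 layers, 8 key/value heads). The head dimension of 128 is an assumption not stated on this page (it corresponds to a hidden size of 8192 split over 64 query heads), as is storing the cache at 2 bytes per value; `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int = 80,          # from the overview
    kv_heads: int = 8,         # from the overview (grouped-query attention)
    head_dim: int = 128,       # assumption: hidden size 8192 / 64 query heads
    bytes_per_value: int = 2,  # assumption: BF16/FP16 cache entries
) -> int:
    """Bytes of KV cache for one sequence of the given length."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token = kv_cache_bytes(1)
full_window = kv_cache_bytes(131_072)

print(per_token / 1024, "KiB per token")              # 320.0 KiB per token
print(full_window / 1024**3, "GiB at 131,072 tokens")  # 40.0 GiB at 131,072 tokens
```

Under these assumptions a single full-length sequence adds about 40 GiB on top of the weights, which is why long-context serving needs noticeably more memory than the weight figures alone suggest. Serving frameworks may quantize or page the cache, so real usage can differ.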

Is Qwen2.5-72B-Instruct free to use?
The weights are openly available to download from Hugging Face at no cost. The 72B model is released under the Qwen License rather than a standard permissive license such as Apache 2.0. The Qwen License allows commercial use, but products or services with more than 100 million monthly active users must request a separate license from Alibaba Cloud.

Which languages does Qwen2.5-72B-Instruct support?
Qwen2.5-72B-Instruct supports more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic. This coverage makes it suitable for translation, cross-language chat, and content generation outside English.

What is the context length of Qwen2.5-72B-Instruct?
The model supports a full context length of 131,072 tokens (128K) and can generate up to 8,192 tokens in a single response. The shipped config.json defaults to 32,768 tokens; to use the full 128K window you enable YaRN rope scaling, which Qwen recommends only when long-context input is actually needed, since it can reduce quality on short prompts.
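A minimal sketch of what enabling YaRN can look like, assuming you patch a local copy of config.json before loading the model. The `rope_scaling` values (a factor of 4.0 over the original 32,768-token window) follow the settings Qwen documents for Qwen2.5 long-context use; the helper `enable_yarn` and the stand-in config dict are hypothetical, and the exact key names your serving framework expects should be checked against its documentation.

```python
import copy
import json


def enable_yarn(config: dict, factor: float = 4.0, original_max: int = 32768) -> dict:
    """Return a copy of a model config dict with YaRN rope scaling added."""
    patched = copy.deepcopy(config)  # leave the caller's dict untouched
    patched["rope_scaling"] = {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max,
    }
    return patched


# Stand-in for the shipped config.json; a real file has many more fields.
base = {"max_position_embeddings": 32768}
patched = enable_yarn(base)

print(json.dumps(patched["rope_scaling"], indent=2))
print("Scaled window:", int(32768 * 4.0), "tokens")  # Scaled window: 131072 tokens
```

Keeping the patch in a separate copy of the config makes it easy to switch back to the default 32,768-token setup for workloads dominated by short prompts.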