Qwen2.5-72B-Instruct

Reasoning

Code

Multilingual

Tools

A 72.7B instruction-tuned LLM from Alibaba's Qwen2.5 series with strong coding, math and multilingual ability across 29+ languages.

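# Install the libraries (accelerate is required for device_map="auto" in the Python example)
pip install -U transformers accelerate torch
# Download the weights from Hugging Face
huggingface-cli download Qwen/Qwen2.5-72B-Instruct
# Or serve with vLLM (requires vLLM to be installed):
vllm serve Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 2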

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-72B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Load the model and tokenizer, then generate one reply using the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "Qwen/Qwen2.5-72B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Give me a short intro to LLMs."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "EMPTY" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-72B-Instruct",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen3-8B	Reasoning, coding, agentic chat	128K	Text
Qwen2.5-7B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-32B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-14B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-7B-Instruct	Code generation, code reasoning	128K	Text
Qwen2.5-Coder-32B-Instruct	Code generation, fixing, agents	128K	Text
Qwen3-235B-A22B	Reasoning, coding, agentic tasks	128K	Text
Qwen3-30B-A3B-Instruct-2507	General chat, reasoning, agents	256K	Text
Qwen3-4B-Thinking-2507	Reasoning, math, coding	256K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Qwen License (commercial use, >100M MAU needs a separate license)
Context length: 128K tokens (131,072), up to 8K output
Parameters: 72.7B dense (70.0B non-embedding), 80 layers
Languages: 29+ languages
Minimum hardware: ~45-48 GB VRAM at 4-bit; multi-GPU at full precision
Strengths: coding, mathematics, instruction following, structured output

Overview

Qwen2.5-72B-Instruct is the largest instruction-tuned model in Alibaba Cloud's Qwen2.5 series, released in September 2024 by the Qwen team. It has 72.7 billion parameters (70.0B excluding embeddings) spread across 80 transformer layers, and uses a Qwen2 architecture with RoPE position embeddings, SwiGLU activations, RMSNorm, grouped-query attention (64 query heads, 8 key/value heads), and QKV bias. The model is the chat-tuned variant of the Qwen2.5-72B base model and is meant for direct deployment as an assistant.
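
The grouped-query attention layout described above can be illustrated with a small sketch. It assumes that query heads are assigned to key/value heads in contiguous groups, which is the usual convention but is not stated in this page; the function name `kv_head_for` is hypothetical.

```python
# Head counts taken from the overview above.
NUM_QUERY_HEADS = 64
NUM_KV_HEADS = 8


def kv_head_for(query_head: int) -> int:
    """Return the key/value head that a given query head reads from.

    ASSUMPTION: heads are grouped contiguously (0-7 -> KV head 0, 8-15 -> 1, ...).
    """
    if not 0 <= query_head < NUM_QUERY_HEADS:
        raise ValueError(f"query head {query_head} out of range")
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS  # 8 query heads per KV head
    return query_head // group_size


# Build the full mapping: KV head -> list of query heads that share it.
groups = {}
for q in range(NUM_QUERY_HEADS):
    groups.setdefault(kv_head_for(q), []).append(q)

print(groups[0])  # query heads that share KV head 0
print(NUM_QUERY_HEADS // NUM_KV_HEADS)  # how many times smaller the KV cache is vs. one KV head per query head
```

Because only 8 key/value heads are cached instead of 64, the KV cache is one eighth the size it would be with standard multi-head attention at the same head dimension.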

What it's good at

Compared with Qwen2, this release adds noticeably more knowledge and sharper coding and mathematics, which the Qwen team credits to specialized expert models used during training. It follows instructions more reliably, handles long-form generation better (Qwen cites texts over 8K tokens, while a single response is capped at 8,192 tokens), reads structured data like tables, and produces clean structured output such as JSON. It handles function calling through a tool-call template, which suits agentic and API-driven workflows. Multilingual coverage spans more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Russian, Japanese, Korean, and Arabic.
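
On the application side, function calling usually means pulling the tool request out of the model's text and validating it before dispatch. The sketch below assumes the response wraps each call as a JSON object inside `<tool_call>` tags; that wire format, the helper name `extract_tool_calls`, and the `get_weather` tool are assumptions for illustration, so check the tokenizer's chat template (or use your server's built-in tool-call parser) before relying on it.

```python
import json
import re

# ASSUMED wire format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Return every well-formed tool call found in a model response."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON rather than crash the agent loop
        if isinstance(call, dict) and "name" in call:
            calls.append({"name": call["name"], "arguments": call.get("arguments", {})})
    return calls


# Hypothetical model output requesting a hypothetical get_weather tool.
sample = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(sample))
```

Skipping malformed calls instead of raising keeps an agent loop alive when the model produces invalid JSON; a stricter application might log or retry instead.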

Running locally

The weights run with Hugging Face transformers (4.37.0 or newer). At full BF16 precision the weights alone need around 145 GB (72.7B parameters at 2 bytes each), so unquantized inference typically uses two or more high-memory GPUs; vLLM with tensor parallelism is the recommended serving path. With 4-bit quantization (GPTQ, AWQ, or GGUF via llama.cpp and Ollama) memory drops to roughly 45-48 GB, which just fits a single 48 GB card with little headroom for long contexts. The default context is 32,768 tokens; reaching the full 131,072-token window requires enabling YaRN rope scaling, which Qwen suggests turning on only for genuinely long inputs.
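
The memory figures above follow from simple arithmetic on the parameter count. The sketch below covers the weights only, in decimal gigabytes; the helper name `weight_memory_gb` is hypothetical, and real deployments need additional memory for activations, the KV cache, and quantization metadata, which is presumably why the quoted 4-bit figure is higher than the raw result.

```python
PARAMS = 72.7e9  # total parameter count quoted on this page


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(PARAMS, 4)   # half a byte per parameter

print(f"BF16 weights:  {bf16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
```

The BF16 result is about 145 GB, matching the figure in the text, and the 4-bit result is about 36 GB before any runtime overhead is added.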

License

Qwen2.5-72B-Instruct is distributed under the Qwen License, not Apache 2.0. The weights are free to download and the license permits commercial use, but products with more than 100 million monthly active users must obtain a separate license from Alibaba Cloud. Review the license text before shipping the model in a large-scale product.

Frequently asked questions

What is Qwen2.5-72B-Instruct?

Qwen2.5-72B-Instruct is a 72.7-billion-parameter instruction-tuned large language model from the Qwen team at Alibaba Cloud, released in September 2024. It is a causal (decoder-only) transformer post-trained for chat, coding, math, and structured-output tasks, and it supports a context window of up to 128K tokens.

What hardware do I need to run it?

At full BF16/FP16 precision the 72B weights need roughly 145 GB of memory, so running unquantized usually means multiple GPUs (for example 2x A100 80GB). With 4-bit quantization the model fits in about 45-48 GB of VRAM, which makes a single 48 GB card or two 24 GB cards workable. For long 128K-token contexts you also need extra memory for the KV cache.
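
The KV cache cost mentioned above can be estimated from the architecture numbers on this page. The sketch assumes a head dimension of 128 (a hidden size of 8192 divided by 64 query heads, which is not stated here) and a 16-bit cache; the function name `kv_cache_bytes` is hypothetical, and serving engines may allocate the cache differently.

```python
NUM_LAYERS = 80      # from the model description
NUM_KV_HEADS = 8     # grouped-query attention, from the model description
HEAD_DIM = 128       # ASSUMPTION: hidden size 8192 / 64 query heads
BYTES_PER_VALUE = 2  # ASSUMPTION: BF16/FP16 cache, no KV quantization


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both a key and a value per head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for tokens in (32_768, 131_072):
    print(f"{tokens:>7} tokens: {kv_cache_bytes(tokens) / 2**30:.0f} GiB")
```

Under these assumptions a single 32,768-token sequence needs 10 GiB of cache and a full 131,072-token sequence needs 40 GiB, on top of the weights.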

Is it free to use, and what license does it have?

The weights are openly available to download from Hugging Face at no cost. The 72B model is released under the Qwen License rather than a standard permissive license such as Apache 2.0. The Qwen License allows commercial use, but products or services with more than 100 million monthly active users must request a separate license from Alibaba Cloud.

Which languages does it support?

Qwen2.5-72B-Instruct supports more than 29 languages. These include Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic. This broad multilingual coverage makes it suitable for translation, cross-language chat, and content generation outside English.

What is the context length?

The model supports a full context length of 131,072 tokens (about 128K) and can generate up to 8,192 tokens in a single response. The shipped config.json defaults to 32,768 tokens; to use the full 128K window you enable YaRN rope scaling, which Qwen recommends only when long-context input is actually needed since it can reduce quality on short prompts.
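
Enabling YaRN amounts to adding a `rope_scaling` entry to the model's config. The sketch below builds that entry on a trimmed stand-in for config.json rather than a real file; the key names (`type`, `factor`, `original_max_position_embeddings`) are assumed to match Qwen's published instructions and should be verified against the model card, and the helper name `with_yarn` is hypothetical.

```python
import json


def with_yarn(config: dict, target_context: int = 131_072) -> dict:
    """Return a copy of a config dict with YaRN rope scaling added."""
    native = config["max_position_embeddings"]  # 32,768 in the shipped config
    patched = dict(config)  # leave the caller's dict untouched
    patched["rope_scaling"] = {
        "type": "yarn",
        "factor": target_context / native,  # 131072 / 32768 = 4.0
        "original_max_position_embeddings": native,
    }
    return patched


shipped = {"max_position_embeddings": 32_768}  # trimmed stand-in for config.json
print(json.dumps(with_yarn(shipped), indent=2))
```

Keeping the unpatched config as the default and applying the scaled one only for long inputs mirrors the recommendation above, since the scaling can reduce quality on short prompts.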