Qwen2.5-14B-Instruct

Tags: Reasoning, Code, Multilingual, Tools

A 14.7B-parameter instruction-tuned LLM from Alibaba's Qwen2.5 series, strong at coding and math, with support for 29+ languages.

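# Install transformers, then fetch the weights from Hugging Face
pip install -U transformers
huggingface-cli download Qwen/Qwen2.5-14B-Instruct
# Or serve an OpenAI-compatible API with vLLM (requires: pip install vllm)
vllm serve Qwen/Qwen2.5-14B-Instruct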

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen2.5-14B-Instruct"
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" spreads layers over available devices (needs accelerate)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
messages = [{"role": "user", "content": "Give me a short intro to LLMs."}]
# Render the conversation with the model's chat template and append the assistant turn marker
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "not-needed" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen2.5-14B-Instruct",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen3-8B	Reasoning, coding, agentic chat	128K	Text
Qwen2.5-7B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-32B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-7B-Instruct	Code generation, code reasoning	128K	Text
Qwen2.5-Coder-32B-Instruct	Code generation, fixing, agents	128K	Text
Qwen3-235B-A22B	Reasoning, coding, agentic tasks	128K	Text
Qwen3-30B-A3B-Instruct-2507	General chat, reasoning, agents	256K	Text
Qwen2.5-72B-Instruct	General chat, coding, reasoning	128K	Text
Qwen3-4B-Thinking-2507	Reasoning, math, coding	256K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Apache 2.0
Context length: 128K tokens (8K output)
Parameters: 14.7B dense (13.1B non-embedding)
Languages: 29+ languages
Minimum hardware: ~12 GB VRAM at 4-bit, 16 GB comfortable
Strengths: coding, math, instruction following, structured output

Overview

Qwen2.5-14B-Instruct is an instruction-tuned large language model from Alibaba Cloud's Qwen team, released in September 2024 as part of the Qwen2.5 series. The series spans base and instruct models from 0.5B to 72B parameters; this 14.7B variant sits in the mid range, with 13.1B non-embedding parameters across 48 layers. It uses a causal Transformer design with RoPE, SwiGLU, RMSNorm, and grouped-query attention (40 query heads, 8 key/value heads).

What it's good at

Qwen2.5 was trained with specialized expert data for coding and mathematics, and the 14B instruct model carries those gains. It handles instruction following well, generates long text past 8K tokens, understands structured data such as tables, and reliably produces JSON. The model supports more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Japanese, Korean, and Arabic. Its chat template includes native tool/function-calling support, returning calls inside <tool_call> tags, which makes it usable for agentic workflows.
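
The `<tool_call>` convention described above can be handled on the client side with a small parser. The sketch below assumes each call is a JSON object with `name` and `arguments` keys wrapped in `<tool_call>...</tool_call>` tags; the helper name `extract_tool_calls`, the `get_weather` tool, and the sample string are hypothetical and only illustrate the shape of the output.

```python
import json
import re

# Captures whatever sits between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return the tool calls found in raw model output (hypothetical helper)."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # Assumed payload shape: {"name": ..., "arguments": {...}}
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


# Illustrative model output; the tool name and arguments are made up.
sample = (
    "Let me check.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
print(extract_tool_calls(sample))
# prints [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```

When serving through vLLM or another OpenAI-compatible server with tool parsing enabled, the server may already return structured tool calls, in which case manual parsing like this is unnecessary.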

Running locally

At 4-bit quantization the model needs about 9 GB of memory, so a 12 GB GPU is the practical floor and 16 GB runs it with headroom; Q8 weights need roughly 15 GB and full FP16 around 28 GB. Apple Silicon Macs with 18 GB or more unified memory also work. It runs in Hugging Face transformers (4.37+), and for serving the Qwen team recommends vLLM; GGUF builds run in llama.cpp and Ollama. The shipped config caps context at 32,768 tokens, and the full 131,072-token window is enabled via YaRN rope scaling, which is best added only when long inputs are actually needed.
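
The memory figures above follow roughly from the parameter count multiplied by the bits stored per weight. The sketch below is a back-of-the-envelope estimate only: the effective bits-per-weight values for the GGUF quantization types (about 4.85 for Q4_K_M and 8.5 for Q8_0) are approximate assumptions, not numbers taken from this page, and the result excludes KV cache and runtime overhead.

```python
PARAMS = 14.7e9  # total parameter count quoted for the model


def weight_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Bits-per-weight values are assumed averages for each format.
for name, bpw in [("Q4_K_M", 4.85), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"{name}: ~{weight_gb(bpw):.1f} GB")
```

Under these assumptions the estimate comes out near 8.9, 15.6, and 29.4 GB. The small differences from the quoted figures are consistent with rounding and with reporting in GiB rather than GB (29.4 GB is about 27.4 GiB).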

License

Qwen2.5-14B-Instruct is released under Apache 2.0. The license permits commercial and private use, modification, and redistribution, and it does not require you to share your changes. Self-hosting the weights carries no per-token fees.

Frequently asked questions

What is Qwen2.5-14B-Instruct?

Qwen2.5-14B-Instruct is an instruction-tuned large language model from Alibaba Cloud's Qwen team, part of the Qwen2.5 series released in September 2024. It has 14.7 billion parameters and is built on a causal Transformer architecture with RoPE, SwiGLU, RMSNorm, and grouped-query attention (GQA). It is tuned for chat, instruction following, coding, and math.

How much memory does Qwen2.5-14B-Instruct need?

At 4-bit quantization (Q4_K_M) the 14.7B model needs roughly 9 GB of memory, so a 12 GB GPU is the practical minimum and 16 GB runs it comfortably with room for context. Higher-precision quantizations need more: about 11 GB at Q5_K_M, 15 GB at Q8_0, and around 28 GB at full FP16. Apple Silicon Macs with 18 GB or more unified memory can also run it. Long contexts add several GB of KV cache on top.
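
The "several GB of KV cache" figure can be checked from the architecture numbers given in the overview (48 layers, 8 key/value heads). The sketch below additionally assumes a head dimension of 128 and a 16-bit cache; the head dimension is not stated on this page and is derived here from an assumed hidden size of 5120 split across the 40 query heads.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int = 48,          # from the overview
    kv_heads: int = 8,         # from the overview (grouped-query attention)
    head_dim: int = 128,       # assumption: hidden size 5120 / 40 query heads
    bytes_per_value: int = 2,  # assumption: FP16/BF16 cache
) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for n in (8_192, 32_768):
    print(f"{n} tokens: {kv_cache_bytes(n) / 2**30:.1f} GiB")
# prints:
# 8192 tokens: 1.5 GiB
# 32768 tokens: 6.0 GiB
```

Because only the 8 key/value heads are cached rather than all 40 query heads, the cache is a fifth of what full multi-head attention would need at the same head dimension.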

What is the context length of Qwen2.5-14B-Instruct?

Qwen2.5-14B-Instruct supports a context window of up to 131,072 tokens (128K) and can generate up to 8,192 tokens. By default the shipped config.json caps context at 32,768 tokens; to use the full 128K window you enable YaRN rope scaling, which the Qwen team recommends only when long inputs are actually needed.
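
Enabling YaRN amounts to adding a `rope_scaling` entry to the local `config.json`. The sketch below shows one way to do that; the values (factor 4.0 over the original 32,768 positions, giving 131,072) follow the snippet in Qwen's model card as we understand it and should be checked against the card before use. The helper name `enable_yarn` is hypothetical.

```python
import json
from pathlib import Path

# Assumed values: 4.0 * 32,768 = 131,072 tokens.
YARN_ROPE_SCALING = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def enable_yarn(config_path: str) -> dict:
    """Add the YaRN rope_scaling entry to a local config.json (hypothetical helper)."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["rope_scaling"] = dict(YARN_ROPE_SCALING)
    # Rewrite the file in place, leaving all other keys untouched.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

A call would look like `enable_yarn("path/to/Qwen2.5-14B-Instruct/config.json")`, where the path is a placeholder for your local download. Some serving stacks apply the scaling factor regardless of input length, which can affect quality on short prompts, so it is reasonable to keep a copy of the unmodified config for everyday use.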

Is Qwen2.5-14B-Instruct free for commercial use?

Yes. Qwen2.5-14B-Instruct is released under the Apache 2.0 license, which permits free commercial and private use, modification, and redistribution. The weights are openly available to download from Hugging Face, with no per-token fees when you self-host.

Which languages does Qwen2.5-14B-Instruct support?

Qwen2.5-14B-Instruct supports more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic. The Qwen2.5 series also improved structured-data understanding and JSON output, which helps across multilingual tool-use and chat tasks.