Qwen3-8B

Updated: 27.06.2026
Tags: Thinking, Tools, Reasoning, Code, Multilingual

An 8.2B dense LLM from Alibaba's Qwen3 series with a switchable thinking mode, strong reasoning and coding, and support for 100+ languages.

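Install and download (shell):

pip install -U "transformers>=4.51.0"
huggingface-cli download Qwen/Qwen3-8B
# or with Ollama:
ollama run qwen3:8b

cURL (assumes an OpenAI-compatible server, such as vLLM, is already serving the model on localhost:8000):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Python (Hugging Face transformers):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Give me a short intro to LLMs."}]
# enable_thinking=True selects thinking mode; set it to False for faster, non-thinking replies.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)

# Drop the prompt tokens and decode only the newly generated ones.
new_tokens = output[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

JavaScript (Node.js ES module, openai package, same local server as the cURL example):

import OpenAI from "openai";

// The API key is a placeholder; a local server typically does not check it.
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "sk-local" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen3-8B",
  messages: [{ role: "user", content: "Hello!" }]
});
console.log(res.choices[0].message.content);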

More models

Name | Size / Usage | Context | Input
Qwen3.6-35B-A3B |  | 256K | Text, Image
Qwen3.6-27B |  | 256K | Text, Image
Qwen2.5-7B-Instruct | General chat, coding, reasoning | 128K | Text
Qwen2.5-32B-Instruct | General chat, coding, reasoning | 128K | Text
Qwen2.5-14B-Instruct | General chat, coding, reasoning | 128K | Text
Qwen2.5-Coder-7B-Instruct | Code generation, code reasoning | 128K | Text
Qwen2.5-Coder-32B-Instruct | Code generation, fixing, agents | 128K | Text
Qwen3-235B-A22B | Reasoning, coding, agentic tasks | 128K | Text
Qwen3-30B-A3B-Instruct-2507 | General chat, reasoning, agents | 256K | Text
Qwen2.5-72B-Instruct | General chat, coding, reasoning | 128K | Text
Qwen3-4B-Thinking-2507 | Reasoning, math, coding | 256K | Text
WebWorld-8B | Web agents, multimodal reasoning | 40K | Text, Image
MiniCPM-V 4.6 | 5213 GB421K |  | Text, Image
anima | 421 GB31K |  | Text
Qwen3-Coder-30B-A3B-Instruct |  | 256K | Text
Qwen3-30B-A3B |  | 128K | Text
Qwen3-14B |  | 128K | Text
Qwen3-32B |  | 128K | Text

At a glance

  • License: Apache 2.0
  • Parameters: 8.2B (dense, 36 layers)
  • Context length: 32K native, 128K with YaRN
  • Languages: 100+ languages and dialects
  • Minimum hardware: ~8 GB VRAM at 4-bit
  • Strengths: switchable reasoning, coding, agentic tool use

Overview

Qwen3-8B is a dense, 8.2-billion-parameter language model from Alibaba's Qwen team, part of the Qwen3 family released in 2025. It has 36 layers and uses grouped-query attention with 32 query heads and 8 key/value heads. Like the rest of the Qwen3 lineup, it is fine-tuned from a base checkpoint (Qwen3-8B-Base) for chat, reasoning, and agentic use. The headline feature is a single model that switches between a thinking mode for complex problems and a non-thinking mode for fast everyday dialogue.
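As a rough illustration of the two modes, the sketch below separates the reasoning block from the final answer in decoded output. It assumes the reasoning is wrapped in `<think>...</think>` tags, as Qwen3 emits in thinking mode; the helper name `split_thinking` is hypothetical and not part of any library.

```python
def split_thinking(decoded: str) -> tuple[str, str]:
    """Return (reasoning, answer) from decoded model output.

    Assumes thinking-mode output looks like "<think>...</think>answer".
    If no closing tag is present, the whole text is treated as the answer.
    """
    close_tag = "</think>"
    if close_tag not in decoded:
        return "", decoded.strip()
    reasoning, answer = decoded.split(close_tag, 1)
    # The opening tag may be missing if it was part of the prompt template.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


reasoning, answer = split_thinking("<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4.")
print("reasoning:", reasoning)
print("answer:", answer)
```

With `enable_thinking=False` passed to `apply_chat_template`, the output should contain no reasoning block, in which case the helper returns an empty reasoning string and the full text as the answer.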

What it's good at

In thinking mode, Qwen3-8B handles math, code generation, and multi-step logical reasoning, and the Qwen team reports it surpasses the earlier QwQ and Qwen2.5-instruct models on those tasks. It was trained with agent capabilities in mind, so it integrates with external tools and function calls in both modes and performs well on tool-use benchmarks for its size. It supports more than 100 languages and dialects, with solid multilingual instruction-following and translation. For general chat, role-play, and creative writing, the non-thinking mode gives quicker responses without the reasoning overhead.
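Tool use can be sketched as a request body in the OpenAI-style `tools` format plus a small dispatcher for the tool calls the model returns. This is a minimal sketch, assuming an OpenAI-compatible server with tool calling enabled; the `get_weather` tool, `REGISTRY`, `build_request`, and `run_tool_call` are hypothetical names made up for illustration.

```python
import json

# Hypothetical tool for illustration; not part of Qwen3 or any library.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# Tool description in the OpenAI chat-completions "tools" schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Maps tool names the model may call to local Python functions.
REGISTRY = {"get_weather": get_weather}

def build_request(user_text: str) -> dict:
    """Body for POST /v1/chat/completions with the tools attached."""
    return {
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": user_text}],
        "tools": TOOLS,
    }

def run_tool_call(tool_call: dict) -> dict:
    """Run one tool call from a response and build the 'tool' message to send back."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    result = REGISTRY[fn["name"]](**args)
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}
```

In a full loop, the returned `tool` message is appended to `messages` and the request is sent again so the model can write its final answer from the tool result.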

Running locally

At 8.2B parameters, the model is approachable on consumer hardware. A 4-bit quant (Q4_K_M) is around 5 GB and runs on an 8 GB GPU; Q8 needs roughly 9 GB. CPU-only inference works at Q4_K_M with 16 GB of RAM, at a few tokens per second. The model runs in Hugging Face transformers (4.51.0 or newer), vLLM, llama.cpp, and Ollama (ollama run qwen3:8b). Native context is 32,768 tokens (32K); YaRN scaling extends it to 131,072 tokens (128K) at the cost of extra KV-cache memory, so it is best enabled only when you need it.
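The quantized sizes quoted here can be approximated from the parameter count alone. The sketch below is a back-of-the-envelope estimate, not a measurement: the bits-per-weight averages are assumptions (roughly 4.85 for Q4_K_M and 8.5 for Q8_0 in llama.cpp-style quantization), and `weight_size_gb` is a hypothetical helper.

```python
PARAMS = 8.2e9  # parameter count from the model card

# Approximate average bits per weight. These are assumptions, not official figures.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "FP16": 16.0}

def weight_size_gb(quant: str, params: float = PARAMS) -> float:
    """Estimated size of the weights in decimal gigabytes."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for name in BITS_PER_WEIGHT:
    print(f"{name}: ~{weight_size_gb(name):.1f} GB")
```

Under these assumptions the estimate comes out near 5 GB for Q4_K_M and just under 9 GB for Q8_0, in line with the figures quoted. Runtime overhead and the KV cache come on top of the weights.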

License

Qwen3-8B is released under Apache 2.0. That allows free commercial use, modification, and redistribution, with the standard requirement to keep the license and copyright notices intact. The weights are hosted openly on Hugging Face.

Frequently asked questions

What is Qwen3-8B?

Qwen3-8B is an 8.2-billion-parameter dense language model from Alibaba's Qwen team, released as part of the Qwen3 generation. It is a causal language model post-trained for chat, reasoning, and agentic tool use. A defining feature is its dual-mode design: it can switch between a thinking mode for math, coding, and logic, and a non-thinking mode for fast general dialogue, all within the same checkpoint.

How much VRAM do I need to run Qwen3-8B?

At 4-bit quantization (Q4_K_M) the model weights take roughly 5 GB, so a GPU with 8 GB of VRAM can run it comfortably at standard context lengths. Q8 needs about 9 GB. CPU-only inference works at Q4_K_M with 16 GB of system RAM, though throughput drops to a few tokens per second. Extending context toward 128K with YaRN can add many GB of KV-cache memory on top of these figures, depending on how much context is actually used and the cache precision.
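How the KV cache grows with context can be estimated from the architecture. This is a sketch under stated assumptions: the layer and key/value head counts come from the model description, while the head dimension of 128 and the 16-bit cache precision are assumptions not stated on this page. `kv_cache_gib` is a hypothetical helper.

```python
LAYERS = 36          # from the model card
KV_HEADS = 8         # from the model card (grouped-query attention)
HEAD_DIM = 128       # assumption: not stated on this page
BYTES_PER_VALUE = 2  # assumption: 16-bit cache; a quantized cache would be smaller

def kv_cache_gib(context_tokens: int) -> float:
    """Estimated KV-cache size in GiB for a single sequence."""
    # The factor 2 covers the separate key and value tensors in each layer.
    per_token_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return context_tokens * per_token_bytes / 2**30

for tokens in (8_192, 32_768, 131_072):
    print(f"{tokens:>7} tokens: ~{kv_cache_gib(tokens):.2f} GiB")
```

Under these assumptions the cache costs about 144 KiB per token, so it scales linearly: roughly 1.1 GiB at 8K tokens, 4.5 GiB at the native 32K, and 18 GiB if the full 128K window is filled. A lower-precision cache or a shorter actual context shrinks this proportionally.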

What is the context length of Qwen3-8B?

Qwen3-8B has a native context window of 32,768 tokens. With YaRN RoPE scaling it can be extended to 131,072 tokens (128K) for long-document and long-context tasks. The Qwen team recommends enabling YaRN only when you actually need the longer window, because it can slightly reduce quality on inputs shorter than 32K and increases KV-cache memory use.
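One way to follow the "only when you need it" advice is to size the scaling factor to the context you actually plan to use. The sketch below builds a `rope_scaling` entry of the kind placed in a Hugging Face `config.json`; the helper `yarn_rope_scaling` is hypothetical, and the exact way to pass this setting differs between transformers, vLLM, and llama.cpp, so treat it as an assumption to check against your runtime.

```python
import json

NATIVE_CONTEXT = 32_768  # native window from the model card

def yarn_rope_scaling(target_context: int, native: int = NATIVE_CONTEXT) -> dict:
    """Build a rope_scaling entry sized to the context you actually need."""
    if target_context <= native:
        raise ValueError("YaRN is only needed beyond the native window")
    return {
        "rope_type": "yarn",
        "factor": target_context / native,  # 4.0 reaches 131,072 tokens
        "original_max_position_embeddings": native,
    }

# Example: the fragment one might merge into config.json for the full 128K window.
print(json.dumps({"rope_scaling": yarn_rope_scaling(131_072)}, indent=2))
```

For a 64K workload the same helper yields a factor of 2.0, which keeps the scaling no larger than the task requires.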

Is Qwen3-8B free for commercial use?

Yes. Qwen3-8B is released under the Apache 2.0 license, which permits free use, modification, redistribution, and commercial deployment without paying royalties. The weights are openly available on Hugging Face, so you can download and run the model on your own hardware. Apache 2.0 requires you to preserve the license and copyright notices in redistributions.

Which languages does Qwen3-8B support?

Qwen3-8B supports over 100 languages and dialects, with strong multilingual instruction-following and translation ability. Coverage spans major European, Asian, and Middle Eastern languages, making it usable for cross-lingual chat and translation tasks well beyond English and Chinese.