Qwen3-8B

Thinking

Tools

Reasoning

Code

Multilingual

Run

An 8.2B dense LLM from Alibaba's Qwen3 series with switchable thinking mode, strong reasoning, coding, and 100+ language support.

pip install -U "transformers>=4.51.0"
huggingface-cli download Qwen/Qwen3-8B
# or with Ollama:
ollama run qwen3:8b

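# Assumes an OpenAI-compatible server is already listening on localhost:8000, e.g. started with: vllm serve Qwen/Qwen3-8B
curl http://localhost:8000/v1/chat/completions \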
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "Give me a short intro to LLMs."}]
# enable_thinking=True selects thinking mode; pass False for non-thinking mode
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens, skipping the prompt
new_tokens = output[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

// Run as an ES module (top-level await). The apiKey is a placeholder for a local server.
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "sk-local" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen3-8B",
  messages: [{ role: "user", content: "Hello!" }]
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen2.5-7B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-32B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-14B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-7B-Instruct	Code generation, code reasoning	128K	Text
Qwen2.5-Coder-32B-Instruct	Code generation, fixing, agents	128K	Text
Qwen3-235B-A22B	Reasoning, coding, agentic tasks	128K	Text
Qwen3-30B-A3B-Instruct-2507	General chat, reasoning, agents	256K	Text
Qwen2.5-72B-Instruct	General chat, coding, reasoning	128K	Text
Qwen3-4B-Thinking-2507	Reasoning, math, coding	256K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Apache 2.0
Parameters: 8.2B (dense, 36 layers)
Context length: 32K native, 128K with YaRN
Languages: 100+ languages and dialects
Minimum hardware: ~8 GB VRAM at 4-bit
Strengths: switchable reasoning, coding, agentic tool use

Overview

Qwen3-8B is a dense, 8.2-billion-parameter language model from Alibaba's Qwen team, part of the Qwen3 family released in 2025. It has 36 layers and uses grouped-query attention with 32 query heads and 8 key/value heads. Like the rest of the Qwen3 lineup, it is fine-tuned from a base checkpoint (Qwen3-8B-Base) for chat, reasoning, and agentic use. The headline feature is a single model that switches between a thinking mode for complex problems and a non-thinking mode for fast everyday dialogue.
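The head counts above imply that each key/value head is shared by a group of query heads. The following is a minimal sketch of that mapping, not the model's actual attention code: the head counts come from the text, while the contiguous grouping (query heads 0 to 3 share KV head 0, and so on) is the common convention and is assumed here. The function names are hypothetical.

```python
# Illustrative sketch of grouped-query attention (GQA) head sharing.
# Head counts are taken from the overview; contiguous grouping is an assumption.
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8
GROUP_SIZE = NUM_QUERY_HEADS // NUM_KV_HEADS  # query heads per KV head


def kv_head_for(query_head: int) -> int:
    """Return the index of the KV head shared by the given query head."""
    if not 0 <= query_head < NUM_QUERY_HEADS:
        raise ValueError(f"query_head must be in [0, {NUM_QUERY_HEADS})")
    return query_head // GROUP_SIZE


def query_heads_for(kv_head: int) -> list[int]:
    """Return the query heads that read from the given KV head."""
    if not 0 <= kv_head < NUM_KV_HEADS:
        raise ValueError(f"kv_head must be in [0, {NUM_KV_HEADS})")
    start = kv_head * GROUP_SIZE
    return list(range(start, start + GROUP_SIZE))


if __name__ == "__main__":
    print("query heads per KV head:", GROUP_SIZE)
    print("KV head for query head 31:", kv_head_for(31))
    print("query heads sharing KV head 2:", query_heads_for(2))
```

Because only the 8 key/value heads are cached during generation, this sharing is what keeps the KV cache four times smaller than it would be with one KV head per query head.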

What it's good at

In thinking mode, Qwen3-8B handles math, code generation, and multi-step logical reasoning, and the Qwen team reports it surpasses the earlier QwQ and Qwen2.5-instruct models on those tasks. It was trained with agent capabilities in mind, so it integrates with external tools and function calls in both modes and performs well on tool-use benchmarks for its size. It supports more than 100 languages and dialects, with solid multilingual instruction-following and translation. For general chat, role-play, and creative writing, the non-thinking mode gives quicker responses without the reasoning overhead.
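When thinking mode is on, an application usually wants to separate the reasoning trace from the final answer. The sketch below assumes the reasoning is wrapped in <think>...</think> tags and that those tags survive decoding; that is my understanding of Qwen3's output format, not something this page states, so verify it against your own outputs. The helper name split_thinking is hypothetical.

```python
# Minimal sketch: split decoded model output into (reasoning, answer).
# Assumption: reasoning is delimited by <think> ... </think> in the decoded text.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no closing tag is found."""
    head, sep, tail = text.partition(THINK_CLOSE)
    if not sep:
        # No closing tag: treat the whole text as the answer (non-thinking mode).
        return "", text.strip()
    reasoning = head.replace(THINK_OPEN, "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    sample = "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."
    reasoning, answer = split_thinking(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Partitioning on the closing tag rather than matching both tags keeps the helper tolerant of outputs where the opening tag was part of the prompt and is missing from the generated text.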

Running locally

At 8.2B parameters, the model is approachable on consumer hardware. A 4-bit quant (Q4_K_M) is around 5 GB and runs on an 8 GB GPU; Q8 needs roughly 9 GB. CPU-only inference works at Q4_K_M with 16 GB of RAM at a few tokens per second. The model runs in Hugging Face transformers (4.51.0 or newer), vLLM, llama.cpp, and Ollama (ollama run qwen3:8b). Native context is 32K tokens; YaRN scaling extends it to 131,072 tokens (128K) at the cost of extra KV-cache memory, so it is best enabled only when you need it.
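The quantized sizes quoted above can be roughly reproduced from the parameter count and an average number of bits per weight. This is a back-of-the-envelope sketch: the 8.2B parameter count is from the text, the 8.5 bits for Q8_0 follows from its block layout (32 eight-bit weights plus one 16-bit scale), and the 4.85 bits for Q4_K_M is an approximate average that I am assuming. Real file sizes vary slightly by tensor mix.

```python
# Rough estimate of weight size from parameter count and bits per weight.
PARAMS = 8.2e9  # parameter count from the text

# Approximate average bits per weight; the Q4_K_M value is an assumption.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weight_size_gb(PARAMS, bits):.1f} GB")
```

These figures cover the weights only; the KV cache and runtime buffers come on top, which is why an 8 GB card is the practical floor for the roughly 5 GB 4-bit quant.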

License

Qwen3-8B is released under Apache 2.0. That allows free commercial use, modification, and redistribution, with the standard requirement to keep the license and copyright notices intact. The weights are hosted openly on Hugging Face.

Frequently asked questions

What is Qwen3-8B?

Qwen3-8B is an 8.2-billion-parameter dense language model from Alibaba's Qwen team, released as part of the Qwen3 generation. It is a causal language model post-trained for chat, reasoning, and agentic tool use. A defining feature is its dual-mode design: it can switch between a thinking mode for math, coding, and logic, and a non-thinking mode for fast general dialogue, all within the same checkpoint.
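When the model is served behind an OpenAI-compatible endpoint, the mode switch has to travel with the request. The sketch below builds such a request body in plain Python. It assumes the server forwards a chat_template_kwargs field to the chat template, which I believe vLLM does; other servers may ignore it or expect a different field, so treat the field name as an assumption. The helper build_chat_request is hypothetical.

```python
import json


def build_chat_request(prompt: str, thinking: bool, model: str = "Qwen/Qwen3-8B") -> dict:
    """Build a chat-completions request body with the thinking switch set."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server passes these kwargs to the chat template,
        # mirroring enable_thinking in tokenizer.apply_chat_template.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


if __name__ == "__main__":
    fast = build_chat_request("Hello!", thinking=False)
    print(json.dumps(fast, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST to /v1/chat/completions on the local server.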

What hardware do I need to run Qwen3-8B?

At 4-bit quantization (Q4_K_M) the model weights take roughly 5 GB, so a GPU with 8 GB of VRAM can run it comfortably at standard context lengths. Q8 needs about 9 GB. CPU-only inference works at Q4_K_M with 16 GB of system RAM, though throughput drops to a few tokens per second. Extending context toward 128K with YaRN adds several GB of KV-cache memory on top of these figures.
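How much the KV cache adds depends on how many tokens are actually held in it. As a rough sketch, the cache stores one key and one value vector per layer, per KV head, per token. The 36 layers and 8 KV heads come from this page; the head dimension of 128 and the 16-bit cache precision are assumptions, and a quantized cache would be smaller.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int = 36,          # from the page
    kv_heads: int = 8,         # from the page
    head_dim: int = 128,       # assumption, not stated on the page
    bytes_per_value: int = 2,  # assumption: 16-bit cache
) -> int:
    """Return the approximate KV-cache size in bytes for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


if __name__ == "__main__":
    for tokens in (4_096, 32_768, 131_072):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>7} tokens: ~{gib:.2f} GiB")
```

Under these assumptions the cache grows linearly with the tokens in use, so a long window costs memory only when it is actually filled.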

What is the context length of Qwen3-8B?

Qwen3-8B has a native context window of 32,768 tokens. Using YaRN rope scaling it can be extended to 131,072 tokens (128K) for long-document and long-context tasks. The Qwen team recommends enabling YaRN only when you actually need the longer window, because it can slightly reduce quality on inputs shorter than 32K and increases KV-cache memory use.
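The scaling factor follows directly from the two window sizes: 131,072 / 32,768 = 4. The sketch below derives a rope-scaling entry from a target window. The key names (rope_type, factor, original_max_position_embeddings) reflect my understanding of the entry used in the model's config.json and should be checked against the official model card; the helper yarn_rope_scaling is hypothetical.

```python
NATIVE_CONTEXT = 32_768  # native window from the text


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Build a YaRN rope-scaling entry for the requested context window."""
    if target_context <= native_context:
        # Per the text, YaRN should stay off unless the longer window is needed.
        raise ValueError("target fits in the native window; leave YaRN disabled")
    return {
        "rope_type": "yarn",  # assumed key names, verify against the model card
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


if __name__ == "__main__":
    print(yarn_rope_scaling(131_072))
```

Raising an error for targets within the native window mirrors the recommendation above to leave scaling off for shorter inputs.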

Is Qwen3-8B free for commercial use?

Yes. Qwen3-8B is released under the Apache 2.0 license, which permits free use, modification, redistribution, and commercial deployment without paying royalties. The weights are openly available on Hugging Face, so you can download and run the model on your own hardware. Apache 2.0 requires you to preserve the license and copyright notices in redistributions.

Which languages does Qwen3-8B support?

Qwen3-8B supports over 100 languages and dialects, with strong multilingual instruction-following and translation ability. Coverage spans major European, Asian, and Middle Eastern languages, making the model usable for cross-lingual chat and translation well beyond English and Chinese.