Overview
Qwen3-8B is a dense, 8.2-billion-parameter language model from Alibaba's Qwen team, part of the Qwen3 family released in 2025. It has 36 layers and uses grouped-query attention with 32 query heads and 8 key/value heads. Like the rest of the Qwen3 lineup, it is fine-tuned from a base checkpoint (Qwen3-8B-Base) for chat, reasoning, and agentic use. The headline feature is a single model that switches between a thinking mode for complex problems and a non-thinking mode for fast everyday dialogue.
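To make the grouped-query attention figures concrete, the sketch below estimates how much KV cache a single token occupies with 8 key/value heads, compared with a hypothetical layout in which all 32 heads kept their own keys and values. The layer and head counts come from the description above. The head dimension of 128 and the 2-byte cache entries are assumptions for illustration and should be checked against the model's config.

```python
# Architecture figures quoted in the overview above.
NUM_LAYERS = 36
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8

# Assumptions, not stated in the text above.
HEAD_DIM = 128        # assumed per-head dimension
BYTES_PER_VALUE = 2   # assumed fp16/bf16 cache entries


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache for one token: one key and one value vector
    per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Grouped-query attention: only the 8 KV heads are cached.
gqa = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)

# Hypothetical multi-head layout: every query head has its own K and V.
mha = kv_bytes_per_token(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, BYTES_PER_VALUE)

print(f"GQA: {gqa / 1024:.0f} KiB per token")
print(f"MHA (hypothetical): {mha / 1024:.0f} KiB per token")
print(f"reduction: {mha // gqa}x")
```

Whatever the true head dimension is, the ratio depends only on the head counts: sharing each key/value head across four query heads cuts the cache to a quarter.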
What it's good at
In thinking mode, Qwen3-8B handles math, code generation, and multi-step logical reasoning, and the Qwen team reports it surpasses the earlier QwQ and Qwen2.5-instruct models on those tasks. It was trained with agent capabilities in mind, so it integrates with external tools and function calls in both modes and performs well on tool-use benchmarks for its size. It supports more than 100 languages and dialects, with solid multilingual instruction-following and translation. For general chat, role-play, and creative writing, the non-thinking mode gives quicker responses without the reasoning overhead.
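When the model runs in thinking mode, the reasoning usually has to be separated from the final answer before the answer is shown to a user or passed to a tool. The sketch below assumes the decoded output wraps its reasoning in <think> and </think> tags, which the text above does not state, so verify that against the chat template in use. The helper name split_thinking is hypothetical.

```python
from typing import Tuple

# Assumed delimiters around the reasoning segment of the decoded output.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_thinking(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from decoded model output.

    If no closing tag is found, the output is treated as a plain
    non-thinking reply and the reasoning part is empty.
    """
    head, sep, tail = text.partition(THINK_CLOSE)
    if not sep:
        return "", text.strip()
    # The opening tag may be missing if the template placed it in the prompt.
    reasoning = head.replace(THINK_OPEN, "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(sample)
print(repr(reasoning))
print(repr(answer))
```

In Hugging Face transformers, the mode itself is typically selected by passing enable_thinking=True or enable_thinking=False to tokenizer.apply_chat_template. A splitter like this one then works for both modes, because a reply without the closing tag passes through unchanged.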
Running locally
At 8.2B parameters the model is approachable on consumer hardware. A 4-bit quant (Q4_K_M) is around 5 GB and runs on an 8 GB GPU; an 8-bit quant (Q8_0) needs roughly 9 GB. CPU-only inference works at Q4_K_M with 16 GB of RAM, at a few tokens per second. The model runs in Hugging Face transformers (4.51.0 or newer), vLLM, llama.cpp, and Ollama (with the command: ollama run qwen3:8b). Native context is 32K tokens; YaRN scaling extends it to 131K at the cost of extra KV-cache memory, so it is best enabled only when a workload actually needs the longer window.
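The sizing figures above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement. The average bits per weight for each quant, the head dimension of 128, and the 2-byte KV-cache entries are all assumed values; real GGUF files and runtimes vary. The rope_scaling dictionary follows the shape commonly used in a Hugging Face config.json for YaRN, and its key names should be confirmed against the model card before use.

```python
PARAMS = 8.2e9               # parameter count quoted above
NATIVE_CONTEXT = 32_768      # "32K" native window
EXTENDED_CONTEXT = 131_072   # "131K" window with YaRN
NUM_LAYERS, NUM_KV_HEADS = 36, 8

# Assumptions for illustration only.
HEAD_DIM = 128                                   # assumed per-head dimension
KV_BYTES = 2                                     # assumed fp16/bf16 cache
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}  # assumed averages
GB = 1e9


def weight_gb(quant: str) -> float:
    """Approximate size of the quantized weights in GB."""
    return PARAMS * BITS_PER_WEIGHT[quant] / 8 / GB


def kv_cache_gb(context_tokens: int) -> float:
    """Approximate KV-cache size in GB for one sequence of this length."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES
    return per_token * context_tokens / GB


# YaRN scaling factor needed to stretch the native window to the long one.
yarn_factor = EXTENDED_CONTEXT / NATIVE_CONTEXT

# Assumed config fragment; verify key names against the model card.
rope_scaling = {
    "rope_type": "yarn",
    "factor": yarn_factor,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{weight_gb(quant):.1f} GB of weights")
for ctx in (8_192, NATIVE_CONTEXT, EXTENDED_CONTEXT):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
print(rope_scaling)
```

Under these assumptions, a full 131K-token sequence needs four times the cache of a full 32K-token one, which is why the weights alone do not determine whether a long-context run fits on a given GPU.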
License
Qwen3-8B is released under Apache 2.0. That allows free commercial use, modification, and redistribution, with the standard requirement to keep the license and copyright notices intact. The weights are hosted openly on Hugging Face.


