Overview
Qwen3-4B-Thinking-2507 is a 4-billion-parameter causal language model from Alibaba's Qwen team, released in July 2025 as part of the Qwen3 series. It carries 4.0B total parameters (3.6B excluding embeddings) across 36 layers and uses grouped-query attention with 32 query heads and 8 key/value heads. Unlike a general instruct model, this build runs only in thinking mode: the chat template injects a <think> tag so the model always produces an internal reasoning trace before its answer. The 2507 update extends both the depth of that reasoning and the native context window to 262,144 tokens.
What it's good at
For its size the model posts unusually strong reasoning numbers. It scores 81.3 on AIME25 and 55.5 on HMMT25 for competition math, 74.0 on MMLU-Pro and 65.8 on GPQA for knowledge, and 55.2 on LiveCodeBench v6 for coding. Agentic tool use is a clear focus, with 71.2 on BFCL-v3 and large gains across the TAU retail, airline, and telecom benchmarks versus the original Qwen3-4B. It also handles multilingual instruction following (77.3 on MultiIF) and works well with the Qwen-Agent framework for MCP and function-calling workflows.
Running locally
The model needs transformers 4.51.0 or newer, or an OpenAI-compatible server through vLLM 0.8.5+ or SGLang 0.4.6+. At 4-bit quantization it fits in roughly 4-6 GB of memory, so it runs on an 8 GB consumer GPU or an Apple Silicon Mac, and full bf16 weights are about 8 GB. Ollama, LM Studio, MLX-LM, and llama.cpp all support it. Because the model reasons at length, Qwen recommends a context above 131K and a 32,768-token output budget (81,920 for hard math or coding), with sampling at temperature 0.6, top-p 0.95, top-k 20.
License
Qwen3-4B-Thinking-2507 is released under Apache 2.0. That allows commercial use, modification, and redistribution without royalties, and it does not require sharing your own fine-tuned weights. Keeping the license notice is the main obligation.


