Qwen3-4B-Thinking-2507

Updated: 27.06.2026
Thinking, Reasoning, Code, Tools, Multilingual

A 4B reasoning-focused LLM from Alibaba's Qwen3 series that always thinks step by step, with a 256K context and strong math, coding, and agentic scores.

Install

pip install -U "transformers>=4.51.0"
huggingface-cli download Qwen/Qwen3-4B-Thinking-2507
# or via Ollama:
ollama run qwen3:4b-thinking-2507

cURL

# Assumes an OpenAI-compatible server (e.g. vLLM or SGLang) is listening on port 8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B-Thinking-2507",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "temperature": 0.6,
    "top_p": 0.95
  }'

Python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Thinking-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Render the conversation with the model's chat template
messages = [{"role": "user", "content": "Solve: what is 17 * 23?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Leave room for the reasoning trace as well as the final answer
out = model.generate(**inputs, max_new_tokens=32768)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

JavaScript

import OpenAI from "openai";

// Points at a local OpenAI-compatible server; the API key is a placeholder
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "EMPTY" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen3-4B-Thinking-2507",
  messages: [{ role: "user", content: "Explain the proof of the Pythagorean theorem." }],
  temperature: 0.6,
  top_p: 0.95,
});
console.log(res.choices[0].message.content);

More models

Name                          Size / Usage                      Context  Input
Qwen3.6-35B-A3B               -                                 256K     Text, Image
Qwen3.6-27B                   -                                 256K     Text, Image
Qwen3-8B                      Reasoning, coding, agentic chat   128K     Text
Qwen2.5-7B-Instruct           General chat, coding, reasoning   128K     Text
Qwen2.5-32B-Instruct          General chat, coding, reasoning   128K     Text
Qwen2.5-14B-Instruct          General chat, coding, reasoning   128K     Text
Qwen2.5-Coder-7B-Instruct     Code generation, code reasoning   128K     Text
Qwen2.5-Coder-32B-Instruct    Code generation, fixing, agents   128K     Text
Qwen3-235B-A22B               Reasoning, coding, agentic tasks  128K     Text
Qwen3-30B-A3B-Instruct-2507   General chat, reasoning, agents   256K     Text
Qwen2.5-72B-Instruct          General chat, coding, reasoning   128K     Text
WebWorld-8B                   Web agents, multimodal reasoning  40K      Text, Image
MiniCPM-V 4.6                 5213 GB421K                                Text, Image
anima                         421 GB31K                                  Text
Qwen3-Coder-30B-A3B-Instruct  -                                 256K     Text
Qwen3-30B-A3B                 -                                 128K     Text
Qwen3-14B                     -                                 128K     Text
Qwen3-32B                     -                                 128K     Text

At a glance

  • License: Apache 2.0
  • Parameters: 4.0B (3.6B non-embedding), 36 layers
  • Context length: 256K tokens (262,144 native)
  • Mode: thinking only, always emits a reasoning trace
  • Minimum hardware: ~4-6 GB VRAM at 4-bit
  • Strengths: math, science, coding, tool use, multilingual

Overview

Qwen3-4B-Thinking-2507 is a 4-billion-parameter causal language model from Alibaba's Qwen team, released in July 2025 as part of the Qwen3 series. It carries 4.0B total parameters (3.6B excluding embeddings) across 36 layers and uses grouped-query attention with 32 query heads and 8 key/value heads. Unlike a general instruct model, this build runs only in thinking mode: the chat template injects a <think> tag so the model always produces an internal reasoning trace before its answer. The 2507 update extends both the depth of that reasoning and the native context window to 262,144 tokens.
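Because the chat template already supplies the opening <think> tag, the generated text usually contains only the closing </think> marker, followed by the final answer. The following is a minimal sketch of separating the two, assuming the decoded string still contains the literal </think> text; the helper name split_reasoning is hypothetical and not part of any library.

```python
def split_reasoning(decoded: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded model output into (reasoning, answer).

    Assumption: the reasoning trace ends at the last occurrence of close_tag.
    If the tag is missing (for example, generation hit the token limit while
    the model was still reasoning), everything is treated as reasoning.
    """
    head, sep, tail = decoded.rpartition(close_tag)
    if not sep:
        # rpartition puts the whole string in `tail` when the tag is absent
        reasoning, answer = tail, ""
    else:
        reasoning, answer = head, tail
    # Drop an opening tag if the runtime happened to echo it back
    reasoning = reasoning.strip().removeprefix("<think>").strip()
    return reasoning, answer.strip()


if __name__ == "__main__":
    sample = "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51\n</think>\n\n391"
    reasoning, answer = split_reasoning(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

An empty answer from this helper is a useful signal that the output budget was too small for the problem.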

What it's good at

For its size the model posts unusually strong reasoning numbers. It scores 81.3 on AIME25 and 55.5 on HMMT25 for competition math, 74.0 on MMLU-Pro and 65.8 on GPQA for knowledge, and 55.2 on LiveCodeBench v6 for coding. Agentic tool use is a clear focus, with 71.2 on BFCL-v3 and large gains across the TAU retail, airline, and telecom benchmarks versus the original Qwen3-4B. It also handles multilingual instruction following (77.3 on MultiIF) and works well with the Qwen-Agent framework for MCP and function-calling workflows.

Running locally

The model needs transformers 4.51.0 or newer, or an OpenAI-compatible server through vLLM 0.8.5+ or SGLang 0.4.6+. At 4-bit quantization it fits in roughly 4-6 GB of memory, so it runs on an 8 GB consumer GPU or an Apple Silicon Mac, and full bf16 weights are about 8 GB. Ollama, LM Studio, MLX-LM, and llama.cpp all support it. Because the model reasons at length, Qwen recommends a context above 131K and a 32,768-token output budget (81,920 for hard math or coding), with sampling at temperature 0.6, top-p 0.95, top-k 20.
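The recommended settings above can be collected into a single request body. This is a sketch rather than an official client: build_request is a hypothetical helper, and it assumes the server accepts a top-level top_k field, which is an extension that servers such as vLLM and SGLang support but the standard OpenAI API does not define.

```python
import json

MODEL = "Qwen/Qwen3-4B-Thinking-2507"


def build_request(prompt: str, hard_problem: bool = False) -> dict:
    """Build a chat-completions payload using the recommended sampling settings.

    hard_problem=True switches to the larger 81,920-token output budget
    suggested for difficult math or coding tasks.
    """
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,  # assumption: server-side extension, not a standard OpenAI field
        "max_tokens": 81920 if hard_problem else 32768,
    }


if __name__ == "__main__":
    # POST the printed JSON to /v1/chat/completions on your local server
    print(json.dumps(build_request("Prove that sqrt(2) is irrational."), indent=2))
```

Keeping the payload construction separate from the HTTP call makes it easy to reuse the same settings with curl, the openai client, or any other HTTP library.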

License

Qwen3-4B-Thinking-2507 is released under Apache 2.0. That allows commercial use, modification, and redistribution without royalties, and it does not require sharing your own fine-tuned weights. Keeping the license notice is the main obligation.

Frequently asked questions

What is Qwen3-4B-Thinking-2507?

Qwen3-4B-Thinking-2507 is a 4-billion-parameter causal language model from Alibaba's Qwen team, released in July 2025 (the "2507" in its name). It runs exclusively in thinking mode, generating an explicit reasoning trace inside <think> tags before the final answer. It is built for complex reasoning across math, science, coding, and agentic tool use.

What hardware do I need to run it?

At 4B parameters, the model runs on modest hardware. A 4-bit quantized GGUF needs roughly 4-6 GB of VRAM or unified memory, so it works on an 8 GB GPU or an Apple Silicon Mac. Full bf16 weights are about 8 GB. Because the model reasons at length, the team recommends a context window above 131K, which raises KV-cache memory use during long generations.
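The memory figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses the 4.0B parameters, 36 layers, and 8 key/value heads stated on this page; the head dimension of 128 and a 16-bit KV cache are assumptions not given here, and real runtimes add overhead for activations and buffers, so treat the results as rough lower bounds.

```python
GIB = 1024 ** 3


def weight_bytes(params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, ignoring quantization metadata."""
    return params * bits_per_param / 8


def kv_cache_bytes(tokens: int, layers: int = 36, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV-cache size: one key and one value vector per layer, KV head, and token.

    head_dim=128 and bytes_per_value=2 (16-bit cache) are assumptions.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


if __name__ == "__main__":
    print(f"bf16 weights:  {weight_bytes(4.0e9, 16) / 1e9:.1f} GB")
    print(f"4-bit weights: {weight_bytes(4.0e9, 4) / 1e9:.1f} GB")
    for tokens in (32_768, 131_072, 262_144):
        print(f"KV cache at {tokens:>7} tokens: {kv_cache_bytes(tokens) / GIB:.1f} GiB")
```

Under these assumptions the cache costs about 144 KiB per token, so very long contexts can need far more memory than the weights themselves; quantizing the KV cache or using a shorter context reduces this.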

Is Qwen3-4B-Thinking-2507 free to use commercially?

Yes. Qwen3-4B-Thinking-2507 is released under the Apache 2.0 license, which permits free use, modification, redistribution, and commercial deployment without royalties. You can download the weights from Hugging Face and run them locally through transformers, vLLM, SGLang, Ollama, LM Studio, or llama.cpp.

What is the context length?

The model supports a native context length of 262,144 tokens (256K). The 2507 release specifically improves long-context understanding over the original Qwen3-4B. The Qwen team recommends keeping the context above 131,072 tokens when possible, since the model often needs long token sequences for its extended reasoning.

How does it differ from Qwen3-4B-Instruct-2507?

The Thinking variant runs only in reasoning mode and always emits a chain of thought before answering, which makes it stronger on hard math, science, and coding problems (for example, 81.3 on AIME25). The Instruct-2507 sibling responds directly without a visible thinking trace, so it is faster and better suited to straightforward chat and instruction following where deep reasoning is not needed.