Qwen3-4B-Thinking-2507

Thinking, Reasoning, Code, Tools, Multilingual

A 4B reasoning-focused LLM from Alibaba's Qwen3 series that always thinks step by step, with a 256K context and strong math, coding, and agentic scores.

pip install -U "transformers>=4.51.0"
huggingface-cli download Qwen/Qwen3-4B-Thinking-2507
# or via Ollama:
ollama run qwen3:4b-thinking-2507

# Query an OpenAI-compatible server (for example vLLM or SGLang) listening on localhost:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B-Thinking-2507",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "temperature": 0.6,
    "top_p": 0.95
  }'

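from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Thinking-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" needs the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Solve: what is 17 * 23?"}]
# Render the conversation to a prompt string with the model's chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Leave room for the reasoning trace as well as the final answer
out = model.generate(**inputs, max_new_tokens=32768)
# Decode only the newly generated tokens (reasoning trace followed by the answer)
print(tokenizer.decode(out[0][len(inputs.input_ids[0]):], skip_special_tokens=True))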

// Run as an ES module (top-level await); requires the "openai" npm package
import OpenAI from "openai";
// Point the client at a local OpenAI-compatible server; "EMPTY" is a placeholder key
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "EMPTY" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen3-4B-Thinking-2507",
  messages: [{ role: "user", content: "Explain the proof of the Pythagorean theorem." }],
  temperature: 0.6,
  top_p: 0.95,
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen3-8B	Reasoning, coding, agentic chat	128K	Text
Qwen2.5-7B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-32B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-14B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-7B-Instruct	Code generation, code reasoning	128K	Text
Qwen2.5-Coder-32B-Instruct	Code generation, fixing, agents	128K	Text
Qwen3-235B-A22B	Reasoning, coding, agentic tasks	128K	Text
Qwen3-30B-A3B-Instruct-2507	General chat, reasoning, agents	256K	Text
Qwen2.5-72B-Instruct	General chat, coding, reasoning	128K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Apache 2.0
Parameters: 4.0B (3.6B non-embedding), 36 layers
Context length: 256K tokens (262,144 native)
Mode: thinking only, always emits a reasoning trace
Minimum hardware: ~4-6 GB VRAM at 4-bit
Strengths: math, science, coding, tool use, multilingual

Overview

Qwen3-4B-Thinking-2507 is a 4-billion-parameter causal language model from Alibaba's Qwen team, released in July 2025 as part of the Qwen3 series. It has 4.0B total parameters (3.6B excluding embeddings) across 36 layers and uses grouped-query attention with 32 query heads and 8 key/value heads. Unlike a general instruct model, this build runs only in thinking mode: the chat template injects a <think> tag so the model always produces an internal reasoning trace before its answer. The 2507 update deepens that reasoning and extends the native context window to 262,144 tokens.
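Because the chat template already opens the reasoning block, the generated text usually contains only a closing </think> tag. The following is a minimal sketch of separating the trace from the answer, assuming the decoded output still contains the literal </think> marker. The function name split_reasoning is hypothetical and not part of any library.

```python
def split_reasoning(generated: str) -> tuple[str, str]:
    """Split decoded model output into (reasoning, answer).

    Assumes the reasoning trace ends at the last literal "</think>" marker.
    An opening "<think>" may be absent because the chat template supplies it.
    """
    marker = "</think>"
    head, sep, tail = generated.rpartition(marker)
    if not sep:
        # No marker found: treat the whole text as the answer.
        return "", generated.strip()
    # Drop an opening tag if the server echoed one back.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning(
    "17 * 23 = 17 * 20 + 17 * 3 = 391\n</think>\n\n17 * 23 = 391"
)
print(answer)
```

Some servers return the trace in a separate response field instead; in that case no splitting is needed.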

What it's good at

For its size the model posts unusually strong reasoning numbers. It scores 81.3 on AIME25 and 55.5 on HMMT25 for competition math, 74.0 on MMLU-Pro and 65.8 on GPQA for knowledge, and 55.2 on LiveCodeBench v6 for coding. Agentic tool use is a clear focus, with 71.2 on BFCL-v3 and large gains across the TAU retail, airline, and telecom benchmarks versus the original Qwen3-4B. It also handles multilingual instruction following (77.3 on MultiIF) and works well with the Qwen-Agent framework for MCP and function-calling workflows.

Running locally

The model needs transformers 4.51.0 or newer, or an OpenAI-compatible server through vLLM 0.8.5+ or SGLang 0.4.6+. Ollama, LM Studio, MLX-LM, and llama.cpp all support it as well. At 4-bit quantization it fits in roughly 4-6 GB of memory, so it runs on an 8 GB consumer GPU or an Apple Silicon Mac; full bf16 weights are about 8 GB. Because the model reasons at length, Qwen recommends a context length above 131,072 tokens and a 32,768-token output budget (81,920 for hard math or coding), with sampling at temperature 0.6, top-p 0.95, and top-k 20.
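The recommended settings above can be collected into a single request payload. This is a sketch under assumptions: build_chat_request is a hypothetical helper, and passing "top_k" as a top-level request field is an assumption about the server, since it is not part of the standard OpenAI chat-completions schema.

```python
import json

# Sampling values from the recommendations above. "top_k" as a request field
# is an assumption about the server; check that your server accepts it.
RECOMMENDED_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}


def build_chat_request(prompt: str, hard_problem: bool = False) -> dict:
    """Build a chat-completions payload with the recommended settings."""
    return {
        "model": "Qwen/Qwen3-4B-Thinking-2507",
        "messages": [{"role": "user", "content": prompt}],
        # 32,768 output tokens by default, 81,920 for hard math or coding.
        "max_tokens": 81920 if hard_problem else 32768,
        **RECOMMENDED_SAMPLING,
    }


payload = build_chat_request("Prove that sqrt(2) is irrational.", hard_problem=True)
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent as the request body to a local server's /v1/chat/completions endpoint.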

License

Qwen3-4B-Thinking-2507 is released under Apache 2.0. That allows commercial use, modification, and redistribution without royalties, and it does not require sharing your own fine-tuned weights. Keeping the license notice is the main obligation.

Desktop

macOS (M1 or better)
Windows (x64)
Linux (x86_64)

Frequently asked questions

What is Qwen3-4B-Thinking-2507?

Qwen3-4B-Thinking-2507 is a 4-billion-parameter causal language model from Alibaba's Qwen team, released in July 2025 (the "2507" in its name). It runs exclusively in thinking mode, generating an explicit reasoning trace inside <think> tags before the final answer. It is built for complex reasoning across math, science, coding, and agentic tool use.

What hardware do I need to run it?

At 4B parameters, the model runs on modest hardware. A 4-bit quantized GGUF needs roughly 4-6 GB of VRAM or unified memory, so it works on an 8 GB GPU or an Apple Silicon Mac. Full bf16 weights are about 8 GB. Because the model reasons at length, the team recommends a context window above 131,072 tokens, which raises KV-cache memory use during long generations.
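The memory figures above can be approximated with back-of-the-envelope arithmetic. The sketch below uses the parameter count, layer count, and key/value head count stated on this page. The head dimension of 128 and the 16-bit cache precision are assumptions that do not appear on this page, and the helper names are hypothetical. Real usage adds runtime overhead on top of these numbers.

```python
def weight_bytes(params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in bytes."""
    return params * bits_per_param / 8


def kv_cache_bytes(tokens: int, layers: int = 36, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size for one sequence, in bytes.

    head_dim=128 and bytes_per_value=2 (16-bit cache) are assumptions.
    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GB = 1e9
print(f"bf16 weights:  {weight_bytes(4.0e9, 16) / GB:.1f} GB")
print(f"4-bit weights: {weight_bytes(4.0e9, 4) / GB:.1f} GB")
for n in (32_768, 131_072, 262_144):
    print(f"KV cache at {n:>7} tokens: {kv_cache_bytes(n) / GB:.1f} GB")
```

Under these assumptions the cache grows linearly with sequence length, so at long contexts it can exceed the size of the weights.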

Is Qwen3-4B-Thinking-2507 free for commercial use?

Yes. Qwen3-4B-Thinking-2507 is released under the Apache 2.0 license, which permits free use, modification, redistribution, and commercial deployment without royalties. You can download the weights from Hugging Face and run them locally through transformers, vLLM, SGLang, Ollama, LM Studio, or llama.cpp.

What is the context length?

The model supports a native context length of 262,144 tokens (256K). The 2507 release specifically improves long-context understanding over the original Qwen3-4B. The Qwen team recommends keeping the context above 131,072 tokens when possible, since the model often needs long token sequences for its extended reasoning.
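Since the prompt and the generated tokens share one context window, the output budget limits how long a prompt can be. A small sketch of that arithmetic, using the figures stated on this page (max_prompt_tokens is a hypothetical helper):

```python
NATIVE_CONTEXT = 262_144  # native context length stated above


def max_prompt_tokens(output_budget: int = 32_768,
                      context: int = NATIVE_CONTEXT) -> int:
    """Largest prompt, in tokens, that still leaves room for the output budget."""
    if output_budget >= context:
        raise ValueError("output budget must be smaller than the context window")
    return context - output_budget


print(max_prompt_tokens())        # default 32,768-token output budget
print(max_prompt_tokens(81_920))  # larger budget for hard math or coding
```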

How does it differ from Qwen3-4B-Instruct-2507?

The Thinking variant runs only in reasoning mode and always emits a chain of thought before answering, which makes it stronger on hard math, science, and coding problems (for example, 81.3 on AIME25). The Instruct-2507 sibling responds directly without a visible thinking trace, so it is faster and better suited to straightforward chat and instruction following where deep reasoning is not needed.