Qwen3-30B-A3B-Instruct-2507

Reasoning

Code

Multilingual

Tools

Run

A 30.5B-parameter (3.3B active) MoE instruct model from Alibaba's Qwen3 series with 256K context and strong reasoning, coding, and tool use.

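# Install Transformers and download the weights from Hugging Face
pip install -U "transformers>=4.51.0"
huggingface-cli download Qwen/Qwen3-30B-A3B-Instruct-2507
# Or serve an OpenAI-compatible endpoint with vLLM (installed separately).
# Lower --max-model-len (for example to 32768) if you run out of memory.
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --max-model-len 262144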

# Query the vLLM endpoint started above
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts in one paragraph."}],
    "temperature": 0.7,
    "top_p": 0.8
  }'

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"

# Load the tokenizer and model; device_map="auto" spreads the weights over the available devices
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Render the conversation with the model's chat template and tokenize it
messages = [{"role": "user", "content": "Give me a short intro to MoE models."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then decode only the new tokens (everything after the prompt)
out = model.generate(**inputs, max_new_tokens=16384)
print(tokenizer.decode(out[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

// Node.js client (ES module) for the local vLLM endpoint; requires the "openai" npm package
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "EMPTY" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen3-30B-A3B-Instruct-2507",
  messages: [{ role: "user", content: "Explain mixture-of-experts in one paragraph." }],
  temperature: 0.7,
  top_p: 0.8,
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen3-8B	Reasoning, coding, agentic chat	128K	Text
Qwen2.5-7B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-32B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-14B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-7B-Instruct	Code generation, code reasoning	128K	Text
Qwen2.5-Coder-32B-Instruct	Code generation, fixing, agents	128K	Text
Qwen3-235B-A22B	Reasoning, coding, agentic tasks	128K	Text
Qwen2.5-72B-Instruct	General chat, coding, reasoning	128K	Text
Qwen3-4B-Thinking-2507	Reasoning, math, coding	256K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Apache 2.0
Architecture: Mixture-of-experts, 30.5B total / 3.3B active (128 experts, 8 active)
Context length: 256K tokens native (up to ~1M with Dual Chunk Attention)
Mode: Non-thinking only (no <think> blocks)
Minimum hardware: ~24 GB VRAM at 4-bit
Strengths: reasoning, coding, tool calling, long-context, multilingual chat

Overview

Qwen3-30B-A3B-Instruct-2507 is a July 2025 refresh of Alibaba's Qwen3-30B-A3B model, tuned for instruction following rather than chain-of-thought reasoning. It is a mixture-of-experts model: 30.5B parameters in total, but each token only activates 3.3B of them across 8 of its 128 experts. That keeps inference cheap relative to its size. Unlike some siblings in the Qwen3 line, this checkpoint runs in non-thinking mode only and never emits <think> blocks, so you do not need to toggle a thinking flag.
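To make the "8 of 128 experts" figure concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration only, not Qwen's implementation: the router scores are random stand-ins, the function name route_token is hypothetical, and renormalising the softmax weights over the selected experts is an assumption about the gating scheme.

```python
import math
import random

NUM_EXPERTS = 128  # total experts per MoE layer (from the figures above)
TOP_K = 8          # experts activated per token (from the figures above)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and return (expert_id, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This renormalisation is an assumption made for illustration.
    """
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
routing = route_token(logits)

print(len(routing), "experts selected out of", NUM_EXPERTS)
print("gate weights sum to", round(sum(w for _, w in routing), 6))
```

Only the selected experts run their feed-forward computation for that token, which is why per-token compute tracks the 3.3B active parameters rather than the 30.5B total.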

What it's good at

Compared with the original Qwen3-30B-A3B, the 2507 update posts large gains across the board. It scores 78.4 on MMLU-Pro and 70.4 on GPQA, jumps to 61.3 on AIME25 math, and reaches 90.0 on ZebraLogic, beating much larger models on that logic test. Coding is solid too (83.8 on MultiPL-E, 43.2 on LiveCodeBench v6), and it handles tool calling well, which makes it a practical choice for agent workflows through frameworks like Qwen-Agent. It also improved on open-ended writing and multilingual long-tail knowledge.
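For agent workflows, the client side of tool calling is mostly bookkeeping: declare a tool schema, then run whatever call the model returns and send the result back. The sketch below assumes an OpenAI-compatible endpoint that returns tool calls in the chat-completions format. The get_weather function, the REGISTRY dictionary, and the mocked tool call are all hypothetical, and whether a server actually emits tool calls depends on how it was started, so check your serving framework's documentation.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Tool schema in the OpenAI chat-completions "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(tool_call: dict) -> dict:
    """Run one tool call and build the 'tool' message to send back to the model."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    result = REGISTRY[name](**args)
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}


# Mocked tool call, shaped like one entry of message.tool_calls in a response.
mock_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Hangzhou"}'},
}
reply = dispatch_tool_call(mock_call)
print(reply)
```

In a real loop you would pass TOOLS in the request, append the returned "tool" message to the conversation, and call the endpoint again so the model can write its final answer.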

Running locally

All 30.5B parameters still have to be loaded even though only 3.3B are active per token, so the MoE design cuts compute per token rather than the memory needed for weights. At 4-bit quantization the weights take roughly 18-20 GB, which fits on a single 24 GB GPU at moderate context lengths. You can serve the model with vLLM or SGLang for an OpenAI-compatible API, or run it through Ollama, LM Studio, llama.cpp, or MLX-LM on a workstation. The native 256K context is memory-hungry, and the 1M-token configuration with Dual Chunk Attention needs roughly 240 GB of GPU memory, so most users cap the context length to avoid out-of-memory errors. Qwen recommends sampling with temperature 0.7, top-p 0.8, and top-k 20.
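The memory figures follow from simple arithmetic on the parameter count. The sketch below computes raw weight storage at a few bit widths; the helper name weight_gb is hypothetical, and the result is a lower bound that ignores quantization scales, any tensors kept at higher precision, the KV cache, and runtime buffers. That overhead is a plausible reason practical 4-bit builds land above the raw number.

```python
TOTAL_PARAMS = 30.5e9   # total parameters
ACTIVE_PARAMS = 3.3e9   # parameters used per token


def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes), with no overhead."""
    return num_params * bits_per_weight / 8 / 1e9


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gb(TOTAL_PARAMS, bits):.2f} GB of weights")

# Share of the parameters that take part in computing each token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active per token: {active_fraction:.1%} of parameters")
```

At 4 bits the raw weights come to 15.25 GB, and at 16 bits to 61 GB, which is why unquantized inference does not fit on a single 24 GB card.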

License

The model ships under Apache 2.0. You can use it commercially, modify it, and redistribute it, with no royalty and only the standard requirement to preserve the copyright and license notices. The weights are openly downloadable from Hugging Face.

Frequently asked questions

What is Qwen3-30B-A3B-Instruct-2507?

Qwen3-30B-A3B-Instruct-2507 is an instruction-tuned language model from Alibaba's Qwen team, released in July 2025. It uses a mixture-of-experts (MoE) design with 30.5B total parameters but only 3.3B active per token (128 experts, 8 activated). It runs in non-thinking mode only, so it does not emit <think> reasoning blocks, and it improves on the original Qwen3-30B-A3B in instruction following, math, coding, and tool use.

What hardware do I need to run it?

Because only 3.3B of its 30.5B parameters are active per token, the model needs less compute per token than a dense 30B model, although all of the weights still have to be loaded. At a 4-bit quant the weights fit in roughly 18-20 GB, so a single 24 GB GPU (such as an RTX 4090) can run it with a moderate context window. Running at the full 256K context needs far more memory, and the 1M-token configuration requires about 240 GB of total GPU memory across multiple cards.

Can I use it commercially?

Yes. The model is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution without paying royalties. The weights are published openly on Hugging Face, and the license requires only that you keep the copyright and license notices. You can self-host it or use it through API providers like OpenRouter and Fireworks AI.

How does it differ from Qwen3-Coder-30B-A3B-Instruct?

Both share the same 30B-A3B MoE backbone, but they are tuned for different jobs. Qwen3-30B-A3B-Instruct-2507 is a general-purpose chat model that is strong across reasoning, math, writing, and multilingual tasks. The Coder variant is post-trained specifically for software engineering and agentic coding, so it tends to score higher on code benchmarks and tool-driven repo tasks, while the Instruct model is the better all-rounder.

What is the maximum context length?

The model supports 262,144 tokens (256K) natively. Using Dual Chunk Attention and the MInference sparse-attention technique, it can be extended to roughly 1 million tokens with the provided config_1m.json, though that setup demands about 240 GB of GPU memory. On the RULER long-context benchmark it holds up well past 256K, scoring far better than the original Qwen3-30B-A3B at long ranges.
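A rough way to see why long contexts are memory-hungry is to estimate the key/value cache, which grows linearly with the number of tokens. The sketch below is an estimate under stated assumptions, not a measurement: the layer count, KV-head count, and head dimension are not given on this page and should be checked against the model's config.json, a 16-bit cache is assumed, and the helper name kv_cache_gib is hypothetical.

```python
NUM_LAYERS = 48      # assumption: verify num_hidden_layers in config.json
NUM_KV_HEADS = 4     # assumption: verify num_key_value_heads in config.json
HEAD_DIM = 128       # assumption: verify head_dim in config.json
BYTES_PER_VALUE = 2  # assumption: 16-bit KV cache


def kv_cache_gib(num_tokens: int) -> float:
    """Size of the key/value cache for one sequence, in GiB (2**30 bytes)."""
    # The leading 2 accounts for storing both keys and values.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return num_tokens * per_token / 2**30


for tokens in (8_192, 32_768, 262_144):
    print(f"{tokens:>7} tokens: {kv_cache_gib(tokens):.2f} GiB")
```

Under these assumptions a single full-length 262,144-token sequence needs about 24 GiB of cache on top of the weights, which is why lowering --max-model-len is the usual first step when a deployment runs out of memory.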