Overview
Qwen3-30B-A3B-Instruct-2507 is a July 2025 refresh of Alibaba's Qwen3-30B-A3B model, tuned for instruction following rather than chain-of-thought reasoning. It is a mixture-of-experts model: 30.5B parameters in total, but each token only activates 3.3B of them across 8 of its 128 experts. That keeps inference cheap relative to its size. Unlike some siblings in the Qwen3 line, this checkpoint runs in non-thinking mode only and never emits <think> blocks, so you do not need to toggle a thinking flag.
What it's good at
Compared with the original Qwen3-30B-A3B, the 2507 update posts large gains across the board. It scores 78.4 on MMLU-Pro and 70.4 on GPQA, jumps to 61.3 on AIME25 math, and reaches 90.0 on ZebraLogic, beating much larger models on that logic test. Coding is solid too (83.8 on MultiPL-E, 43.2 on LiveCodeBench v6), and it handles tool calling well, which makes it a practical choice for agent workflows through frameworks like Qwen-Agent. It also improved on open-ended writing and multilingual long-tail knowledge.
Running locally
The MoE design means the active footprint is small, so a 4-bit quant fits on a single 24 GB GPU for everyday context lengths. You can serve it with vLLM or SGLang for an OpenAI-compatible API, or run it through Ollama, LM Studio, llama.cpp, or MLX-LM on a workstation. The native 256K context is memory-hungry; pushing toward the 1M-token configuration with Dual Chunk Attention needs roughly 240 GB of GPU memory, so most users cap context lower to avoid out-of-memory errors. Qwen recommends temperature 0.7, top-p 0.8, top-k 20.
License
The model ships under Apache 2.0. You can use it commercially, modify it, and redistribute it, with no royalty and only the standard requirement to preserve the copyright and license notices. The weights are openly downloadable from Hugging Face.


