Overview
Qwen3-235B-A22B is the flagship model of the Qwen3 series, released in April 2025 by Alibaba's Qwen team. It is a mixture-of-experts (MoE) causal language model with 235 billion total parameters, of which about 22 billion are activated per token. The architecture has 94 layers and 128 experts, with the router selecting 8 experts for each token, and it uses grouped-query attention with 64 query heads and 4 key/value heads. The defining feature of Qwen3 is that a single model switches between a thinking mode for hard reasoning and a non-thinking mode for fast general dialogue.
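To make the effect of grouped-query attention concrete, here is a rough back-of-the-envelope sketch of key/value cache size for one sequence. The layer count and head counts come from the text above; the head dimension of 128 and the 2-byte (BF16) cache entries are assumptions made for illustration, and the function name is hypothetical.

```python
def kv_cache_bytes(tokens, layers=94, kv_heads=4, head_dim=128, bytes_per_value=2):
    """Estimate KV-cache bytes for a single sequence.

    head_dim=128 and bytes_per_value=2 (BF16) are assumptions, not from the text.
    """
    # Each layer stores two tensors (K and V), each with
    # kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
gqa = kv_cache_bytes(32_768)               # 4 KV heads, as described above
mha = kv_cache_bytes(32_768, kv_heads=64)  # hypothetical: one KV head per query head

print(f"GQA cache: {gqa / GIB:.1f} GiB")
print(f"Full multi-head cache: {mha / GIB:.1f} GiB")
```

Under these assumptions a full 32,768-token sequence needs about 5.9 GiB of cache with 4 key/value heads, against 94 GiB if every one of the 64 query heads kept its own keys and values, a 16x difference.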
What it's good at
In thinking mode the model emits a reasoning trace inside a <think>...</think> block before answering. Qwen reports that on mathematics, code generation, and logical reasoning it surpasses the earlier QwQ models in thinking mode and the Qwen2.5-Instruct models in non-thinking mode, and that on benchmark suites it is competitive with DeepSeek-R1, OpenAI o1, Grok-3, and Gemini-2.5-Pro. It handles 100+ languages and dialects, including translation and multilingual instruction following, and it is built for agentic work, with tool calling that pairs well with the Qwen-Agent framework and MCP servers.
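Because the reasoning trace and the final answer arrive in the same completion, client code usually wants to separate them. The following is a minimal sketch that assumes the raw text contains a literal <think>...</think> pair; some serving stacks strip the tags or return the reasoning in a separate field, in which case this step is unnecessary. The helper name split_thinking is hypothetical.

```python
import re

# Non-greedy match across newlines for the first <think>...</think> block.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) extracted from a raw completion string."""
    match = THINK_RE.search(text)
    if match is None:
        # Non-thinking mode, or tags already removed upstream.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


raw = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping only the answer part in the conversation history, rather than the full trace, also keeps multi-turn prompts shorter.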
Running locally
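Because all 235B parameters must stay resident in memory even though only about 22B take part in computing each token, this model needs a lot of memory. In BF16 the weights alone take roughly 470 GB (235B parameters at 2 bytes each), so the model is usually served with 8-way tensor parallelism in vLLM (0.8.5+) or SGLang (0.4.6.post1+), both of which expose an OpenAI-compatible endpoint. For smaller setups, 4-bit quantization (GGUF files for llama.cpp, Ollama, and LM Studio, or MLX's own format on Apple silicon) brings the weights down to roughly 120 GB or more, depending on the quantization variant. That still calls for several high-memory GPUs, a machine with large unified memory, or offloading part of the model to system RAM. Native context is 32,768 tokens, extendable to 131,072 with YaRN when long inputs are needed.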
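The memory figures follow from simple arithmetic, sketched below. The estimate covers weights only and ignores the KV cache, activations, and the per-block overhead that real quantization formats add, so actual files run somewhat larger. The function name is hypothetical.

```python
def weight_gb(params_billion, bits_per_param):
    """Weight storage in GB (10**9 bytes) for a given precision."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB.
    return params_billion * bits_per_param / 8


bf16_total = weight_gb(235, 16)      # all experts resident in BF16
bf16_per_gpu = bf16_total / 8        # share per GPU under 8-way tensor parallelism
q4_total = weight_gb(235, 4)         # idealized 4-bit quantization

# YaRN scaling factor implied by the native and extended context lengths.
yarn_factor = 131_072 / 32_768

print(f"BF16 weights: {bf16_total:.1f} GB ({bf16_per_gpu:.2f} GB per GPU at TP=8)")
print(f"4-bit weights: {q4_total:.1f} GB")
print(f"YaRN factor: {yarn_factor}")
```

The per-GPU share of about 59 GB in BF16 is why 8 accelerators with 80 GB each are a common configuration: the remainder is left for the KV cache and activations.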
License
Qwen3-235B-A22B is released under the Apache 2.0 license. You can use, modify, fine-tune, and deploy it commercially without paying a fee. If you redistribute the model or a derivative, the license requires you to include a copy of the license, retain the existing copyright and attribution notices, and mark any files you modified. The weights are openly downloadable from Hugging Face.


