Qwen3-235B-A22B

Thinking

Reasoning

Code

Multilingual

Tools

A 235B-parameter mixture-of-experts LLM from Alibaba's Qwen3 series that activates 22B parameters per token and can switch between thinking and non-thinking modes.

pip install -U "transformers>=4.51.0"
huggingface-cli download Qwen/Qwen3-235B-A22B
# serve with vLLM (8-way tensor parallel)
vllm serve Qwen/Qwen3-235B-A22B --enable-reasoning --reasoning-parser deepseek_r1 --tensor-parallel-size 8

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-235B-A22B",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts in one paragraph."}],
    "temperature": 0.6,
    "top_p": 0.95
  }'

from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the tokenizer and weights; device_map="auto" places the weights across the available devices.
model_name = "Qwen/Qwen3-235B-A22B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
# Render the conversation with the chat template; enable_thinking=True keeps thinking mode on.
messages = [{"role": "user", "content": "Give me a short intro to LLMs."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Generate, then decode only the new tokens that follow the prompt.
out = model.generate(**inputs, max_new_tokens=32768)
print(tokenizer.decode(out[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "EMPTY" });
const res = await client.chat.completions.create({
  model: "Qwen/Qwen3-235B-A22B",
  messages: [{ role: "user", content: "Explain mixture-of-experts in one paragraph." }],
  temperature: 0.6,
  top_p: 0.95,
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Qwen3.6-35B-A3B		256K	Text, Image
Qwen3.6-27B		256K	Text, Image
Qwen3-8B	Reasoning, coding, agentic chat	128K	Text
Qwen2.5-7B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-32B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-14B-Instruct	General chat, coding, reasoning	128K	Text
Qwen2.5-Coder-7B-Instruct	Code generation, code reasoning	128K	Text
Qwen2.5-Coder-32B-Instruct	Code generation, fixing, agents	128K	Text
Qwen3-30B-A3B-Instruct-2507	General chat, reasoning, agents	256K	Text
Qwen2.5-72B-Instruct	General chat, coding, reasoning	128K	Text
Qwen3-4B-Thinking-2507	Reasoning, math, coding	256K	Text
WebWorld-8B	Web agents, multimodal reasoning	40K	Text, Image
MiniCPM-V 4.6	5213 GB	421K	Text, Image
anima	421 GB	31K	Text
Qwen3-Coder-30B-A3B-Instruct		256K	Text
Qwen3-30B-A3B		128K	Text
Qwen3-14B		128K	Text
Qwen3-32B		128K	Text

At a glance

License: Apache 2.0
Parameters: 235B total, 22B active (128 experts, 8 active)
Context length: 32K native, up to 128K with YaRN
Languages: 100+ languages and dialects
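Minimum hardware: 48 GB+ VRAM (4-bit, multi-GPU), plus system RAM for any weights that do not fit on the GPUs (the 4-bit weights alone are about 118 GB)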
Strengths: reasoning, coding, multilingual chat, tool calling

Overview

Qwen3-235B-A22B is the flagship model of the Qwen3 series, released in 2025 by Alibaba's Qwen team. It is a mixture-of-experts (MoE) causal language model with 235 billion total parameters, of which 22 billion are activated per token. The architecture has 94 layers and 128 experts, with 8 experts selected for each token, and it uses grouped-query attention with 64 query heads and 4 key/value heads. The defining feature of Qwen3 is that a single model switches between a thinking mode for hard reasoning and a non-thinking mode for fast general dialogue.
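The "8 of 128 experts" selection can be illustrated with a generic top-k router. This is a minimal sketch of the general technique, not Qwen's actual implementation: the random logits stand in for a learned router, and the function name `route_token` is hypothetical.

```python
import math
import random

NUM_EXPERTS = 128  # experts per MoE layer (from the overview)
TOP_K = 8          # experts selected per token (from the overview)


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts."""
    # Indices of the k largest router logits, best first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only, so the mixing weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Stand-in router scores for one token (an assumption for illustration).
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
for expert, weight in selected:
    print(f"expert {expert:3d}  weight {weight:.3f}")
```

Only the selected experts run for that token, which is why compute scales with the 22B active parameters while memory still scales with all 235B.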

What it's good at

In thinking mode the model emits a reasoning trace inside a <think>...</think> block before answering. Qwen reports that this lifts accuracy on mathematics, code generation, and logical problems above the earlier QwQ models, and that in non-thinking mode the model surpasses Qwen2.5-Instruct. Qwen also reports benchmark results competitive with DeepSeek-R1, OpenAI o1, Grok-3, and Gemini-2.5-Pro. The model handles 100+ languages and dialects with solid translation and multilingual instruction following, and it is built for agentic work, with strong tool calling that pairs well with the Qwen-Agent framework and MCP servers.
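When working with raw decoded text, the reasoning trace usually needs to be separated from the final answer. A minimal sketch, assuming the output contains at most one closing </think> tag; `split_thinking` is a hypothetical helper, not part of any library.

```python
def split_thinking(text):
    """Split a completion into (reasoning, answer) at the closing </think> tag."""
    close_tag = "</think>"
    if close_tag not in text:
        # Non-thinking output: everything is the answer.
        return "", text.strip()
    reasoning, answer = text.split(close_tag, 1)
    # Drop the opening tag if the decoded text still includes it.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

If the server is started with a reasoning parser, the trace may arrive in a separate response field instead, so this helper applies mainly to text decoded directly from the model.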

Running locally

Because all 235B parameters must stay resident even though only 22B are used per token, this model needs a lot of memory. In BF16 the weights alone take about 470 GB (235B parameters at 2 bytes each), so the model is usually served with 8-way tensor parallelism in vLLM (0.8.5+) or SGLang (0.4.6+), which expose an OpenAI-compatible endpoint. For smaller setups, 4-bit quantization (for example GGUF through llama.cpp, Ollama, or LM Studio, or a quantized MLX build) shrinks the weights to roughly 118 GB or more. A setup with around 48 GB of VRAM therefore has to keep the rest of the weights in system RAM, while larger multi-GPU or high-memory machines can hold them entirely. Native context is 32,768 tokens, extendable to 131,072 with YaRN when long inputs are needed.
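The memory figures can be estimated with back-of-the-envelope arithmetic. This is a rough sketch under stated assumptions: it ignores quantization metadata, activations, and framework overhead, and the KV-cache estimate assumes a head dimension of 128, which this page does not state.

```python
TOTAL_PARAMS = 235e9  # all parameters stay resident in memory


def weight_gb(params, bits_per_param):
    """Weight storage in decimal gigabytes, ignoring quantization metadata."""
    return params * bits_per_param / 8 / 1e9


def kv_cache_gb(tokens, layers=94, kv_heads=4, head_dim=128, bytes_per_value=2):
    """KV-cache size for one sequence; head_dim=128 is an assumption."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5} weights: {weight_gb(TOTAL_PARAMS, bits):.1f} GB")
print(f"KV cache at 32,768 tokens: {kv_cache_gb(32768):.1f} GB")
```

Under these assumptions the weights come to 470 GB in BF16 and 117.5 GB at 4 bits, which is why quantized builds still need far more total memory than a single consumer GPU provides.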

License

Qwen3-235B-A22B is released under the Apache 2.0 license. You can use, modify, fine-tune, and deploy it commercially without paying a fee. When you redistribute the model or derivatives, the license requires you to include a copy of the license, retain existing copyright and attribution notices, and mark any files you modified. The weights are openly downloadable from Hugging Face.

Frequently asked questions

What is Qwen3-235B-A22B?

Qwen3-235B-A22B is a mixture-of-experts (MoE) large language model from Alibaba's Qwen team. It has 235 billion total parameters but activates only 22 billion per token, using 8 of its 128 experts for each token. It is the flagship model of the Qwen3 series, released in 2025, and supports both a thinking mode for reasoning and a non-thinking mode for general dialogue.

What hardware do I need to run Qwen3-235B-A22B locally?

Running Qwen3-235B-A22B locally requires substantial memory because all 235 billion parameters must be held in memory even though only 22 billion are active per token. In BF16 the weights alone take about 470 GB spread across multiple GPUs, so the model is typically served with tensor parallelism over 8 GPUs in frameworks like vLLM or SGLang. With 4-bit quantization (for example GGUF) the weights shrink to roughly 118 GB or more, so it can run on high-memory workstations or multi-GPU setups; with around 48 GB of VRAM, the weights that do not fit on the GPUs have to be kept in system RAM.

Is Qwen3-235B-A22B free to use?

Yes. Qwen3-235B-A22B is released under the Apache 2.0 license, which permits free use, modification, and commercial deployment with no fees. The weights are available to download on Hugging Face. You can run it yourself on your own hardware or access it through hosted API providers such as OpenRouter and Alibaba Cloud Model Studio, which charge for their compute.

What is the context length of Qwen3-235B-A22B?

Qwen3-235B-A22B natively supports a context length of 32,768 tokens. Using the YaRN RoPE-scaling method, it can be extended to 131,072 tokens (128K), which Qwen has validated for long-document and long-conversation use. YaRN is supported by transformers, llama.cpp, vLLM, and SGLang, but the Qwen team recommends enabling it only when you actually need long contexts, since static YaRN can slightly degrade performance on shorter inputs.
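The YaRN scaling factor follows directly from the two context lengths: 131,072 / 32,768 = 4. The sketch below builds the corresponding `rope_scaling` entry; the key names are given as I understand them from the Qwen3 model card and should be checked against the framework version in use, and `yarn_rope_scaling` is a hypothetical helper.

```python
import json

NATIVE_CONTEXT = 32768    # native context length in tokens
TARGET_CONTEXT = 131072   # extended context length with YaRN


def yarn_rope_scaling(target, native=NATIVE_CONTEXT):
    """Build a rope_scaling entry whose factor is target / native."""
    return {
        "rope_type": "yarn",
        "factor": target / native,
        "original_max_position_embeddings": native,
    }


config_patch = {"rope_scaling": yarn_rope_scaling(TARGET_CONTEXT)}
print(json.dumps(config_patch, indent=2))
```

A smaller factor is the natural choice when inputs are only moderately long; for example, a 65,536-token target gives a factor of 2.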

What is the difference between thinking and non-thinking mode?

Qwen3-235B-A22B can switch between two modes inside a single model. In thinking mode, enabled by default, it produces a reasoning trace wrapped in a <think>...</think> block before its final answer, which improves results on math, code, and logic. In non-thinking mode it replies directly without that block, behaving like a standard instruct model for faster general chat. You can toggle modes with the enable_thinking flag or by adding /think or /no_think to your prompt.
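Both toggles can be sketched as small helpers that prepare a request for an OpenAI-compatible endpoint. Several things here are assumptions rather than facts from this page: the `chat_template_kwargs` request field is how vLLM and SGLang pass `enable_thinking` to the chat template as far as I know, the non-thinking sampling values (temperature 0.7, top_p 0.8) are recalled from Qwen's published recommendations, and both function names are hypothetical.

```python
def build_chat_request(prompt, thinking=True, model="Qwen/Qwen3-235B-A22B"):
    """Build a chat-completions payload with the hard enable_thinking switch."""
    # Thinking values match the examples on this page; non-thinking values are an assumption.
    sampling = ({"temperature": 0.6, "top_p": 0.95} if thinking
                else {"temperature": 0.7, "top_p": 0.8})
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,
        # Assumed server-side field that forwards the flag to the chat template.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def with_soft_switch(prompt, thinking):
    """Append the per-turn soft switch to a user prompt."""
    return f"{prompt} {'/think' if thinking else '/no_think'}"


print(build_chat_request("Summarize YaRN in one sentence.", thinking=False))
print(with_soft_switch("What is 17 * 23?", thinking=True))
```

The hard switch sets the mode for the whole request, while the soft switch lets a single turn in a multi-turn conversation opt in or out.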