MiniMax-M2.5

Updated: 27.06.2026
Tags: Thinking, Tools, Reasoning, Code, Web

A 229B-parameter (10B active) MoE model from MiniMax built for agentic coding, tool use, and search, with a 200K context window.

Install and download (shell):

pip install -U transformers
huggingface-cli download MiniMaxAI/MiniMax-M2.5

Serve with vLLM and query the OpenAI-compatible endpoint (shell):

# --tensor-parallel-size should match the number of GPUs you shard across
vllm serve MiniMaxAI/MiniMax-M2.5 --trust-remote-code --tensor-parallel-size 2

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.5",
    "messages": [{"role": "user", "content": "Refactor this function for readability."}],
    "temperature": 1.0,
    "top_p": 0.95
  }'

Transformers (Python):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniMaxAI/MiniMax-M2.5"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

messages = [{"role": "user", "content": "Write a Python function to reverse a linked list."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature, top_p, and top_k to take effect
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=40)
# decode only the newly generated tokens, not the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

OpenAI client against the local vLLM server (JavaScript):

import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "EMPTY" });
const res = await client.chat.completions.create({
  model: "MiniMaxAI/MiniMax-M2.5",
  messages: [{ role: "user", content: "Plan the file structure for a CLI todo app." }],
  temperature: 1.0,
  top_p: 0.95,
});
console.log(res.choices[0].message.content);

More models

Name           Size / Usage   Context   Input
MiniMax-M3     -              1M        Text, Image
MiniMax-M2.7   -              200K      Text

At a glance

  • License: Modified MIT (commercial use allowed)
  • Parameters: 229B total, ~10B active (MoE, 8 experts/token)
  • Context length: ~200K tokens (204,800)
  • Minimum hardware: about 96 GB or more of combined VRAM and system RAM; the 3-bit quantized build is about 101 GB
  • Strengths: agentic coding, tool calling, web search, reasoning

Overview

MiniMax-M2.5 is an open-weight large language model released by MiniMax, the Shanghai-based AI lab behind the M2 series. It uses a Mixture-of-Experts design with 229B total parameters and about 10B active per token, routing each token through 8 of its experts. The model ships in fp8 on Hugging Face. It arrived in early 2026 as the third release in the M2 family, following M2 and M2.1 within roughly three and a half months. A faster sibling, M2.5-Lightning, offers the same capability at higher throughput.
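
To make "routing each token through 8 of its experts" concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is an illustration of the general technique, not MiniMax's router: the total expert count (NUM_EXPERTS) is a made-up value, since this page does not state it, and the real model may normalize weights or balance load differently.

```python
import math
import random


def top_k_route(router_logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


NUM_EXPERTS = 64  # hypothetical: the total expert count is not given on this page
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores

routes = top_k_route(logits, k=8)
for expert_id, weight in routes:
    print(f"expert {expert_id:2d}  weight {weight:.3f}")
```

Only the 8 selected experts run for a given token, which is why compute tracks the active parameter count even though every expert's weights must stay loaded.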

What it's good at

M2.5 was trained with reinforcement learning across more than 200,000 real-world environments, and it targets agentic work rather than chat alone. MiniMax reports 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp. Coding coverage spans more than ten languages, including Go, Rust, TypeScript, Python, Java, and C++, across web, mobile, and server projects. Before writing code it tends to plan like an architect, decomposing features and structure first. It also handles tool calling, web search, and office deliverables such as Word documents, slides, and Excel financial models. The model thinks step by step using a built-in reasoning trace and runs natively at up to 100 tokens per second.
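
As an illustration of the tool-calling use case, the sketch below builds an OpenAI-style chat-completions request that advertises one tool. It assumes the local vLLM server from the quick-start is running at http://localhost:8000/v1 with tool calling enabled; the run_tests tool and its schema are hypothetical and exist only for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: local vLLM server from the quick-start


def build_tool_request(user_text):
    """Build a chat-completions payload that advertises one callable tool."""
    return {
        "model": "MiniMaxAI/MiniMax-M2.5",
        "messages": [{"role": "user", "content": user_text}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "run_tests",  # hypothetical tool, not part of any real API
                "description": "Run the project's test suite and return failures.",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }],
        "temperature": 1.0,
        "top_p": 0.95,
    }


def send(payload):
    """POST the payload; needs a running server, so it is not called here."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_tool_request("Fix the failing tests under ./src.")
print(json.dumps(payload, indent=2))
```

If the model decides to use the tool, an OpenAI-compatible server returns the call under choices[0].message.tool_calls; your code then runs the tool and sends the result back as a "tool" role message.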

Running locally

The 10B active parameters keep compute low, but all 229B weights still need to fit in memory, so the fp8 checkpoint is around 230 GB. A single 24 GB GPU cannot hold it; realistic local setups use 96 GB or more of combined VRAM and system RAM, such as 2x H100 80 GB or several consumer GPUs with CPU offload. Quantized GGUF builds help: a 3-bit quant is about 101 GB and runs near 25 tokens per second on an 80 GB H100. MiniMax recommends SGLang, vLLM, Transformers, or KTransformers for serving, with sampling at temperature 1.0, top_p 0.95, and top_k 40.
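
The memory figures above can be sanity-checked with simple arithmetic. This sketch counts raw weight storage only (no KV cache, activations, or runtime overhead) and uses decimal gigabytes; the helper names are illustrative.

```python
TOTAL_PARAMS = 229e9  # total parameters, from the text above


def weight_memory_gb(params, bits_per_weight):
    """Raw weight storage in decimal GB, ignoring KV cache and runtime overhead."""
    return params * bits_per_weight / 8 / 1e9


def effective_bits(params, size_gb):
    """Average bits per weight implied by a checkpoint of the given size."""
    return size_gb * 1e9 * 8 / params


fp8_gb = weight_memory_gb(TOTAL_PARAMS, 8)
pure_3bit_gb = weight_memory_gb(TOTAL_PARAMS, 3)
quant_bits = effective_bits(TOTAL_PARAMS, 101)

print(f"fp8 weights:        {fp8_gb:.0f} GB")
print(f"pure 3-bit weights: {pure_3bit_gb:.1f} GB")
print(f"101 GB quant:       {quant_bits:.2f} bits/weight on average")
```

The fp8 estimate lands close to the quoted 230 GB. The 101 GB build works out to roughly 3.5 bits per weight rather than exactly 3, presumably because quantized formats keep some tensors at higher precision. At 101 GB it also exceeds a single 80 GB GPU, so part of the model would have to sit in system RAM.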

License

MiniMax-M2.5 is distributed under a Modified MIT license. You can download the weights, run them, fine-tune them, and use the model commercially. It is open-weight rather than fully open-source, since the training data and full pipeline are not published.

Frequently asked questions

What architecture does MiniMax-M2.5 use?
MiniMax-M2.5 is a Mixture-of-Experts (MoE) model with 229B total parameters, of which roughly 10B are active per token. It activates 8 of its experts on each forward pass, which keeps inference cost low while retaining the capacity of a much larger dense model.

What hardware do I need to run MiniMax-M2.5 locally?
Despite activating only 10B parameters per token, MiniMax-M2.5 still has to hold all 229B weights in memory. The fp8 weights are about 230 GB, so a single 24 GB consumer GPU is not enough. Practical local setups need 96 GB or more of combined VRAM and system RAM, for example 2x H100 80 GB or several consumer GPUs with CPU offload. Quantized GGUF builds shrink the footprint: a 3-bit quant lands around 101 GB and runs at roughly 25 tokens per second on an 80 GB H100.

Is MiniMax-M2.5 open source, and what does it cost?
The weights are released on Hugging Face under a Modified MIT license, so you can download, run, fine-tune, and use the model commercially. It is open-weight rather than fully open-source, since the training data and full pipeline are not published. MiniMax also offers a hosted API where M2.5 costs about $0.30 per million input tokens and $2.40 per million output tokens.
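
For budgeting against the hosted API, the per-token prices above translate into a one-line estimate. The sketch below uses the approximate rates quoted on this page; the token counts are made-up example values, and actual billing may differ.

```python
INPUT_USD_PER_M = 0.30   # per million input tokens, from the text above
OUTPUT_USD_PER_M = 2.40  # per million output tokens, from the text above


def estimate_cost_usd(input_tokens, output_tokens):
    """Approximate cost of one request at the quoted per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical agentic run: a large prompt with a moderate amount of output.
cost = estimate_cost_usd(150_000, 20_000)
print(f"${cost:.4f}")
```

Output tokens cost eight times as much as input tokens at these rates, so long reasoning traces and verbose tool loops dominate the bill faster than large prompts do.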

How long is the context window?
MiniMax-M2.5 supports a context window of 204,800 tokens (roughly 200K). That headroom suits long agentic runs, large codebases, and multi-document search tasks. For very long browsing sessions the model is designed to discard history when token usage gets high, which is how MiniMax reports its BrowseComp results.
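
The page does not describe how history is discarded, so the following is only one simple client-side strategy under stated assumptions: drop the oldest non-system turns until an estimated token count fits the window. The 4-characters-per-token estimate and the reserve size are placeholders; a real implementation would count tokens with the model's tokenizer.

```python
CONTEXT_LIMIT = 204_800  # tokens, from the text above


def estimate_tokens(message):
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return max(1, len(message["content"]) // 4)


def trim_history(messages, budget=CONTEXT_LIMIT, reserve=8_192):
    """Drop the oldest non-system turns until the estimate fits budget - reserve."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    limit = budget - reserve  # leave room for the model's reply
    used = sum(estimate_tokens(m) for m in system + turns)
    # Always keep at least the most recent turn.
    while len(turns) > 1 and used > limit:
        used -= estimate_tokens(turns.pop(0))
    return system + turns


# Hypothetical long browsing session: 30 large page dumps after a system prompt.
history = [{"role": "system", "content": "You are a browsing agent."}]
history += [{"role": "user", "content": "x" * 40_000} for _ in range(30)]

trimmed = trim_history(history)
print(len(history), "->", len(trimmed))
```

Keeping the system prompt and the most recent turns preserves the task instructions and the current state, at the cost of forgetting early pages; summarizing dropped turns is a common alternative.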

What is MiniMax-M2.5 best at?
MiniMax-M2.5 is built for agentic coding and tool use. MiniMax reports 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp, and the model was trained across more than 200,000 real-world environments in over ten programming languages. It also handles office tasks such as Word, PowerPoint, and Excel financial modeling. The model thinks step by step before acting and runs natively at up to 100 tokens per second.