Kimi-K2-Instruct

Tools, Code, Reasoning, Multilingual

Run

A 1T-parameter MoE chat model from Moonshot AI with 32B active parameters, built for agentic tool use and strong coding.

# Install Transformers, then download the checkpoint (roughly 1 TB) from Hugging Face
pip install -U transformers
huggingface-cli download moonshotai/Kimi-K2-Instruct

# Query an OpenAI-compatible server that is already serving the model on localhost:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "moonshotai/Kimi-K2-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

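# Load the model and tokenizer with Transformers.
# trust_remote_code=True executes the custom code shipped in the model repository.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("moonshotai/Kimi-K2-Instruct", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2-Instruct", trust_remote_code=True)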

// Call the same local OpenAI-compatible endpoint with the openai package.
// The apiKey value is a placeholder for a local server.
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "local" });
const res = await client.chat.completions.create({
  model: "moonshotai/Kimi-K2-Instruct",
  messages: [{ role: "user", content: "Hello" }]
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Kimi-K2.7-Code		256K	Text, Image
Kimi-K2-Instruct-0905		256K	Text

At a glance

License: Modified MIT (commercial use allowed)
Architecture: Mixture-of-Experts, 1T total / 32B active
Context length: 128K tokens
Input: Text
Minimum hardware: multi-GPU server, hundreds of GB VRAM (fp8)
Strengths: agentic tool use, coding, reasoning

Overview

Kimi-K2-Instruct is the instruction-tuned chat model in Moonshot AI's Kimi K2 series, released in mid-2025. It uses a mixture-of-experts (MoE) architecture with 1 trillion total parameters, of which 32 billion are activated per token: each token is routed to 8 of 384 experts. The model was pre-trained on 15.5 trillion tokens using Moonshot's MuonClip optimizer, which the team used to scale Muon-based training without the loss instabilities that usually appear at this size. Moonshot calls it a reflex-grade model: it answers directly rather than producing a long internal chain of thought.
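To make the "8 of 384 experts" figure concrete, here is a minimal sketch of top-k expert routing for a single token. It is a generic illustration only: the softmax-over-selected weighting, the random stand-in logits, and the names route_token, NUM_EXPERTS and TOP_K are assumptions for this example, not Moonshot's actual router.

```python
import math
import random

NUM_EXPERTS = 384   # routed experts (figure taken from the text above)
TOP_K = 8           # experts selected per token (figure taken from the text above)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and normalize their weights.

    Generic top-k routing with a softmax over the selected experts.
    This is an illustrative assumption; the real gating function may differ.
    """
    # Indices of the top_k largest router logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax restricted to the chosen experts (shifted for numerical stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
# Stand-in for the router's output for one token (random, not real model data).
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

print(f"experts used for this token: {len(selected)} of {NUM_EXPERTS}")
print(f"fraction of routed experts active: {TOP_K / NUM_EXPERTS:.1%}")
print(f"fraction of parameters active: {32e9 / 1e12:.1%}")
```

Only the selected experts run for a given token, which is why the per-token compute tracks the 32B active parameters rather than the 1T total.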

What it's good at

The model is built around agentic work and coding. On SWE-bench Verified it reaches 65.8% pass@1 with bash and editor tools on single-attempt patches, and 47.3% on SWE-bench Multilingual under the same setup. It has native tool calling: you supply the list of available tools in each request and the model decides when and how to invoke them, which makes it a good fit for autonomous agents. It also performs well on knowledge, math, and general reasoning benchmarks. Because it skips extended thinking, its latency tends to be lower than that of reasoning-first models, at the cost of step-by-step deliberation on the hardest problems.
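The tool-calling flow described above can be sketched against an OpenAI-compatible chat completions endpoint. This is a rough sketch under stated assumptions: the get_weather tool, the helper names build_tool_request and extract_tool_calls, and the mocked response are all hypothetical, and the serving stack must have tool calling enabled for the model to return tool calls at all.

```python
import json


def build_tool_request(user_message, tools, model="moonshotai/Kimi-K2-Instruct"):
    """Build a chat completions request body that advertises the available tools."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,          # sent with every request
        "tool_choice": "auto",   # let the model decide whether to call a tool
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-style response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string in this response format.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hypothetical tool definition, used only for illustration.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = build_tool_request("What's the weather in Paris?", [get_weather_tool])
print(json.dumps(payload, indent=2))

# Hand-written mock of a response, not real model output.
mock_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
            }],
        }
    }]
}
print(extract_tool_calls(mock_response))
```

In an agent loop, each extracted call would be executed by your own code and its result sent back as a message with the "tool" role before asking the model to continue.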

Running locally

Self-hosting is heavy. The weights ship in block-fp8 format, but the full 1T-parameter model still needs roughly a terabyte of storage and a multi-GPU server with hundreds of gigabytes of combined VRAM. It runs through frameworks such as vLLM, SGLang, and TensorRT-LLM, and the architecture is DeepseekV3-compatible, so existing MoE serving stacks work with minor configuration changes. For most users, a hosted endpoint or Moonshot's OpenAI/Anthropic-compatible API is a more practical path than local inference.
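The storage figure follows from simple arithmetic: about one byte per parameter at fp8 times 1 trillion parameters. The sketch below works that out and gives a rough GPU count for the weights alone. The 20% overhead allowance for KV cache and activations and the 80 GB and 141 GB per-GPU capacities are assumptions chosen for illustration, not measured requirements.

```python
import math

TOTAL_PARAMS = 1e12   # 1T total parameters (from the text above)
FP8_BYTES = 1         # one byte per parameter at fp8


def weight_bytes(num_params, bytes_per_param):
    """Raw size of the weights in bytes."""
    return num_params * bytes_per_param


def gpus_needed(num_params, bytes_per_param, gpu_memory_gb, overhead_fraction=0.2):
    """Rough GPU count to hold the weights plus an assumed overhead allowance.

    overhead_fraction is a placeholder for KV cache, activations and
    framework buffers; the real figure depends on context length and batch size.
    """
    total_gb = weight_bytes(num_params, bytes_per_param) * (1 + overhead_fraction) / 1e9
    return math.ceil(total_gb / gpu_memory_gb)


weights_tb = weight_bytes(TOTAL_PARAMS, FP8_BYTES) / 1e12
print(f"fp8 weights: about {weights_tb:.1f} TB")

# Example per-GPU memory sizes in GB (assumed, for illustration).
for gpu_gb in (80, 141):
    count = gpus_needed(TOTAL_PARAMS, FP8_BYTES, gpu_gb)
    print(f"{gpu_gb} GB GPUs: at least {count} (with an assumed 20% overhead)")
```

All 1T parameters must be resident even though only 32B are active per token, because any expert can be selected for any token.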

License

Both the code and the weights are released under a Modified MIT License. It allows commercial use and redistribution. The one added term is an attribution requirement that applies to very large commercial deployments, so most projects can use it freely while large-scale products must display Kimi K2 attribution.

Frequently asked questions

What is Kimi-K2-Instruct?

Kimi-K2-Instruct is a post-trained, instruction-tuned large language model from Moonshot AI. It is a mixture-of-experts model with 1 trillion total parameters and 32 billion activated per token, tuned for general-purpose chat and agentic tool use. Moonshot describes it as a reflex-grade model without long chain-of-thought thinking.

Can I run Kimi-K2-Instruct locally?

Running Kimi-K2-Instruct locally is demanding because the full model has 1 trillion parameters. The weights ship in block-fp8 format and still occupy roughly 1 TB, so a multi-GPU server with hundreds of gigabytes of combined VRAM is needed even at fp8. Most people run it through hosted inference providers or Moonshot's own API rather than on personal hardware.

Is Kimi-K2-Instruct open source?

Yes. Moonshot AI released both the code and the model weights under a Modified MIT License, and the checkpoints are available on Hugging Face. The license permits commercial use; the main added condition is an attribution clause that applies to very large-scale commercial deployments. You can also use it through Moonshot's paid API.

What is the context length of Kimi-K2-Instruct?

Kimi-K2-Instruct supports a 128K-token context window. It uses Multi-head Latent Attention (MLA) across 61 layers, which helps keep long-context inference efficient relative to the model's scale.

Is Kimi-K2-Instruct good for coding?

Kimi-K2-Instruct is tuned specifically for coding and tool use. It scores 65.8% pass@1 on SWE-bench Verified and 47.3% on SWE-bench Multilingual with bash and editor tools, in a single attempt and without extra test-time compute. It also has native tool calling: you pass the available tools in each request and the model decides when to invoke them.