Phi-4-mini-instruct

Reasoning

Code

Multilingual

Tools

A 3.8B-parameter open instruct model from Microsoft's Phi-4 family with 128K context, strong math and reasoning, and function calling.

pip install -U transformers accelerate
huggingface-cli download microsoft/Phi-4-mini-instruct
# or run with Ollama:
ollama run phi4-mini

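# Assumes an OpenAI-compatible server (for example, vLLM) is already serving the model on localhost:8000
curl http://localhost:8000/v1/chat/completions \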
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-4-mini-instruct",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}]
  }'

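from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "microsoft/Phi-4-mini-instruct"
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" (via accelerate) uses a GPU when one is available
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Explain gradient descent in one paragraph."}]
# return_dict=True returns input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))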

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "not-needed" });
const res = await client.chat.completions.create({
  model: "microsoft/Phi-4-mini-instruct",
  messages: [{ role: "user", content: "Write a haiku about compilers." }],
});
console.log(res.choices[0].message.content);

More models

Name	Use case	Context	Input
Phi-3.5-mini-instruct	General chat, multilingual	128K	Text
Phi-4	Reasoning, math, code	16K	Text

At a glance

License: MIT (commercial use allowed)
Parameters: 3.8B dense decoder-only
Context length: 128K tokens
Languages: 24
Minimum hardware: ~4 GB VRAM (4-bit), runs on an 8 GB GPU
Strengths: math, reasoning, instruction following, function calling

Overview

Phi-4-mini-instruct is a 3.8-billion-parameter open language model released by Microsoft in February 2025 as part of the Phi-4 family. It is a dense, decoder-only Transformer trained on synthetic data and filtered web content, with a deliberate emphasis on reasoning-heavy material. Architecturally it differs from its Phi-3.5-mini predecessor through a larger 200K-token vocabulary, grouped-query attention, and shared input and output embeddings. The model was trained on 512 A100-80G GPUs between November and December 2024, with a data cutoff of June 2024, and supports a 128K-token context window.
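Grouped-query attention matters most at long context, because it shrinks the key/value cache that grows with every token. The sketch below does that arithmetic. The layer count, head counts, and head dimension are assumed, illustrative values that are not stated on this page; check them against the model's config.json before relying on the result.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: every layer stores one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shape values for illustration only (not taken from this page).
N_LAYERS, N_QUERY_HEADS, N_KV_HEADS, HEAD_DIM = 32, 24, 8, 128
CONTEXT = 128 * 1024  # a full 128K-token window

# With grouped-query attention only the KV heads are cached.
gqa = kv_cache_bytes(CONTEXT, N_LAYERS, N_KV_HEADS, HEAD_DIM)
# Without it, every query head would need its own cached keys and values.
mha = kv_cache_bytes(CONTEXT, N_LAYERS, N_QUERY_HEADS, HEAD_DIM)

print(f"GQA cache: {gqa / 2**30:.1f} GiB, full multi-head: {mha / 2**30:.1f} GiB")
```

Under these assumed values a 16-bit cache comes to 16 GiB at the full 128K window, against 48 GiB if every query head kept its own keys and values. Shorter prompts scale down linearly.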

What it's good at

For its size, Phi-4-mini-instruct is strong at math and structured reasoning. It scores 88.6 on GSM8K (8-shot CoT) and 64.0 on MATH, beating several larger 7B-9B models on those tasks, and reaches 70.4 on BigBench Hard. It handles 24 languages, follows instructions reliably thanks to supervised fine-tuning plus direct preference optimization, and adds proper function calling, where tools are declared as JSON in the system prompt. Microsoft is candid about the trade-off: a 3.8B model cannot store much factual knowledge, so it can be factually wrong on long-tail topics. Pairing it with retrieval (RAG) is the recommended fix.
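As a rough illustration of declaring tools as JSON in the system prompt, the sketch below builds a system message with an embedded tool list. It assumes the wrapping tokens are `<|tool|>` and `<|/tool|>`; confirm that against the tokenizer's chat template. The `get_weather` tool and the `build_tool_system_message` helper are hypothetical names invented for this example.

```python
import json
from typing import Any, Dict, List


def build_tool_system_message(instructions: str, tools: List[Dict[str, Any]]) -> str:
    """Return system-prompt text with the tool list embedded as JSON."""
    # Assumption: <|tool|> ... <|/tool|> are the model's tool-wrapping tokens.
    return f"{instructions}<|tool|>{json.dumps(tools)}<|/tool|>"


# Hypothetical tool declaration, used only for illustration.
tools = [{
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {"city": {"type": "str", "description": "City name"}},
}]

messages = [
    {"role": "system",
     "content": build_tool_system_message("You are a helpful assistant with some tools.", tools)},
    {"role": "user", "content": "What is the weather in Paris today?"},
]

print(messages[0]["content"])
```

The resulting `messages` list can be sent through the same chat-completions or `apply_chat_template` paths used in the quickstart snippets.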

Running locally

The small size makes local deployment easy. A 4-bit quantized build needs around 3 to 4 GB of VRAM, so it fits comfortably on an 8 GB consumer GPU. At 16-bit precision the weights alone are about 7.6 GB (3.8B parameters at 2 bytes each), so plan for roughly 8 to 9 GB. It runs through Hugging Face transformers, vLLM, llama.cpp, and Ollama (as phi4-mini). The default transformers path uses flash attention and expects an Ampere-class or newer card; on V100 or older GPUs, load it with attn_implementation set to eager. Python 3.8 or 3.10 is recommended for the reference setup.
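The memory figures and the attention fallback can be sketched as two small helpers. This is a minimal sketch: both function names are hypothetical, the memory estimate covers weights only, and the backend choice assumes that Ampere-class cards report CUDA compute capability 8.x while V100 reports 7.0.

```python
from typing import Tuple


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def pick_attn_implementation(compute_capability: Tuple[int, int]) -> str:
    """Choose a value for attn_implementation from a (major, minor) capability."""
    major, _minor = compute_capability
    # Assumption: major version 8 or higher means Ampere or newer.
    return "flash_attention_2" if major >= 8 else "eager"


N_PARAMS = 3.8e9

print(f"4-bit weights:  {weight_memory_gb(N_PARAMS, 4):.1f} GB")
print(f"16-bit weights: {weight_memory_gb(N_PARAMS, 16):.1f} GB")
print("V100 (7.0):", pick_attn_implementation((7, 0)))
print("Ampere (8.6):", pick_attn_implementation((8, 6)))
```

The weights-only numbers sit below the VRAM figures quoted above because the KV cache, activations, and runtime overhead come on top. On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple that `pick_attn_implementation` expects, and the result can be passed as `attn_implementation` to `from_pretrained`.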

License

Phi-4-mini-instruct is released under the MIT license. That allows commercial and research use, modification, and redistribution with attribution and almost no other restrictions, which makes it one of the more permissive options among small instruct models.

Frequently asked questions

How many parameters does Phi-4-mini-instruct have?

Phi-4-mini-instruct has 3.8 billion parameters. It is a dense decoder-only Transformer that uses grouped-query attention and a 200K-token vocabulary, which keeps it small enough to run on modest hardware while still matching the quality of several larger models on reasoning and math benchmarks.

What is the context length of Phi-4-mini-instruct?

Phi-4-mini-instruct supports a 128K-token context window. That is large enough to process long documents, multi-turn conversations, or sizeable code files in a single prompt, which is unusual for a model of its size.

Is Phi-4-mini-instruct free for commercial use?

Yes. Microsoft released Phi-4-mini-instruct under the MIT license, which permits commercial and research use, modification, and redistribution with minimal restrictions. The weights can be downloaded for free from Hugging Face and are also available through Azure AI Foundry, Ollama, and NVIDIA NIM.

What hardware do I need to run Phi-4-mini-instruct?

Because it has only 3.8B parameters, Phi-4-mini-instruct runs comfortably on consumer hardware. A 4-bit quantized build needs roughly 3 to 4 GB of VRAM, so an 8 GB GPU has plenty of headroom. At 16-bit precision the weights alone are about 7.6 GB, so plan for roughly 8 to 9 GB. It also runs on CPU through llama.cpp or Ollama, though more slowly. Note that the default build uses flash attention and expects an Ampere-class or newer GPU; on older cards, load it with attn_implementation set to eager.

Does Phi-4-mini-instruct support function calling?

Yes. Function calling is a headline addition in the Phi-4-mini release. You declare the available tools as JSON inside the system prompt, wrapped in the model's tool tokens, and it returns structured calls. Microsoft notes that in some function-calling scenarios the model can occasionally hallucinate function names or URLs, so validating the returned calls before executing them is recommended.
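One way to validate returned calls before executing them is to check each call against the tools that were declared. The sketch below assumes the model's output has already been reduced to a JSON list of objects with `name` and `arguments` keys; that shape, the `validate_tool_calls` helper, and the `get_weather` tool are assumptions made for illustration, not a documented format.

```python
import json
from typing import Any, Dict, List


def validate_tool_calls(raw_calls: str,
                        declared_tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Parse tool calls and reject any that do not match a declared tool.

    Raises ValueError for malformed JSON, unknown function names, or
    argument names that the matching tool never declared.
    """
    specs = {tool["name"]: tool for tool in declared_tools}
    calls = json.loads(raw_calls)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(calls, list):
        raise ValueError("expected a JSON list of calls")
    for call in calls:
        if not isinstance(call, dict):
            raise ValueError(f"call is not an object: {call!r}")
        name = call.get("name")
        if not isinstance(name, str) or name not in specs:
            raise ValueError(f"unknown function: {name!r}")
        arguments = call.get("arguments", {})
        if not isinstance(arguments, dict):
            raise ValueError(f"arguments for {name} must be an object")
        unexpected = set(arguments) - set(specs[name].get("parameters", {}))
        if unexpected:
            raise ValueError(f"unexpected arguments for {name}: {sorted(unexpected)}")
    return calls


# Hypothetical declared tool, used only for illustration.
DECLARED = [{"name": "get_weather", "parameters": {"city": {"type": "str"}}}]

good = validate_tool_calls('[{"name": "get_weather", "arguments": {"city": "Paris"}}]', DECLARED)
print("accepted:", good)

try:
    validate_tool_calls('[{"name": "get_wether", "arguments": {"city": "Paris"}}]', DECLARED)
except ValueError as err:
    print("rejected:", err)
```

Rejecting unknown names outright, instead of guessing the nearest match, keeps a hallucinated function from ever reaching the code that performs the call. URLs inside arguments would need a separate allow-list check.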