SmolLM3-3B

A fully open 3B reasoning model from Hugging Face with dual-mode thinking, six native languages, tool calling, and 128K context.

pip install -U "transformers>=4.53.0"
huggingface-cli download HuggingFaceTB/SmolLM3-3B
# OpenAI-compatible server:
vllm serve HuggingFaceTB/SmolLM3-3B --enable-auto-tool-choice --tool-call-parser=hermes

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HuggingFaceTB/SmolLM3-3B",
    "messages": [{"role": "user", "content": "Give me a brief explanation of gravity."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

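from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
messages = [{"role": "user", "content": "Explain gravity simply."}]
# return_dict=True yields input_ids and attention_mask, so **inputs unpacks correctly.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
# do_sample=True is required for temperature and top_p to take effect.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))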

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "EMPTY" });
const res = await client.chat.completions.create({
  model: "HuggingFaceTB/SmolLM3-3B",
  messages: [{ role: "user", content: "Explain gravity simply." }],
  temperature: 0.6,
  top_p: 0.95,
});
console.log(res.choices[0].message.content);

At a glance

License: Apache 2.0
Context length: 128K tokens (extended with YaRN from a 64K training window)
Languages: 6 native (EN, FR, ES, DE, IT, PT) + AR, ZH, RU
Minimum hardware: ~8 GB VRAM for bf16, less with 4-bit quantization; runs on CPU via GGUF
Reasoning: dual-mode /think and /no_think
Strengths: instruction following, math reasoning, tool calling

Overview

SmolLM3-3B is a 3-billion-parameter language model from Hugging Face, released in 2025 as the third generation of the SmolLM family. It is a decoder-only transformer that uses grouped-query attention and mixes RoPE and NoPE layers in a 3:1 ratio, and it was pretrained on roughly 11 trillion tokens across web, code, math, and reasoning data. Post-training consisted of mid-training on 140B reasoning tokens, followed by supervised fine-tuning and alignment with Anchored Preference Optimization (APO). Hugging Face publishes the weights, the data mixture, and the training configs, so the pipeline can be inspected and reproduced.
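
To make the 3:1 ratio concrete, here is a minimal sketch that reads it as three rotary (RoPE) layers for every NoPE layer. The layer count of 36, the helper name rope_layer_mask, and the choice of the fourth layer in each group as the NoPE one are assumptions for illustration; the published config is the authority on the actual layout.

```python
def rope_layer_mask(num_layers: int, nope_interval: int = 4) -> list[bool]:
    """Return True for layers that keep RoPE and False for NoPE layers.

    Assumption: every `nope_interval`-th layer (counting from 1) drops RoPE.
    """
    return [(i + 1) % nope_interval != 0 for i in range(num_layers)]


mask = rope_layer_mask(num_layers=36)  # 36 is an assumed layer count
rope_layers = sum(mask)
nope_layers = len(mask) - rope_layers
print(f"RoPE layers: {rope_layers}, NoPE layers: {nope_layers}")
```

With these assumed values the split is 27 RoPE layers to 9 NoPE layers, which matches the 3:1 ratio.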

What it is good at

SmolLM3-3B ships with a hybrid reasoning design: you can turn extended thinking on with /think or off with /no_think. On instruction following it scores 76.7 on IFEval in no-think mode, ahead of Qwen2.5-3B and Llama 3.2 3B. In thinking mode its math and reasoning scores climb sharply: AIME 2025 rises from 9.3 to 36.7, and GPQA Diamond reaches 41.7. It handles tool calling for agentic workflows (92.3 on BFCL) and natively covers six languages: English, French, Spanish, German, Italian, and Portuguese, with additional Arabic, Chinese, and Russian data. Context reaches 128K tokens through YaRN extrapolation from a 64K training window.
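
The two ways of toggling thinking mentioned on this page can be sketched as a small payload builder for an OpenAI-compatible endpoint. This is an illustration, not an official client: build_chat_request is a hypothetical helper, and placing the /think or /no_think flag in the system message is an assumption about where the flag belongs. The enable_thinking chat-template kwarg mirrors the curl example.

```python
import json

MODEL_ID = "HuggingFaceTB/SmolLM3-3B"


def build_chat_request(prompt: str, thinking: bool, use_system_flag: bool = False) -> dict:
    """Build a chat-completions payload with thinking toggled on or off.

    use_system_flag=True  -> put /think or /no_think in a system message (assumed placement).
    use_system_flag=False -> pass enable_thinking through chat_template_kwargs.
    """
    payload = {"model": MODEL_ID, "temperature": 0.6, "top_p": 0.95}
    messages = []
    if use_system_flag:
        flag = "/think" if thinking else "/no_think"
        messages.append({"role": "system", "content": flag})
    else:
        payload["chat_template_kwargs"] = {"enable_thinking": thinking}
    messages.append({"role": "user", "content": prompt})
    payload["messages"] = messages
    return payload


# Print the JSON body you would POST to /v1/chat/completions.
print(json.dumps(build_chat_request("What is 17 * 23?", thinking=True), indent=2))
```

Use only one of the two mechanisms per request, so the template and the system prompt cannot disagree.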

Running locally

At 3B parameters the model is light. The bf16 weights take about 6 GB (3 billion parameters at 2 bytes each), so full bf16 inference fits in about 6-8 GB of VRAM; 4-bit quantization cuts the weights to roughly a quarter of that, so the model runs comfortably on most consumer GPUs with 8 GB or less. The modeling code landed in transformers v4.53.0, and you can serve an OpenAI-compatible endpoint with vLLM or SGLang. For CPU, Apple Silicon, or edge use there are GGUF, ONNX, MLX, and ExecuTorch builds. Hugging Face recommends sampling at temperature 0.6 and top_p 0.95.
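
As a rough check on the memory figures, the weight footprint is simply parameter count times bytes per parameter. The sketch below uses a nominal 3 billion parameters and counts weights only; the KV cache, activations, and quantization overhead are deliberately left out, so real usage will be higher. The helper name weight_memory_gb is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


NUM_PARAMS = 3e9  # nominal parameter count, an approximation

for precision in ("bf16", "int8", "int4"):
    print(f"{precision}: ~{weight_memory_gb(NUM_PARAMS, precision):.1f} GB")
```

The gap between the bf16 weight size and the 6-8 GB figure is what remains for the KV cache and activations, and it grows with context length.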

License

SmolLM3-3B is released under Apache 2.0. That allows commercial use, modification, and redistribution without a separate license fee, and there is no acceptable-use restriction beyond the standard Apache terms. As Hugging Face notes, outputs can still be inaccurate or biased, so the model should be treated as an assistive tool rather than an authoritative source.

Frequently asked questions

What is SmolLM3-3B?

SmolLM3-3B is a 3-billion-parameter decoder-only language model released by Hugging Face in 2025. It is an instruction-tuned model with dual-mode reasoning, native support for six languages, and context handling up to 128K tokens.

Can SmolLM3-3B be used commercially?

Yes. SmolLM3-3B is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution. Hugging Face also published the full training details, including the data mixture and training configs, making it a fully open model.

What hardware does SmolLM3-3B need?

At 3B parameters, full bf16 inference fits in about 6-8 GB of VRAM, and 4-bit quantization needs considerably less, so a GPU with roughly 8 GB of VRAM is enough. Quantized GGUF builds run on CPU via llama.cpp, and MLX builds run on Apple Silicon, making the model usable on consumer laptops.

What is the context length of SmolLM3-3B?

SmolLM3-3B was trained on a 64K context window and extends to 128K tokens using YaRN extrapolation. The shipped config.json is set to 65,536 tokens by default; you enable the longer window by adding the YaRN rope_scaling settings.
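
A minimal sketch of what such a rope_scaling entry might look like, assuming the key names commonly used in Hugging Face configs ("type", "factor", "original_max_position_embeddings"). The factor of 2.0 is derived here from 128K / 64K rather than quoted from the shipped files, and yarn_rope_scaling is a hypothetical helper, so check the model card before editing a real config.json.

```python
import json


def yarn_rope_scaling(original_max_positions: int, factor: float) -> dict:
    """Build a YaRN rope_scaling entry (key names are an assumed convention)."""
    return {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max_positions,
    }


TRAINED_CONTEXT = 65_536  # default window stated for the shipped config.json
rope_scaling = yarn_rope_scaling(TRAINED_CONTEXT, factor=2.0)
extended_context = int(TRAINED_CONTEXT * rope_scaling["factor"])

# Fragment to merge into config.json, plus the resulting window size.
print(json.dumps({"rope_scaling": rope_scaling}, indent=2))
print(f"Extended context: {extended_context} tokens")
```

Doubling 65,536 gives 131,072 tokens, which is the 128K window quoted throughout this page.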

How does SmolLM3-3B compare to Qwen3-4B?

Both are small reasoning models with thinking modes. Qwen3-4B is larger and scores higher on most benchmarks such as AIME and GPQA, while SmolLM3-3B is smaller, fully open with published training data, and leads on instruction following (IFEval 76.7 in no-think mode). SmolLM3 supports six native languages versus Qwen3's broader multilingual coverage.