SmolLM2-135M-Instruct

A 135M-parameter instruction-tuned LLM from Hugging Face's SmolLM2 family, small enough to run on CPU and on-device.

pip install -U transformers
huggingface-cli download HuggingFaceTB/SmolLM2-135M-Instruct

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HuggingFaceTB/SmolLM2-135M-Instruct",
    "messages": [{"role": "user", "content": "Summarize this paragraph."}]
  }'

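from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)  # loads on CPU by default

messages = [{"role": "user", "content": "What is gravity?"}]
# return_dict=True yields input_ids plus attention_mask, which generate() uses
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
out = model.generate(**inputs, max_new_tokens=50)
# Decode only the newly generated tokens, skipping the prompt and special tokens
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))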

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "local" });
const res = await client.chat.completions.create({
  model: "HuggingFaceTB/SmolLM2-135M-Instruct",
  messages: [{ role: "user", content: "Rewrite this sentence more formally." }]
});
console.log(res.choices[0].message.content);
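
For callers who prefer Python over curl or the JavaScript SDK, the same request can be sketched with only the standard library. This assumes, as the examples above do, that an OpenAI-compatible server is already listening on http://localhost:8000/v1. The helper names `build_chat_request` and `chat` are hypothetical, and the response is read from the same `choices[0].message.content` path the JavaScript example uses.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same local endpoint as the curl example
MODEL = "HuggingFaceTB/SmolLM2-135M-Instruct"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call; needs the local server to be running:
# print(chat("Summarize this paragraph."))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.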

More models

Name	Size / Usage	Context	Input
SmolLM3-3B	Reasoning, multilingual chat, agents	128K	Text

At a glance

License: Apache 2.0
Context length: 8K tokens
Languages: English
Minimum hardware: CPU only (~720 MB of memory)
Strengths: on-device chat, instruction following, rewriting, summarization

Overview

SmolLM2-135M-Instruct is the smallest model in Hugging Face's SmolLM2 family, which also includes 360M and 1.7B versions. It has 135 million parameters and uses a Llama-style transformer decoder. Hugging Face pretrained the base model on 2 trillion tokens drawn from FineWeb-Edu, DCLM, and The Stack, then produced this instruct variant through supervised fine-tuning followed by Direct Preference Optimization on the UltraFeedback dataset. The SmolLM2 work was published in early 2025 (arXiv:2502.02737).

What it's good at

The model is built for lightweight, on-device language tasks: short chat, instruction following, text rewriting, and summarization. Compared with the earlier SmolLM-135M-Instruct, it improved most on instruction following: IFEval rose from 17.2 to 29.9 and MT-Bench from 16.8 to 19.8, with gains on HellaSwag, ARC, and BBH as well. Its knowledge and math remain limited at this size (GSM8K is around 1.4), so it works best on simple, well-scoped prompts rather than open-ended reasoning. Function calling is supported only by the 1.7B variant.

Running locally

At 135M parameters the model needs roughly 720 MB in its default precision and runs on CPU, which makes it practical for laptops, phones, and other resource-constrained or offline devices. You can load it directly with Hugging Face Transformers, chat with it through the TRL CLI, or run it in the browser via Transformers.js. Quantized GGUF builds from the community (for example via llama.cpp or Ollama) shrink the footprint further for edge deployment.
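
To get a feel for how precision and quantization change the footprint, here is a back-of-envelope sketch. It counts weights only, uses the nominal 135M parameter count from the model name, and assumes idealized bytes-per-parameter values. The helper `weight_memory_mb` and the `BYTES_PER_PARAM` table are illustrative, not part of any library, and real quantized files store extra scale metadata, so they come out somewhat larger.

```python
# Idealized bytes per parameter for common weight formats (weights only).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_mb(n_params: int, dtype: str) -> float:
    """Rough weights-only size in megabytes (1 MB = 1e6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


N_PARAMS = 135_000_000  # nominal count from the model name; the exact figure may differ

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: ~{weight_memory_mb(N_PARAMS, dtype):.0f} MB")
```

These estimates will not match a measured footprint exactly, since buffers, activations, and the KV cache add to it. For a measured value, Transformers models expose `model.get_memory_footprint()`, which returns the size in bytes.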

License

SmolLM2-135M-Instruct is released under the Apache 2.0 license. That allows commercial use, modification, and redistribution, provided the license text and attribution notices are kept. Hugging Face also released the SFT dataset and the fine-tuning recipe, so the training process can be reproduced and extended.

Frequently asked questions

What is SmolLM2-135M-Instruct?

SmolLM2-135M-Instruct is a 135-million-parameter instruction-tuned language model from Hugging Face's SmolLM2 family. It uses a Llama-style transformer decoder architecture and was post-trained with supervised fine-tuning followed by Direct Preference Optimization on UltraFeedback. It is the smallest of the three SmolLM2 sizes (135M, 360M, 1.7B) and targets on-device chat and text tasks.

What hardware do I need to run it?

No dedicated GPU is required. With only 135M parameters and a memory footprint of roughly 720 MB in its default precision, SmolLM2-135M-Instruct runs on CPU and on resource-constrained devices, and a few GB of RAM is enough. CPU inference generates tokens more slowly than a GPU would. Quantized GGUF builds shrink the footprint further for laptops, phones, and edge hardware.

Can I use SmolLM2-135M-Instruct commercially?

Yes. SmolLM2-135M-Instruct is released under the Apache 2.0 license, which permits commercial and private use, modification, and redistribution as long as the license and notices are preserved. The weights can be downloaded freely from Hugging Face, which also published the training recipe and SFT dataset.

What context length and languages does it support?

SmolLM2-135M-Instruct supports a context window of 8K tokens, extended from the original 2K during pretraining by adjusting the data mix and the RoPE base. The model primarily understands and generates English; it was not trained for broad multilingual use, so quality drops sharply in other languages.
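
Because the prompt and the generated tokens share the same window, long inputs need trimming before generation. The sketch below shows one simple policy, keeping the most recent tokens. It assumes "8K" means 8192 tokens and works on a plain list of token IDs; `fit_to_context` is a hypothetical helper, not a Transformers function.

```python
CONTEXT_WINDOW = 8192  # assumption: "8K tokens" read as 8 * 1024


def fit_to_context(prompt_ids: list[int], max_new_tokens: int,
                   context_window: int = CONTEXT_WINDOW) -> list[int]:
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Drop the oldest tokens first; prompts within budget are returned unchanged.
    return prompt_ids[-budget:]
```

Cutting raw token IDs from the front can remove the chat template's opening tokens or a system message, so an application would more likely drop whole older messages before applying the chat template. The budget arithmetic stays the same either way.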

What is it best suited for?

It suits lightweight on-device tasks such as basic chat, instruction following, text rewriting, and summarization, especially where privacy or offline operation matters. Its IFEval and MT-Bench scores improved markedly over SmolLM-135M-Instruct, but at 135M parameters it is weak on factual knowledge and math (GSM8K around 1.4). Verify its output and reserve it for simple, constrained tasks rather than open-ended reasoning. Function calling is available only on the larger 1.7B variant.