Llama-3.2-3B-Instruct

Updated
25.06.2026
Tools
Reasoning
Code
Multilingual

huggingface-cli download meta-llama/Llama-3.2-3B-Instruct
from transformers import AutoModel
model = AutoModel.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

More models

NameSize / UsageContextInput
sulphur-2-base
54213 GB421KText
Llama-3.1-8B-Instruct
128KText

At a glance

  • License: Llama3.2
  • Context length: 128K tokens
  • Languages: Multilingual
  • Minimum hardware: ~8 GB VRAM
  • Strengths: lightweight on-device chat and summarization

Overview

Llama-3.2-3B-Instruct is a 3.2-billion-parameter instruction-tuned model from Meta. It uses a dense transformer architecture (no mixture-of-experts) and was tuned for following instructions and holding multi-turn conversations, with a 128K-token context window for long documents and chat history.

The model is small enough to run entirely on your own hardware through Atomic Chat. Your prompts and files stay on the machine, the model answers without an internet connection, and there are no API fees or rate limits once the weights are downloaded.

What it is good at

Meta tuned Llama-3.2-3B-Instruct for multilingual dialogue, summarization, and agentic retrieval. The capabilities reported for it map to a few practical jobs:

  • Tool calling — it can emit structured function calls, so you can wire it to local scripts or a retrieval step and have it decide when to call them.
  • Reasoning and summarization — it condenses long reports or threads inside the 128K context and answers questions grounded in the text you give it.
  • Code — it drafts and explains short functions, shell commands, and config snippets for quick local coding help.
  • Multilingual chat — it handles dialogue across several languages, useful for translation drafts and answering in the user's language.

Running it locally

At 3.2B parameters the model is light. In full FP16 precision it needs roughly 7 GB of VRAM; a 4-bit quantized build drops that to about 1.8-2 GB, which fits on a 6 GB consumer GPU or runs on a modern laptop. The 128K context costs extra memory, so long inputs raise the requirement.

huggingface-cli download meta-llama/Llama-3.2-3B-Instruct

You can load the weights with Hugging Face Transformers or serve them with vLLM, or skip the setup and open the model in Atomic Chat with one click for a fully on-device chat.

License

Llama-3.2-3B-Instruct is released under the Llama 3.2 Community License. It permits commercial and research use, with redistribution and fine-tuning allowed under Meta's acceptable-use terms; very large products (over 700 million monthly active users) need a separate license from Meta. Check the license text before shipping a commercial product on top of it.

Desktop
macOS
(M1 or better)
Download
Windows
(x64)
Download
Linux
(x86_64)
Download

Frequently asked questions

Llama-3.2-3B-Instruct is a 3.2-billion-parameter instruction-tuned language model from Meta, built on a dense transformer architecture with a 128K-token context window. Meta tuned it for following instructions, multilingual dialogue, summarization, and tool use. Its small size makes it a good fit for running on a laptop or single consumer GPU through Atomic Chat.

In full FP16 precision the model needs about 7 GB of VRAM. A 4-bit quantized build cuts that to roughly 1.8-2 GB, so it runs on a 6 GB GPU and even on modern laptops without a dedicated GPU. Using the full 128K context window adds memory on top of those figures.

Yes. The weights are released under the Llama 3.2 Community License, which allows free commercial and research use. Running it locally in Atomic Chat means no API fees or usage limits. Products with more than 700 million monthly active users need a separate license from Meta.

Yes. Once you download the weights, the model runs fully on your own machine with no internet connection. Prompts and files never leave the device, which suits privacy-sensitive work. Atomic Chat loads the model on-device so chats stay local and offline.

It works well for chat assistants, summarizing long documents, and answering questions grounded in text you provide, all within its 128K context. It also supports tool calling and multilingual conversation, so you can connect it to local scripts or use it across several languages. For heavier reasoning or large codebases, a bigger model will perform better.