NVIDIA-Nemotron-Nano-9B-v2

A 9B hybrid Mamba2-Transformer reasoning model from NVIDIA with toggleable thinking, 128K context, and tool calling.

pip install -U "transformers>=4.48" accelerate
huggingface-cli download nvidia/NVIDIA-Nemotron-Nano-9B-v2

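# Assumes an OpenAI-compatible server (for example, vLLM) is already serving the model on localhost:8000.
curl http://localhost:8000/v1/chat/completions \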
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "messages": [{"role": "user", "content": "Solve: integrate x^2 dx"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 1024
  }'

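import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
# Load in bfloat16 and let accelerate place the weights on the available GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Explain the Mamba-2 layer in one paragraph."}]
# return_dict=True yields input_ids and attention_mask; move both to the model's device.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0]))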

// Run as an ES module: the snippet uses top-level await.
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "local" });
const res = await client.chat.completions.create({
  model: "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
  messages: [{ role: "user", content: "Write a Python function to check if a number is prime." }],
  temperature: 0.6,
  top_p: 0.95,
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
Llama-3.3-Nemotron-Super-49B-v1.5	Reasoning, chat, agentic tasks	128K	Text

At a glance

License: NVIDIA Open Model License (commercial use allowed)
Context length: 128K tokens
Architecture: Mamba2-Transformer hybrid (Nemotron-H)
Languages: English, German, Spanish, French, Italian, Japanese
Minimum hardware: ~10-12 GB VRAM at 4-bit; runs the full 128K context on a single A10G (22 GiB) in bf16
Strengths: toggleable reasoning, math, code, tool calling

Overview

NVIDIA-Nemotron-Nano-9B-v2 is a 9-billion-parameter language model that NVIDIA trained from scratch and released on Hugging Face on August 18, 2025. NVIDIA first pretrained a 12B base model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens with an FP8 recipe, then compressed and distilled it down to 9B parameters using its Minitron strategy. The pretraining data has a cutoff of September 2024. The model belongs to the Nemotron family and is built on the Nemotron-H architecture, a Mamba2-Transformer hybrid that replaces most self-attention layers with Mamba-2 layers and keeps only four attention layers. That design speeds up generation of long reasoning traces.

What it's good at

This is a unified reasoning and chat model: it can produce an explicit thinking trace before its final answer, and that behavior can be turned off through the system prompt (with a small accuracy cost on hard prompts). It also supports a runtime thinking-budget control that caps how many tokens it spends reasoning. On NVIDIA's published benchmarks in Reasoning-On mode it edges out Qwen3-8B, scoring 72.1% on AIME25, 97.8% on MATH500, 64.0% on GPQA, 71.1% on LiveCodeBench, and 90.3% on IFEval. It handles tool and function calling (66.9% on BFCL v3) and supports English, German, Spanish, French, Italian, and Japanese. NVIDIA reports up to 6x higher inference throughput than comparable models in long-output reasoning settings.
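One way to picture the thinking-budget control is as a two-stage generation on the client side: let the model reason for at most a fixed number of tokens, close the reasoning trace if it is still open, then continue to the final answer. The sketch below is a minimal illustration under stated assumptions, not NVIDIA's implementation. `generate` is a hypothetical callable standing in for whatever completion call you use, the `</think>` closing tag is assumed from the model's chat format, and the function name and parameters are invented for this example.

```python
from typing import Callable, Tuple

THINK_CLOSE = "</think>"  # assumed closing tag of the reasoning trace

# Hypothetical interface: generate(text, max_tokens) returns the continuation
# of `text` and a flag that is True when generation stopped at the token limit.
GenerateFn = Callable[[str, int], Tuple[str, bool]]


def generate_with_thinking_budget(
    generate: GenerateFn,
    prompt: str,
    thinking_budget: int,
    answer_tokens: int,
) -> str:
    """Cap reasoning at `thinking_budget` tokens, then request the answer."""
    # Stage 1: let the model think, but stop once the budget is spent.
    text, hit_limit = generate(prompt, thinking_budget)
    if not hit_limit:
        # The model finished both reasoning and answer within the budget.
        return text

    # Budget exhausted. If the trace is still open, close it ourselves so the
    # model moves on to the final answer instead of reasoning further.
    if THINK_CLOSE not in text:
        text += "\n" + THINK_CLOSE + "\n"

    # Stage 2: continue from the closed trace to produce the final answer.
    continuation, _ = generate(prompt + text, answer_tokens)
    return text + continuation
```

A smaller budget trades accuracy on hard prompts for lower latency, so the value is best tuned per workload rather than fixed globally.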

Running locally

The model targets a single GPU: NVIDIA distilled it specifically to run inference at the full 128K context on one A10G (22 GiB) in bfloat16. At 4-bit quantization it fits in roughly 10-12 GB of VRAM. It runs through Hugging Face transformers (it needs trust_remote_code=True because of the custom Nemotron-H modeling code), and is also served on vLLM, NVIDIA NIM, Together AI, OpenRouter, and Amazon Bedrock. Recommended sampling is temperature 0.6 and top_p 0.95 with reasoning enabled.
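As a rough sanity check on these memory figures, the size of the weights alone can be estimated as parameters times bits per parameter. The sketch below assumes a nominal 9e9 parameters (the exact count may differ slightly) and ignores activations, attention and Mamba state caches, and any layers a quantizer leaves unquantized, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


PARAMS = 9e9  # nominal parameter count (assumption for this estimate)

for label, bits in [("bf16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(PARAMS, bits):.1f} GiB of weights")
# bf16: ~16.8 GiB of weights
# 4-bit: ~4.2 GiB of weights
```

Under these assumptions the bf16 weights leave roughly 5 GiB of headroom on a 22 GiB A10G for runtime state. The raw 4-bit weights come out well below the quoted 10-12 GB, which presumably also covers runtime overhead and context state.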

License

Use is governed by the NVIDIA Open Model License Agreement, and NVIDIA states the model is ready for commercial use. This is NVIDIA's own license rather than a standard OSI license such as Apache 2.0 or MIT; it permits commercial deployment and derivative works under its terms.

Frequently asked questions

What is NVIDIA-Nemotron-Nano-9B-v2?

NVIDIA-Nemotron-Nano-9B-v2 is a 9-billion-parameter language model trained from scratch by NVIDIA and released in August 2025. It is a unified reasoning and chat model that can generate an explicit thinking trace before its final answer, with that reasoning behavior controllable through the system prompt. It uses the Nemotron-H hybrid architecture, combining Mamba-2 layers with a small number of Transformer attention layers for faster long-output inference.
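To illustrate the system-prompt control, the sketch below builds a chat message list with reasoning switched on or off. It assumes the control strings are `/think` and `/no_think`, as described in NVIDIA's model card at the time of writing; check the current card before relying on them. The helper name `build_messages` is hypothetical.

```python
def build_messages(user_prompt: str, thinking: bool = True) -> list[dict[str, str]]:
    """Return chat messages with a system prompt that toggles reasoning."""
    # Assumed control strings; verify against the current model card.
    control = "/think" if thinking else "/no_think"
    return [
        {"role": "system", "content": control},
        {"role": "user", "content": user_prompt},
    ]


# Reasoning off: the model is expected to answer directly, without a trace.
messages = build_messages("What is 17 * 23?", thinking=False)
```

The resulting list can be passed as `messages` to an OpenAI-compatible chat completions endpoint or to `tokenizer.apply_chat_template`.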

What hardware do I need to run it?

NVIDIA distilled the model specifically so it can run inference at the full 128K context on a single NVIDIA A10G GPU with 22 GiB of memory in bfloat16. At 4-bit quantization the model fits in roughly 10-12 GB of VRAM, which brings it within reach of consumer cards like an RTX 3060 12 GB or RTX 4070. It runs through Hugging Face transformers, vLLM, and NVIDIA NIM.

Is it free for commercial use?

Yes. The model weights are published openly on Hugging Face and governed by the NVIDIA Open Model License Agreement, which NVIDIA states makes the model ready for commercial use. It is not released under a standard OSI license such as Apache 2.0 or MIT, so deployment is permitted under the terms of NVIDIA's own license rather than an unrestricted open-source one. It is also offered free to try through hosts like OpenRouter.

What context length and languages does it support?

The model supports a context length of up to 128K tokens for both input and output. Its officially supported languages are English, German, Spanish, French, Italian, and Japanese. It is primarily intended for English and coding tasks, with the other five languages handled as secondary capabilities.

How does it compare to Qwen3-8B?

On NVIDIA's published reasoning-on benchmarks, Nemotron-Nano-9B-v2 scores above Qwen3-8B: 72.1% vs 69.3% on AIME25, 97.8% vs 96.3% on MATH500, 64.0% vs 59.6% on GPQA, and 71.1% vs 59.5% on LiveCodeBench. Beyond accuracy, its Mamba2-Transformer hybrid design gives it up to roughly 6x higher inference throughput than comparable Transformer models in long-output reasoning settings.
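The size of each lead is easier to see as a difference in percentage points. The short sketch below computes the margins from the scores quoted above; the variable names are only for illustration.

```python
# (Nemotron-Nano-9B-v2, Qwen3-8B) scores in percent, as quoted above.
scores = {
    "AIME25": (72.1, 69.3),
    "MATH500": (97.8, 96.3),
    "GPQA": (64.0, 59.6),
    "LiveCodeBench": (71.1, 59.5),
}

# Margin in percentage points, rounded to avoid floating-point noise.
margins = {name: round(nemotron - qwen, 1) for name, (nemotron, qwen) in scores.items()}
largest = max(margins, key=margins.get)

for name, margin in margins.items():
    print(f"{name}: +{margin} points")
print(f"Largest lead: {largest}")
```

On these numbers the lead ranges from 1.5 points on MATH500 to 11.6 points on LiveCodeBench.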