Overview
NVIDIA-Nemotron-Nano-9B-v2 is a 9-billion-parameter language model that NVIDIA trained from scratch and released on Hugging Face on August 18, 2025. It was produced by pretraining a 12B base model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens with an FP8 recipe, then compressing and distilling it down using NVIDIA's Minitron strategy. The pretraining data has a cutoff of September 2024. The model is part of the Nemotron family and is built on the Nemotron-H architecture, a Mamba2-Transformer hybrid that replaces most self-attention layers with Mamba-2 layers and keeps only four attention layers, which speeds up generation of long reasoning traces.
What it's good at
This is a unified reasoning and chat model: it can produce an explicit thinking trace before its final answer, and that behavior can be turned off through the system prompt (with a small accuracy cost on hard prompts). It also supports a runtime thinking-budget control that caps how many tokens it spends reasoning. On NVIDIA's published benchmarks in Reasoning-On mode it edges out Qwen3-8B, scoring 72.1% on AIME25, 97.8% on MATH500, 64.0% on GPQA, 71.1% on LiveCodeBench, and 90.3% on IFEval. It handles tool and function calling (66.9% on BFCL v3) and supports English, German, Spanish, French, Italian, and Japanese. NVIDIA reports up to 6x higher inference throughput than comparable models in long-output reasoning settings.
Running locally
The model targets a single GPU: NVIDIA distilled it specifically to run inference at the full 128K context on one A10G (22 GiB) in bfloat16. At 4-bit quantization it fits in roughly 10-12 GB of VRAM. It runs through Hugging Face transformers (it needs trust_remote_code=True because of the custom Nemotron-H modeling code), and is also served on vLLM, NVIDIA NIM, Together AI, OpenRouter, and Amazon Bedrock. Recommended sampling is temperature 0.6 and top_p 0.95 with reasoning enabled.
License
Use is governed by the NVIDIA Open Model License Agreement. NVIDIA states the model is ready for commercial use, and the license permits commercial deployment and derivative work under its terms rather than a standard OSI license like Apache 2.0 or MIT.
