Overview
Llama-3.3-Nemotron-Super-49B-v1.5 is a reasoning and chat model released by NVIDIA on July 25, 2025, as part of the Llama Nemotron collection. It is a derivative of Meta's Llama-3.3-70B-Instruct: NVIDIA applied a Neural Architecture Search (NAS) method called Puzzle to compress the 70B reference model down to 49B parameters. The search produces non-standard blocks where some attention layers are skipped or replaced by a linear layer and the FFN width varies per block, trimming memory so the model fits on a single H100-80GB or H200 GPU. It supports a 128K-token context and is an upgrade over the earlier v1 checkpoint.
What it's good at
The model is built for reasoning, instruction following, and agentic work such as RAG and tool calling. Post-training combined supervised fine-tuning on math, code, science, and tool use with several reinforcement-learning stages (RPO for chat, RLVR for reasoning, and iterative DPO for tool calling). On NVIDIA's own evaluations in reasoning-on mode it scores 97.4 on MATH500, 87.5 on AIME 2024, 73.58 on LiveCodeBench, 71.97 on GPQA, and 71.75 on BFCL v3 for function calling. It also ships a dual mode: an empty system prompt gives a full thinking trace, while /no_think turns reasoning off for faster direct answers.
Running locally
The weights run with Transformers (using trust_remote_code) and serve well under vLLM 0.9.2. NVIDIA tested it on 2x H100-80GB or 2x A100-80GB and recommends Ampere or Hopper GPUs. Full bf16 inference needs roughly 100 GB of memory across GPUs; with 4-bit quantization it can run on around 48 GB. Recommended sampling is temperature 0.6 and top-p 0.95 for reasoning on, and greedy decoding for reasoning off. A tool-call parser is included in the repo for vLLM's auto tool-choice mode.
License
Use is governed by the NVIDIA Open Model License, with the additional Llama 3.3 Community License Agreement because the model is built on Llama. NVIDIA states the model is ready for commercial use. The weights are openly downloadable from Hugging Face, so you can self-host and deploy it in production subject to those license terms.
