Llama-3.3-Nemotron-Super-49B-v1.5

A 49B reasoning and chat LLM from NVIDIA with a 128K context, derived from Llama-3.3-70B-Instruct via Neural Architecture Search.

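pip install -U "transformers" "vllm==0.9.2"
huggingface-cli download nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
# Serve on port 5000 under the short model name that the client examples use
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 \
  --served-model-name Llama-3_3-Nemotron-Super-49B-v1_5 \
  --port 5000 \
  --trust-remote-code --max-model-len 65536 --tensor-parallel-size 2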

curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3_3-Nemotron-Super-49B-v1_5",
    "messages": [
      {"role": "system", "content": ""},
      {"role": "user", "content": "What is 18% of 100?"}
    ],
    "temperature": 0.6,
    "top_p": 0.95
  }'

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
# An empty system prompt leaves reasoning on
messages = [{"role": "system", "content": ""}, {"role": "user", "content": "Explain the NAS approach in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature and top_p to take effect
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:5000/v1", apiKey: "dummy" });
const res = await client.chat.completions.create({
  model: "Llama-3_3-Nemotron-Super-49B-v1_5",
  messages: [
    { role: "system", content: "" },
    { role: "user", content: "Write a function to reverse a linked list." }
  ],
  temperature: 0.6,
  top_p: 0.95
});
console.log(res.choices[0].message.content);

More models

Name	Size / Usage	Context	Input
NVIDIA-Nemotron-Nano-9B-v2	Reasoning, agents, chat	128K	Text

At a glance

License: NVIDIA Open Model License (+ Llama 3.3 Community License)
Context length: 128K tokens
Parameters: 49B dense, distilled from Llama-3.3-70B
Languages: English plus German, French, Italian, Portuguese, Hindi, Spanish, Thai
Minimum hardware: ~48 GB VRAM at 4-bit; fits a single H100/H200
Strengths: math and code reasoning, tool calling, RAG, agentic chat

Overview

Llama-3.3-Nemotron-Super-49B-v1.5 is a reasoning and chat model released by NVIDIA on July 25, 2025, as part of the Llama Nemotron collection. It is a derivative of Meta's Llama-3.3-70B-Instruct: NVIDIA applied a Neural Architecture Search (NAS) method called Puzzle to compress the 70B reference model down to 49B parameters. The search produces non-standard blocks where some attention layers are skipped or replaced by a linear layer and the FFN width varies per block, trimming memory so the model fits on a single H100-80GB or H200 GPU. It supports a 128K-token context and is an upgrade over the earlier v1 checkpoint.
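
To make the idea of non-uniform blocks concrete, here is a minimal sketch that counts weights for a few block variants. The reference dimensions are those of Llama-3.3-70B (hidden size 8192, 64 query heads, 8 key/value heads, head dimension 128, FFN width 28672). Everything else is an assumption for illustration: the `BlockSpec` type, the `ffn_scale` values, the use of a single square projection as the linear replacement, and the example block mix are hypothetical and do not reflect NVIDIA's actual Puzzle configuration.

```python
from dataclasses import dataclass
from typing import Literal

# Reference dimensions of Llama-3.3-70B.
HIDDEN = 8192
N_HEADS = 64
N_KV_HEADS = 8
HEAD_DIM = 128
FFN_WIDTH = 28672


@dataclass(frozen=True)
class BlockSpec:
    """Hypothetical description of one transformer block after a NAS search."""
    attention: Literal["full", "linear", "skip"]
    ffn_scale: float  # fraction of the reference FFN width kept in this block


def attention_params(kind: str) -> int:
    """Weight count of the attention sub-layer (norms and biases ignored)."""
    if kind == "full":
        q_and_o = 2 * HIDDEN * (N_HEADS * HEAD_DIM)
        k_and_v = 2 * HIDDEN * (N_KV_HEADS * HEAD_DIM)
        return q_and_o + k_and_v
    if kind == "linear":
        # Assumption: one HIDDEN x HIDDEN projection stands in for attention.
        return HIDDEN * HIDDEN
    if kind == "skip":
        return 0
    raise ValueError(f"unknown attention kind: {kind}")


def block_params(spec: BlockSpec) -> int:
    """Weight count of one block: attention variant plus a SwiGLU FFN."""
    ffn_width = int(FFN_WIDTH * spec.ffn_scale)
    # SwiGLU uses three projections: gate, up, and down.
    return attention_params(spec.attention) + 3 * HIDDEN * ffn_width


reference = BlockSpec("full", 1.0)
# Hypothetical mix of block variants, chosen only to show the effect.
variants = [reference, BlockSpec("linear", 0.5), BlockSpec("skip", 0.25)]
for spec in variants:
    saving = 1 - block_params(spec) / block_params(reference)
    print(f"{spec.attention:>6} attention, FFN x{spec.ffn_scale}: "
          f"{block_params(spec):,} weights ({saving:.0%} smaller)")
```

Because the FFN holds most of a block's weights, shrinking its width saves far more than dropping attention does in this toy accounting.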

What it's good at

The model is built for reasoning, instruction following, and agentic work such as RAG and tool calling. Post-training combined supervised fine-tuning on math, code, science, and tool use with several reinforcement-learning stages (RPO for chat, RLVR for reasoning, and iterative DPO for tool calling). In NVIDIA's own evaluations with reasoning on, it scores 97.4 on MATH500, 87.5 on AIME 2024, 73.58 on LiveCodeBench, 71.97 on GPQA, and 71.75 on BFCL v3 (function calling). It also supports two modes: an empty system prompt produces a full thinking trace, while putting /no_think in the system prompt turns reasoning off for faster, direct answers.
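
The two modes can be selected per request. The sketch below builds an OpenAI-style chat payload for either mode, pairing each with NVIDIA's recommended sampling (temperature 0.6 and top-p 0.95 with reasoning on, greedy decoding with reasoning off). The helper name `build_chat_request` is hypothetical, the model name assumes the server exposes the short name used in the request examples on this page, and sending temperature 0 is assumed to be how the serving stack is asked for greedy decoding.

```python
import json
from typing import Any

# Assumption: the server exposes the model under this short name.
MODEL_NAME = "Llama-3_3-Nemotron-Super-49B-v1_5"


def build_chat_request(prompt: str, reasoning: bool = True) -> dict[str, Any]:
    """Build a chat completion payload for reasoning-on or reasoning-off mode."""
    if reasoning:
        system_prompt = ""  # empty system prompt: thinking trace enabled
        sampling = {"temperature": 0.6, "top_p": 0.95}
    else:
        system_prompt = "/no_think"  # switches the thinking trace off
        sampling = {"temperature": 0.0}  # temperature 0 requests greedy decoding
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        **sampling,
    }


if __name__ == "__main__":
    print(json.dumps(build_chat_request("What is 18% of 100?", reasoning=False), indent=2))
```

The resulting dictionary can be posted as the JSON body of a /v1/chat/completions request.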

Running locally

The weights run with Transformers (using trust_remote_code) and serve well under vLLM 0.9.2. NVIDIA tested it on 2x H100-80GB or 2x A100-80GB and recommends Ampere or Hopper GPUs. Full bf16 inference needs roughly 100 GB of memory across GPUs; with 4-bit quantization it can run on around 48 GB. Recommended sampling is temperature 0.6 and top-p 0.95 for reasoning on, and greedy decoding for reasoning off. A tool-call parser is included in the repo for vLLM's auto tool-choice mode.
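
As a rough sanity check on those memory figures, the weights-only footprint is the parameter count times the bytes per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement; the function name `weight_memory_gb` is hypothetical, and it ignores KV cache, activations, and runtime overhead.

```python
PARAMS = 49e9  # parameter count stated for the model

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.1f} GB of weights")
```

At bf16 this gives about 98 GB, in line with the roughly 100 GB figure. At 4 bits the weights alone come to about 24.5 GB, so the 48 GB figure presumably leaves headroom for the KV cache and other runtime memory, which grow with context length and batch size.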

License

Use is governed by the NVIDIA Open Model License, with the additional Llama 3.3 Community License Agreement because the model is built on Llama. NVIDIA states the model is ready for commercial use. The weights are openly downloadable from Hugging Face, so you can self-host and deploy it in production subject to those license terms.

Frequently asked questions

What is Llama-3.3-Nemotron-Super-49B-v1.5?

Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter reasoning and chat model from NVIDIA, derived from Meta's Llama-3.3-70B-Instruct. NVIDIA used Neural Architecture Search to shrink the 70B reference model to 49B while keeping accuracy high, so it fits on a single H100 or H200 GPU. It supports a 128K-token context and is post-trained for math, code, reasoning, and tool calling.

What hardware do I need to run it?

The model was optimized to fit on a single NVIDIA H100-80GB or H200 GPU at high workloads, and NVIDIA's test hardware was 2x H100-80GB or 2x A100-80GB. Running it in bf16 needs roughly 100 GB of memory; with 4-bit quantization it can run on around 48 GB of VRAM. It serves well with vLLM and Transformers on Ampere or Hopper GPUs.

Is it free for commercial use?

The weights are released openly on Hugging Face and the model is ready for commercial use. Use is governed by the NVIDIA Open Model License, with the additional Llama 3.3 Community License Agreement since the model is built on Llama. You can download and run it yourself at no cost, subject to the terms of those licenses.

How do I turn reasoning on or off?

By default, with an empty system prompt, the model responds in reasoning ON mode and emits a thinking trace. Adding /no_think to the system prompt switches it to reasoning OFF mode. NVIDIA recommends temperature 0.6 and top-p 0.95 for reasoning ON, and greedy decoding for reasoning OFF.

Which languages does it support?

The model is primarily intended for English and programming languages. NVIDIA also lists support for German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Its post-training focused heavily on English single- and multi-turn chat, so English performance is strongest.