Nex-N2-Pro

Updated
24.06.2026
Tools
Thinking
Vision
Reasoning
Code

huggingface-cli download nex-agi/Nex-N2-Pro
from transformers import AutoModel
model = AutoModel.from_pretrained("nex-agi/Nex-N2-Pro")

More models

NameSize / UsageContextInput
Nex-N2-mini
256KText, Image

At a glance

  • License: Apache 2.0
  • Context length: 256K tokens
  • Languages: Multilingual
  • Minimum hardware: ~222 GB VRAM
  • Strengths: reasoning, coding and on-device inference

Overview

Nex-N2-Pro is an open-weight large language model from nex-agi, released in June 2026 under the Apache 2.0 license. It uses a mixture-of-experts (MoE) design built on the Qwen3.5 architecture, with 396.8B total parameters but only about 17B active per token. That sparsity is what makes it tractable to run yourself: you get the capacity of a frontier-scale model while paying inference cost closer to a 17B dense model.

The model handles both text and images as input and produces text. In Atomic Chat it runs fully on-device, so prompts, code, and documents never leave your machine. There is no API key, no usage metering, and no network round-trip once the weights are downloaded, which means it keeps working offline.

What it is good at

Nex-N2-Pro is tuned for agentic work and carries capabilities for reasoning, tool calling, vision, and code. A few concrete things it does well:

  • Coding and debugging — it writes, reads, and fixes code across a repository, and is built for the plan-implement-debug loop rather than one-off snippets.
  • Tool calling and agent loops — it can call functions and chain multi-step tool use, so you can wire it into local scripts, file operations, or a research workflow.
  • Vision and long-context reasoning — it reads images alongside text and reasons over inputs up to 262,144 tokens, enough to hold large codebases or long documents in a single session.

Running it locally

At 396.8B total parameters the full-precision weights are large, but the MoE layout and quantization bring it within reach of high-memory workstations. A 4-bit GGUF build lands around 214-256 GB of combined memory, which runs on a 256 GB Mac Studio or on a 24 GB GPU paired with large system RAM using llama.cpp MoE offloading. The 262,144-token context is available locally, bounded by how much memory you can spare for the KV cache.

huggingface-cli download nex-agi/Nex-N2-Pro

You can load it with Transformers or vLLM for scripted use, run quantized GGUF builds through llama.cpp, or open it in Atomic Chat, which downloads the weights and sets up the runtime in one click.

License

Nex-N2-Pro ships under Apache 2.0. You can use it commercially, modify the weights, fine-tune it, and redistribute your own builds, as long as you keep the license and attribution notices. There is no per-token fee and no separate commercial agreement to sign.

Desktop
macOS
(M1 or better)
Download
Windows
(x64)
Download
Linux
(x86_64)
Download

Frequently asked questions

Nex-N2-Pro is an open-weight mixture-of-experts language model from nex-agi, released in June 2026 and built on the Qwen3.5 architecture. It has 396.8B total parameters with about 17B active per token, accepts text and images, and is tuned for coding, tool use, and long agentic workflows. It runs under the Apache 2.0 license, so you can download and use it for free.

A 4-bit quantized build needs roughly 214-256 GB of combined memory. That fits a Mac Studio with 256 GB of unified memory, or a PC with a 24 GB GPU and large system RAM using llama.cpp MoE offloading, where reports show 25+ tokens per second. Because only 17B parameters activate per token, you do not need a multi-GPU server to get usable speeds.

Yes. The weights are released under the Apache 2.0 license, which permits free personal and commercial use, modification, and redistribution. Running it locally in Atomic Chat costs nothing beyond your own hardware and electricity, with no API fees or per-token charges.

Yes. Once the weights are downloaded, Nex-N2-Pro runs entirely on your machine with no network connection required. In Atomic Chat every prompt and response stays on-device, so your code and documents are never sent to a server.

It is built for agentic engineering: writing and debugging code, calling tools, and running multi-step workflows on its own. It reported 80.8 on SWE-Bench Verified, a benchmark of fixing real bugs in real repositories, which puts it in useful day-to-day coding territory. Its 262,144-token context and vision support also let it reason over large codebases, long documents, and images.