Overview

North-Mini-Code-1.0 is CohereLabs' first model built for developers, and it targets agentic software engineering rather than general chat. The cohere2moe tag points to its architecture: a decoder-only Transformer-based sparse Mixture-of-Experts model with 30.5B total parameters but only about 3B active per token, drawn from 128 experts that activate 8 at a time. That sparse design is what lets a 30B-class coding model run on a single workstation GPU instead of a server rack.

The local-AI angle is the point. With Atomic Chat you load the weights once and the model runs fully on your own hardware: no API key, no per-token billing, no code leaving the machine. It works offline after download, which matters for proprietary repositories and air-gapped setups where sending source files to a hosted endpoint isn't allowed.

What it is good at

North-Mini-Code-1.0 was post-trained on real software engineering and terminal tasks, with native tool use and interleaved thinking. That shapes what it does well:

Agentic coding — it plans, edits files, and runs terminal commands across long task sequences, which suits it to coding agents like OpenCode rather than one-shot snippet generation.
Tool calling and reasoning — built-in tool_calling and thinking let it decide when to call a function, read the result, and reason through the next step instead of guessing in one pass.
Repository-scale work — the long context window keeps many files and a full agent trajectory in view at once, so it can trace a bug across modules or refactor against the whole codebase.

Running it locally

The model is 30.5B parameters with a context length of 500,000 tokens. The full-precision weights want a single H100 80GB (FP8) or 2x A100 40GB (BF16), but the community has published quantized builds: Unsloth GGUFs range from roughly 9GB up to full BF16, so smaller machines can load a lower-bit quant. Download the official weights with:

huggingface-cli download CohereLabs/North-Mini-Code-1.0

For serving, vLLM and SGLang support the cohere2moe architecture today; llama.cpp and Ollama need a build that includes the 128-expert support. In Atomic Chat you pick the quant that fits your VRAM and load it with one click, then chat or wire it into a coding agent locally.

License

North-Mini-Code-1.0 is released under the apache-2.0 license. That permits commercial use, modification, and redistribution, with patent protection and only an attribution requirement. Cohere also asks users to follow its Acceptable Use Policy alongside the license.

Frequently asked questions

North-Mini-Code-1.0 is CohereLabs' first model aimed at developers, a 30.5B-parameter Mixture-of-Experts model with about 3B active parameters built for agentic software engineering. It has native tool use and interleaved thinking, so it can plan, edit files, and run terminal commands as a coding agent. It's open-weight under the Apache 2.0 license and can run fully on your own hardware through Atomic Chat.

The full-precision model targets a single H100 80GB GPU in FP8, or 2x A100 40GB in BF16. Because it's a sparse MoE with only ~3B active parameters, quantized GGUF builds shrink the footprint a lot, ranging from roughly 9GB up to full BF16, so a consumer GPU can load a lower-bit quant. In Atomic Chat you choose the quant that fits your available VRAM.

Yes. It's released under the Apache 2.0 license, which allows commercial use, modification, and redistribution at no cost. Running it locally through Atomic Chat means there are no API fees or per-token charges. Cohere does ask that you also follow its Acceptable Use Policy.

Yes. Once you download the weights, inference runs entirely on your machine with no internet connection required. You can pull the files on a connected computer, move them to an air-gapped environment, and run the model there. That keeps proprietary source code on your own hardware, which is the main reason to use it locally in Atomic Chat.

Download the weights with huggingface-cli download CohereLabs/North-Mini-Code-1.0, or grab a quantized GGUF build for smaller hardware. For serving, vLLM and SGLang support its cohere2moe architecture today, while llama.cpp and Ollama need a build that includes the 128-expert support. The simplest path is Atomic Chat, where you pick a quant and load it with one click, then use it for chat or wire it into a coding agent.

North-Mini-Code-1.0

More models

At a glance

Overview

What it is good at

Running it locally

License

Frequently asked questions