gemma-4-26B-A4B-it

Updated: 24.06.2026
Thinking, Embedding, Vision, Audio, Reasoning, Code, Multilingual

# Download the weights from the command line
huggingface-cli download google/gemma-4-26B-A4B-it

# Load the model in Python with Transformers
from transformers import AutoModel
model = AutoModel.from_pretrained("google/gemma-4-26B-A4B-it")

More models

Name | Size / Usage | Context | Input
gemma-4-31B-it | | 256K | Text, Image, Audio
gemma-4-12B-it | | 125K | Text, Image, Audio

At a glance

  • License: Apache 2.0
  • Context length: 256K tokens
  • Languages: Multilingual
  • Minimum hardware: ~15 GB VRAM
  • Strengths: reasoning, coding and on-device inference

Overview

gemma-4-26B-A4B-it is an instruction-tuned model from Google, part of the Gemma 4 family. Its name encodes the architecture: it is a Mixture-of-Experts (MoE) model with 26.5B total parameters but only about 4B active per token, routing each forward pass through a small subset of 128 fine-grained experts. The practical result is that it runs close to the speed of a 4B model while keeping the knowledge of a much larger one.
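
The routing idea can be illustrated with a minimal sketch. It is not Gemma's actual router: the random logits stand in for a learned router layer, the top-8 selection uses the figure quoted in the FAQ, and normalizing the weights with a softmax over only the selected experts is an assumption made for illustration.

```python
import math
import random

NUM_EXPERTS = 128  # expert count quoted in the text
TOP_K = 8          # experts selected per token (figure quoted in the FAQ)


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    # Indices of the top_k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only (an assumption), so weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Random logits stand in for the output of a learned router layer.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_token(logits)

print(f"experts used for this token: {len(chosen)} of {NUM_EXPERTS}")
print(f"share of total parameters active (4B of 26.5B): {4 / 26.5:.0%}")
```

Only the selected experts run for a given token, which is why compute cost tracks the active parameter count rather than the total.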

On atomic.chat the model runs through Atomic Chat, a free open-source desktop app that loads open-weight LLMs directly on your machine. Nothing is sent to a server. Once the weights are downloaded, gemma-4-26B-A4B-it works fully offline on your own hardware, so prompts and files stay local.

What it is good at

The model is multimodal and tuned for reasoning, with a 262144-token context window that handles long documents and codebases in a single pass.

  • Code — generation, completion, and correction across languages, plus agentic tool-use workflows that call functions and parse structured output.
  • Reasoning and thinking — a configurable thinking mode where the model writes an internal reasoning pass before its final answer, which helps on multi-step math and logic problems.
  • Vision and multimodal input — it reads images and short video alongside text, so you can ask questions about a screenshot, a chart, or a diagram locally.
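
The tool-use workflow mentioned in the first bullet can be sketched as follows. Everything here is hypothetical: the JSON shape with `name` and `arguments` keys, the `get_weather` tool, and the `dispatch_tool_call` helper are illustrative assumptions, not the format Gemma 4 or Atomic Chat actually emits.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"(stub) weather for {city}"


# Registry mapping tool names the model may request to Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(model_output: str) -> str:
    """Parse a JSON tool call (assumed format) and run the matching function."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as err:
        raise ValueError(f"model output is not valid JSON: {err}") from err
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](**args)


# Assumed example of structured output from the model.
example_output = '{"name": "get_weather", "arguments": {"city": "Lisbon"}}'
result = dispatch_tool_call(example_output)
print(result)
```

Validating the name against a fixed registry before calling anything keeps the model from invoking functions the application did not expose.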

Running it locally

Quantization keeps the 26.5B-parameter weights compact. A Q4_K_M build of gemma-4-26B-A4B-it fits in roughly 18GB of VRAM, which a single 24GB consumer GPU or an Apple Silicon Mac with enough unified memory can hold. The KV cache grows with conversation length, up to the 262144-token context limit, so leave memory headroom beyond the weights themselves.
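
A rough back-of-the-envelope estimate shows where the memory goes. This is a sketch under stated assumptions: the 4.85 bits per weight is an assumed average for Q4_K_M, and the layer count, KV-head count, and head dimension are hypothetical placeholders rather than published Gemma 4 values. The KV formula also assumes full attention in every layer and a 16-bit cache, so real usage may be lower.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per token, head, and layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# 26.5B parameters comes from the text; 4.85 bits/weight is an assumption.
weights = weight_bytes(26.5e9, bits_per_weight=4.85)
print(f"weights: {weights / GIB:.1f} GiB")

# Hypothetical architecture numbers, chosen only to show the scaling.
for n_tokens in (8_192, 65_536, 262_144):
    cache = kv_cache_bytes(n_tokens, n_layers=30, n_kv_heads=8, head_dim=128)
    print(f"{n_tokens:>7} tokens: KV cache {cache / GIB:.2f} GiB")
```

The cache grows linearly with the number of tokens held in context, which is why short chats fit comfortably while very long ones need the extra headroom.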

huggingface-cli download google/gemma-4-26B-A4B-it

The model has day-one support in Transformers and vLLM, plus llama.cpp, MLX, and Ollama. In Atomic Chat you can load it with one click, without writing any setup code.

License

gemma-4-26B-A4B-it is released under the Apache 2.0 license. That permits free use, modification, redistribution, and commercial deployment, including running the weights privately on your own hardware and fine-tuning them for your own projects.

Desktop downloads

  • macOS (M1 or better)
  • Windows (x64)
  • Linux (x86_64)

Frequently asked questions

What is gemma-4-26B-A4B-it?
It is an instruction-tuned Mixture-of-Experts model from Google in the Gemma 4 family. It has 26.5B total parameters but activates only about 4B per token through top-8 routing over 128 experts, so it runs roughly as fast as a 4B model. It is multimodal, supports a 262144-token context, and handles code, reasoning, and vision.

What hardware does it need?
A 4-bit (Q4_K_M) build of gemma-4-26B-A4B-it needs around 18GB of VRAM, which a single 24GB consumer GPU can hold. Apple Silicon Macs with enough unified memory can run it too. Leave extra headroom because the long context window expands the KV cache during longer chats.

Is it free to use?
Yes. The weights are published under the Apache 2.0 license, which allows free personal and commercial use, modification, and redistribution. Atomic Chat, the app that runs it on atomic.chat, is also free and open-source.

Does it work offline?
Yes. After you download the weights once, gemma-4-26B-A4B-it runs entirely on your machine with no internet connection required. Prompts, documents, and images never leave your device, which is the point of running it locally in Atomic Chat.

What is it best at?
It is strongest at coding, multi-step reasoning, and agentic tool-use, and it has a configurable thinking mode that reasons before answering. Its vision support lets it analyze images and short video, and the 262144-token context makes it useful for long documents and large codebases.