diffusiongemma-26B-A4B-it

Updated: 25.06.2026
Tags: Thinking, Vision, Audio, Reasoning, Code, Multilingual

# Shell: download the weights
huggingface-cli download google/diffusiongemma-26B-A4B-it

# Python: load the model with Transformers
from transformers import AutoModel
model = AutoModel.from_pretrained("google/diffusiongemma-26B-A4B-it")

More models

Name                  Context   Input
gemma-4-31B-it        256K      Text, Image, Audio
gemma-4-26B-A4B-it    256K      Text, Image, Audio
gemma-4-12B-it        125K      Text, Image, Audio
gemma-4-E2B-it        125K      Text, Image, Audio
gemma-3-270m          32K       Text
gemma-3-1b-it         32K       Text

At a glance

  • License: Apache 2.0
  • Context length: 256K tokens
  • Languages: Multilingual
  • Minimum hardware: ~15 GB VRAM
  • Strengths: reasoning and on-device inference

Overview

diffusiongemma-26B-A4B-it is an open-weight text diffusion model from Google DeepMind, built on the Gemma 4 architecture. It is a Mixture-of-Experts (MoE) design: 25.8B total parameters, but only about 3.8B activate on each forward pass, which keeps compute per pass far below what the headline size suggests. All 25.8B weights still have to be held in memory, so the memory savings come from quantization rather than from the MoE routing. It is multimodal, accepting text, image, and audio input, and ships under the apache-2.0 license.

What sets it apart is how it writes. Instead of predicting one token at a time, it starts from a block of placeholder tokens and refines them across several denoising passes, generating up to 256 tokens in parallel per pass. Running it inside Atomic Chat keeps every prompt, file, and response on your own machine. No request leaves your device, so it works on a plane, behind a firewall, or anywhere offline.
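To make the refinement loop concrete, here is a toy sketch in Python. It is not the model's actual sampler: `toy_denoiser` is a hypothetical stand-in that reads tokens from a fixed target, and the schedule (commit the most confident share of the remaining positions on each pass) is an assumption chosen only for illustration.

```python
import math

MASK = "<mask>"

def toy_denoiser(block, target):
    """Hypothetical stand-in for the model (assumption, not the real network).

    Proposes a token and a confidence for every masked position. Confidence
    rises when a neighbour on either side is already filled, which mimics
    using context from both directions.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok != MASK:
            continue
        left_known = i > 0 and block[i - 1] != MASK
        right_known = i + 1 < len(block) and block[i + 1] != MASK
        confidence = 0.5 + 0.25 * left_known + 0.25 * right_known
        proposals[i] = (target[i], confidence)
    return proposals

def denoise_block(target, passes=4):
    """Start from an all-placeholder block and fill it over several passes."""
    block = [MASK] * len(target)
    history = [list(block)]
    for step in range(passes):
        proposals = toy_denoiser(block, target)
        if not proposals:
            break  # nothing left to fill
        # Spread the remaining positions evenly over the remaining passes.
        keep = math.ceil(len(proposals) / (passes - step))
        # Most confident first; ties are broken by position.
        ranked = sorted(proposals.items(), key=lambda kv: (-kv[1][1], kv[0]))
        for i, (tok, _) in ranked[:keep]:
            block[i] = tok
        history.append(list(block))
    return block, history

if __name__ == "__main__":
    target = "def add ( a , b ) :".split()
    final, history = denoise_block(target, passes=4)
    for n, state in enumerate(history):
        print(n, " ".join(state))
```

Each pass commits several positions at once instead of one token, which is the property the paragraph above describes. A real sampler would score proposals with the model and may revisit tokens it has already committed.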

What it is good at

The parallel, bidirectional approach gives diffusiongemma-26B-A4B-it a structural edge on tasks where the model needs to see the whole output at once rather than guess left to right.

  • Code infilling: filling a gap in the middle of a file, where the model can attend to code on both sides of the cursor before it writes.
  • Inline editing: rewriting a single sentence or line in place and returning a local replacement quickly, drawing on its code and reasoning capabilities.
  • Multimodal and multilingual work: vision and audio inputs handle screenshots or clips, multilingual support covers prompts in many languages, and the 256K context window leaves room for long documents.
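The infilling case can be pictured as a block in which only the gap is masked while the code on both sides stays visible. The sketch below is illustrative only: `build_infill_block` and `visible_context` are hypothetical helpers, not part of any published API for this model, and whitespace-separated strings stand in for real tokens.

```python
MASK = "<mask>"

def build_infill_block(prefix, suffix, gap_len):
    """Return (tokens, editable) for an infilling request.

    tokens   -- prefix, then gap_len placeholders, then suffix
    editable -- indices the sampler would be allowed to rewrite
    """
    tokens = list(prefix) + [MASK] * gap_len + list(suffix)
    start = len(prefix)
    editable = list(range(start, start + gap_len))
    return tokens, editable

def visible_context(tokens, position):
    """Known tokens to the left and right of `position`.

    With bidirectional attention both lists are available when the
    position is filled; a left-to-right decoder would only see `left`.
    """
    left = [t for t in tokens[:position] if t != MASK]
    right = [t for t in tokens[position + 1:] if t != MASK]
    return left, right

if __name__ == "__main__":
    prefix = ["def", "area", "(", "r", ")", ":"]
    suffix = ["return", "result"]
    tokens, editable = build_infill_block(prefix, suffix, gap_len=4)
    left, right = visible_context(tokens, editable[0])
    print("editable positions:", editable)
    print("left context:", left)
    print("right context:", right)
```

The point of the design is that the suffix (`return result` here) constrains what goes in the gap, which a strictly left-to-right decoder cannot use directly.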

Running it locally

The model has 25.8B parameters and a 256K context length. A 4-bit quant of the 26B-A4B class fits in roughly 18GB of VRAM, which puts it within reach of a 24GB consumer card like an RTX 4090, or a 32GB RTX 5090. All 25.8B weights have to be resident in memory; the ~3.8B active parameters reduce compute per pass, not the size of the weights. Budget extra headroom for the KV cache, since long contexts grow memory use beyond the weights alone. Pull the weights with:

huggingface-cli download google/diffusiongemma-26B-A4B-it

You can load it through Transformers or vLLM for scripted use, or open it in Atomic Chat with one click and start chatting without touching a config file.
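The memory figures in this section can be sanity-checked with quick arithmetic. The sketch below counts weights only, using the parameter counts quoted on this page. How the rest of the roughly 18GB figure splits between quantization metadata, activations, and cache is not stated here, so the sketch does not try to model it.

```python
TOTAL_PARAMS = 25.8e9   # total parameters, from this page
ACTIVE_PARAMS = 3.8e9   # parameters active per pass, from this page

def weight_memory_gb(total_params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9

if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
    share = ACTIVE_PARAMS / TOTAL_PARAMS
    print(f"active share per pass: {share:.1%}")
```

At 4 bits the weights come to about 12.9 GB, which is why a 24GB card is a realistic target once runtime overhead is added. The active share (about 15% of the parameters) affects how much compute each pass needs, not how much memory the weights occupy.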

License

diffusiongemma-26B-A4B-it is released under the apache-2.0 license. That permits commercial use, modification, and redistribution, so you can run it locally, fine-tune it, and ship it inside products without a usage fee.

Desktop

  • macOS (M1 or better)
  • Windows (x64)
  • Linux (x86_64)

Frequently asked questions

What is diffusiongemma-26B-A4B-it?

It is an open-weight text diffusion model from Google DeepMind, based on the Gemma 4 architecture. Rather than writing one token at a time, it denoises a block of up to 256 tokens in parallel, which makes it well suited to code infilling and inline editing. It is a Mixture-of-Experts model with 25.8B total parameters and about 3.8B active per pass.

What hardware do I need to run it?

A 4-bit quant of the 26B-A4B class needs roughly 18GB of VRAM, so a 24GB consumer GPU like an RTX 4090, or a 32GB RTX 5090, can run it comfortably. The MoE design activates only ~3.8B of its 25.8B parameters per pass, which keeps compute low for its size, but all of the weights still have to fit in memory. Leave extra room for the KV cache, which grows with longer context.

Is it free to use?

Yes. It is released under the apache-2.0 license, which allows commercial use, modification, and redistribution at no cost. You can download the weights, run them on your own hardware, and fine-tune the model without paying a usage fee.

Can it run offline?

Yes. Once you download the weights, it runs fully on your own device with no internet connection. Inside Atomic Chat every prompt and response stays local, so nothing is sent to a server and the model works on a plane or behind a firewall.

What is it best at?

Its parallel, bidirectional denoising gives it an edge on tasks that need awareness of surrounding context, such as filling a gap in the middle of a code file or making a quick inline edit to existing text. It also handles structured output like brackets and tags well. On raw benchmark quality it trails standard Gemma 4, so its real advantage is speed and these specific editing tasks.