gemma-4-12B-it

Updated
24.06.2026
Thinking
Embedding
Vision
Audio
Reasoning
Code
Multilingual

huggingface-cli download google/gemma-4-12B-it
from transformers import AutoModel
model = AutoModel.from_pretrained("google/gemma-4-12B-it")

More models

NameSize / UsageContextInput
gemma-4-31B-it
256KText, Image, Audio
gemma-4-26B-A4B-it
256KText, Image, Audio

At a glance

  • License: Apache 2.0
  • Context length: 125K tokens
  • Languages: Multilingual
  • Minimum hardware: ~7 GB VRAM
  • Strengths: reasoning, coding and on-device inference

Overview

gemma-4-12B-it is a 12B-parameter instruction-tuned model from Google, part of the Gemma 4 family. Its tags mark it as a gemma4_unified, encoder-free multimodal model: instead of bolting separate vision and audio encoders onto a language model, it projects raw image patches and audio waveforms straight into the embedding space. The base_model:google/gemma-4-12B tag shows this instruction-tuned release is fine-tuned on top of the Gemma 4 12B base, with a 128,000-token context window.

The point of running it in Atomic Chat is that everything happens on your own machine. The weights sit on your disk, inference runs on your CPU or GPU, and no prompt leaves the device. That makes gemma-4-12B-it a fit for private notes, confidential code, and offline work where sending data to a hosted API is not an option.

What it is good at

The model carries reasoning, vision, audio, code, and multilingual capabilities, so a single local model covers tasks that used to need several:

  • Multimodal Q&A — the unified architecture reads text, images, and audio in the same prompt, so you can ask about a screenshot, a chart, or a recorded clip without a separate vision model.
  • Step-by-step reasoning and code — a built-in thinking mode lets it work through a problem before answering, which helps with math, debugging, and generating or explaining code.
  • Multilingual drafting — Gemma 4 supports well over 140 languages, so translation, summarization, and writing across languages run on the same local weights.

Running it locally

At 12B parameters the model is small enough for a recent laptop. A 4-bit quant weighs roughly 6.7 GB and runs on an 8 GB GPU, while 16 GB of VRAM or unified memory gives comfortable headroom once you push toward the 128,000-token context, since the KV cache grows with prompt length. Apple M-series Macs with 16-32 GB of unified memory handle it well because the whole memory pool is available to the model.

huggingface-cli download google/gemma-4-12B-it

You can load the downloaded weights with Hugging Face Transformers or serve them through vLLM for higher throughput on a dedicated GPU. In Atomic Chat the model appears in the catalog and downloads with one click, then runs offline from the app.

License

gemma-4-12B-it is released under the apache-2.0 license. That permits free use, modification, redistribution, and commercial deployment, including fine-tuning your own variant, as long as you keep the license and attribution notices. You only pay for the hardware you run it on.

Desktop
macOS
(M1 or better)
Download
Windows
(x64)
Download
Linux
(x86_64)
Download

Frequently asked questions

gemma-4-12B-it is a 12B-parameter, instruction-tuned model in Google's Gemma 4 family. It uses an encoder-free unified architecture that takes text, images, and audio in the same prompt, and it ships with a 128,000-token context window. The "it" marks it as the instruction-tuned build, fine-tuned on top of the Gemma 4 12B base for chat and task following.

A 4-bit quantized build weighs about 6.7 GB, so it fits on an 8 GB GPU for short prompts. For comfortable use with longer context, aim for 16 GB of VRAM or unified memory, because the KV cache grows as the prompt gets longer. Apple M-series Macs with 16-32 GB of unified memory are a good match since the full memory pool is usable.

Yes. It is released under the apache-2.0 license, so the weights are free to download, run, modify, and use commercially. The only cost is the hardware or cloud you run it on, and in Atomic Chat it runs locally for free.

Yes. Once the weights are downloaded, the model runs entirely on your own CPU or GPU with no internet connection. In Atomic Chat every prompt and response stays on your device, which suits private, confidential, or air-gapped work.

It handles general Q&A, code generation and debugging, summarization, and image or audio analysis from a single local model. The built-in thinking mode helps with step-by-step reasoning and math, and support for 140+ languages covers translation and multilingual writing. For most everyday tasks it is fast, private, and free to run locally.