gemma-4-31B-it

Updated
24.06.2026
Thinking
Embedding
Vision
Audio
Reasoning
Code
Multilingual

huggingface-cli download google/gemma-4-31B-it
from transformers import AutoModel
model = AutoModel.from_pretrained("google/gemma-4-31B-it")

More models

NameSize / UsageContextInput
gemma-4-26B-A4B-it
256KText, Image, Audio
gemma-4-12B-it
125KText, Image, Audio

At a glance

  • License: Apache 2.0
  • Context length: 256K tokens
  • Languages: Multilingual
  • Minimum hardware: ~19 GB VRAM
  • Strengths: reasoning, coding and on-device inference

Overview

gemma-4-31B-it is an instruction-tuned open-weight model from Google, part of the Gemma 4 family built by Google DeepMind. It carries 32.7B parameters and a 262,144-token context window, and the HuggingFace tags mark it as an image-text-to-text model derived from the base_model google/gemma-4-31B. That puts vision and long-context handling at the center of what it does, not bolted on after the fact.

In Atomic Chat the model runs fully on your own machine. Weights stay on local storage, inference happens on your hardware, and nothing about a prompt leaves the device. That is the practical appeal of gemma-4-31B-it for anyone who wants a capable multimodal model without sending documents or images to a remote API.

What it is good at

The capability set covers vision, reasoning, code, embeddings, and multilingual text, so a single local model handles a wide span of work:

  • Document and image understanding — read PDFs, parse charts, run OCR including handwriting, and answer questions about screenshots or UI captures, with images accepted at varying resolutions.
  • Step-by-step reasoning and tool use — a built-in thinking mode works through problems before answering, and native structured tool calling drives agentic workflows on-device.
  • Code and multilingual text — write and explain code, and work across more than 140 languages for translation, summarization, and drafting without a network call.

Running it locally

At 32.7B parameters the model is heavy but reachable on a single high-VRAM card. A Q4 quantization sits around 18-19GB and runs comfortably on a 24GB GPU like an RTX 3090 or RTX 4090; Q8 is roughly 32.6GB. The 262,144-token context comes at a cost — a full-length KV cache can add around 22GB on top of the weights, so plan memory around the context length you actually use. On Apple Silicon, unified memory of 32GB or more covers the model plus OS overhead.

huggingface-cli download google/gemma-4-31B-it

From there you can load it through Transformers or serve it with vLLM, or skip the setup and open it in Atomic Chat with one click, which downloads and runs the model offline for you.

License

gemma-4-31B-it is released under the apache-2.0 license. That permits free personal and commercial use with no fee or agreement with Google — you only need to keep the license text in your distribution and note any changes you make, which makes it straightforward to approve for production work.

Desktop
macOS
(M1 or better)
Download
Windows
(x64)
Download
Linux
(x86_64)
Download

Frequently asked questions

It is an instruction-tuned open-weight model from Google's Gemma 4 family, built by Google DeepMind, with 32.7B parameters and a 262,144-token context window. The HuggingFace tags list it as an image-text-to-text model, so it handles both images and text, and it adds reasoning, code, and multilingual support. In Atomic Chat it runs entirely on your own hardware, offline and private.

A 4-bit (Q4) quantization needs roughly 18-19GB of VRAM, and a 24GB card such as an RTX 3090 or RTX 4090 is a comfortable target for daily use. An 8-bit (Q8) build is around 32.6GB. Keep in mind that long contexts add a large KV cache — near the full 262K window it can grow by about 22GB on top of the weights.

Yes. The model is released under the Apache 2.0 license, which allows free personal and commercial use with no licensing fee and no agreement with Google. You only need to include the license text when you redistribute it and note any modifications you make.

Yes. Once the weights are downloaded, gemma-4-31B-it runs fully on your machine with no network connection. Atomic Chat loads it locally, so prompts, documents, and images stay on your device and nothing is sent to a remote server.

It is strong at multimodal understanding — reading documents and PDFs, OCR and handwriting, chart and screenshot analysis — alongside step-by-step reasoning, code, and structured tool use for agentic tasks. Its 140-plus language support and 262,144-token context make it useful for long, multilingual work that you want to keep on local hardware.