Overview
Granite-4.0-H-Small is a 32-billion-parameter instruct model from IBM's Granite Team, released on October 2, 2025 as part of the Granite 4.0 language model family. It is finetuned from Granite-4.0-H-Small-Base through supervised finetuning, reinforcement-learning alignment, and model merging. Although it has 32B parameters in total, it is a Mixture-of-Experts model that activates only about 9B of them per token, routing each token through 10 of its 72 experts. The "H" denotes its hybrid Mamba-2/transformer design, which IBM positions for enterprise assistants and RAG workloads.
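To make the "10 of 72 experts" figure concrete, here is a toy sketch of generic top-k expert routing. Only the two counts come from the text above; the random router scores and the softmax-over-selected-experts weighting are assumptions for illustration, not a description of IBM's actual router.

```python
import math
import random

NUM_EXPERTS = 72  # expert count stated in the overview
TOP_K = 10        # experts active per token, stated in the overview


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This is a common textbook scheme, assumed here for illustration.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token (random stand-ins).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print(f"{len(selected)} of {NUM_EXPERTS} experts active for this token")
```

Because only the selected experts run for a given token, per-token compute tracks the active parameter count rather than the total.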
What it's good at
IBM tuned this release for instruction following and tool-calling, the two capabilities that matter most for agentic enterprise apps. The model handles function calling using OpenAI-style schemas, retrieval-augmented generation over supplied documents, summarization, classification, extraction, question answering, and code tasks including fill-in-the-middle completion. It supports 12 languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Most of the training data is English, so quality in the other languages can trail. The 128K-token context window covers long documents and multi-file code.
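As an illustration of what an OpenAI-style function schema looks like, the sketch below defines one tool and checks a model-emitted arguments string against its required fields. The tool name `get_current_weather`, its parameters, and the helper `missing_required_args` are all hypothetical and not taken from IBM's documentation.

```python
import json

# Hypothetical tool definition in the OpenAI-style "function" schema shape.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def missing_required_args(tool, arguments_json):
    """Return required parameter names absent from a JSON arguments string."""
    args = json.loads(arguments_json)
    required = tool["function"]["parameters"].get("required", [])
    return [name for name in required if name not in args]


tools = [get_weather_tool]
print(json.dumps(tools, indent=2))

# A call that omits the required "city" argument is flagged.
print(missing_required_args(get_weather_tool, '{"unit": "celsius"}'))
```

In transformers, a list like `tools` is typically passed through the `tools` argument of `tokenizer.apply_chat_template`. Whether this model's chat template expects exactly this shape should be checked against the model card.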
Running locally
The hybrid Mamba-2/MoE architecture is the practical draw here. Because only about 9B parameters are active per token and the Mamba-2 layers avoid a KV cache that grows with context length, IBM reports more than 70% lower memory use and roughly 2x faster inference than comparable dense models in long-context and multi-session scenarios. A 4-bit quantized build fits on a single 24 GB GPU; the full BF16 weights need around 64 GB (32B parameters at 2 bytes each). The model runs in transformers, vLLM, and llama.cpp, and quantized GGUFs work with Ollama.
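The weight-size figures above follow from simple arithmetic, sketched here as a rough estimate. It assumes a round 32B parameter count and counts weights only, ignoring activations, recurrent state, any attention cache, and the per-block metadata that real quantization formats add. The helper name `weight_memory_gb` is made up for this example.

```python
TOTAL_PARAMS = 32e9  # approximate total parameter count from the text


def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
# prints:
# BF16: 64 GB
# 8-bit: 32 GB
# 4-bit: 16 GB
```

The roughly 16 GB estimate for 4-bit weights is consistent with the claim that a quantized build fits on a 24 GB GPU, with the remainder available for runtime state and overhead.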
License
Granite-4.0-H-Small is released under Apache 2.0, which allows commercial use, modification, and redistribution without copyleft obligations. The Granite 4.0 weights are cryptographically signed, and the family is certified under ISO 42001 for responsible AI management.
