Granite-4.0-H-Small

IBM's 32B (9B active) hybrid Mamba-2/MoE instruct model with 128K context, strong tool-calling and multilingual support, under Apache 2.0.

pip install -U transformers accelerate torch
huggingface-cli download ibm-granite/granite-4.0-h-small

# Assumes an OpenAI-compatible server is already serving the model on localhost:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-4.0-h-small",
    "messages": [{"role": "user", "content": "What is retrieval augmented generation?"}]
  }'

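from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-4.0-h-small"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# device_map="cuda" places the whole model on the default GPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda")

chat = [{"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}]
# Render the conversation with the model's chat template and open the assistant turn
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))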

// Assumes an OpenAI-compatible server on localhost:8000; top-level await needs an ES module
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "not-needed" });
const res = await client.chat.completions.create({
  model: "ibm-granite/granite-4.0-h-small",
  messages: [{ role: "user", content: "Explain mixture-of-experts in two sentences." }],
});
console.log(res.choices[0].message.content);

At a glance

License: Apache 2.0
Architecture: Hybrid Mamba-2/transformer MoE, 32B total / 9B active
Context length: 128K tokens
Languages: 12 (English, German, Spanish, French, Japanese, Portuguese, and more)
Minimum hardware: ~24 GB VRAM (4-bit)
Strengths: instruction following, tool-calling, RAG, multilingual chat

Overview

Granite-4.0-H-Small is a 32-billion-parameter instruct model from IBM's Granite Team, released on October 2, 2025 as part of the Granite 4.0 language model family. It is derived from Granite-4.0-H-Small-Base through supervised finetuning, reinforcement-learning alignment, and model merging. Although it has 32B parameters in total, it is a Mixture-of-Experts model that activates only about 9B per token, routing each token to 10 of its 72 experts. The "H" denotes its hybrid Mamba-2/transformer design, which IBM positions for enterprise assistants and RAG workloads.
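To make the "72 experts, 10 active" figure concrete, here is a minimal sketch of top-k expert routing in plain Python. It illustrates the general technique only and is not IBM's implementation: the random router scores, the route_token helper, and the choice to softmax over the selected experts are all assumptions made for the example.

```python
import math
import random

NUM_EXPERTS = 72  # expert count from the overview above
TOP_K = 10        # experts active per token

def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and normalise their mixing weights.

    Hypothetical helper for illustration only.
    """
    # Indices of the k largest router scores
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected scores only (subtract the max for stability)
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

rng = random.Random(0)
# Stand-in for the scores a learned router would produce for one token
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

print(f"{len(selected)} of {NUM_EXPERTS} experts active")
print(f"mixing weights sum to {sum(w for _, w in selected):.3f}")
```

The experts that are not selected do no work for that token, which is why per-token compute tracks the active-parameter count rather than the total.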

What it's good at

IBM tuned this release for instruction following and tool-calling, the two capabilities that matter most for agentic enterprise apps. The model handles function calling using OpenAI-style schemas, retrieval-augmented generation over supplied documents, summarization, classification, extraction, question answering, and code tasks including fill-in-the-middle completion. It supports 12 languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese, though most training data is English and non-English quality can trail. The 128K-token context window covers long documents and multi-file code.
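As a rough illustration of function calling with an OpenAI-style schema, the sketch below builds a chat-completions request that offers the model one tool. It assumes the same OpenAI-compatible server at http://localhost:8000/v1 used in the curl example; the get_weather tool and the build_request and send helpers are hypothetical, and whether the server actually returns structured tool calls depends on how it is configured.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: same local server as the curl example
MODEL = "ibm-granite/granite-4.0-h-small"

# Hypothetical tool, described with an OpenAI-style JSON Schema
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

def build_request(user_message):
    """Build a chat-completions payload that offers the tool to the model."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
    }

def send(payload):
    """POST the payload to the server and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_request("What is the weather in Boston right now?")
print(json.dumps(payload, indent=2))
# With a server running, send(payload)["choices"][0]["message"] holds the reply;
# a model that decides to call the tool reports it under "tool_calls".
```

The application is responsible for executing the requested function and passing its result back in a follow-up message; the model only proposes the call.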

Running locally

The hybrid Mamba-2/MoE architecture is the practical draw here. Only about 9B parameters are active per token, which cuts compute, and the Mamba-2 layers keep a fixed-size state rather than a KV cache that grows with context. IBM reports more than 70% lower memory use and roughly 2x faster inference than comparable dense models in long-context and multi-session scenarios. All 32B parameters still have to be loaded: a 4-bit quantized build fits on a single 24 GB GPU, while full BF16 weights need around 64 GB. The model runs in transformers, vLLM, and llama.cpp, and quantized GGUFs work with Ollama.
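The 24 GB and 64 GB figures follow from simple arithmetic on the weight count. The sketch below is a back-of-the-envelope estimate, not a measurement: it counts weights only, uses decimal gigabytes, and treats 4-bit as exactly half a byte per parameter, whereas real quantization formats add some overhead for scales.

```python
# All experts must be resident in memory, not just the ~9B active per token
TOTAL_PARAMS = 32e9

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "4-bit": 0.5}

def weight_gb(params, bytes_per_param):
    """Weight-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>5}: {weight_gb(TOTAL_PARAMS, size):.0f} GB")
```

This gives 64 GB for BF16 and 16 GB for 4-bit weights. The gap between 16 GB and a 24 GB card is the headroom left for activations, cached state, and quantization overhead.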

License

Granite-4.0-H-Small is released under Apache 2.0, which allows commercial use, modification, and redistribution without copyleft obligations. The Granite 4.0 weights are cryptographically signed, and the family is certified under ISO 42001 for responsible AI management.

Frequently asked questions

How many parameters does Granite-4.0-H-Small have?

Granite-4.0-H-Small is a Mixture-of-Experts model with 32 billion total parameters, of which roughly 9 billion are active per token during inference. It routes each token to 10 of its 72 experts, keeping per-token compute closer to a 9B dense model than a full 32B one, although all 32B parameters must still be held in memory.

What is the context length of Granite-4.0-H-Small?

Granite-4.0-H-Small has a 128K-token context window. IBM trained the Granite 4.0 models on samples up to 512K tokens and validated performance up to 128K, so it handles long documents, multi-file code, and extended RAG inputs comfortably.

Is Granite-4.0-H-Small free for commercial use?

Yes. Granite-4.0-H-Small is released by IBM under the Apache 2.0 license, which permits commercial use, modification, and redistribution. The weights are publicly downloadable from Hugging Face, and the Granite 4.0 family is cryptographically signed and certified under ISO 42001.

What hardware do I need to run Granite-4.0-H-Small?

Only about 9B of its 32B parameters are active per token, and the hybrid Mamba-2/transformer design sharply cuts the memory that long contexts consume. All 32B parameters still have to be loaded: a 4-bit quantized build runs on a single 24 GB GPU, while the full BF16 weights need roughly 64 GB. IBM cites over 70% lower memory and about 2x faster inference than comparable dense models in long-context use.
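The long-context memory saving comes from how each layer type stores history. The sketch below contrasts an attention KV cache, which grows with every token, against a fixed-size recurrent state. All layer counts and dimensions here are hypothetical values chosen for illustration and are not Granite's actual configuration; because the model is a hybrid, any attention layers it contains still keep a cache.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values are stored for every past token in every attention layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def fixed_state_bytes(n_layers, state_values_per_layer, bytes_per_value=2):
    """A recurrent state-space layer keeps one fixed-size state at any context length."""
    return n_layers * state_values_per_layer * bytes_per_value

# Hypothetical dimensions for illustration; NOT Granite's actual configuration
ATTN = dict(n_layers=40, n_kv_heads=8, head_dim=128)
SSM = dict(n_layers=40, state_values_per_layer=1_000_000)

for tokens in (4_096, 32_768, 131_072):
    kv = kv_cache_bytes(tokens, **ATTN) / 1e9
    state = fixed_state_bytes(**SSM) / 1e9
    print(f"{tokens:>7} tokens: KV cache {kv:6.2f} GB, fixed state {state:.2f} GB")
```

Whatever the real dimensions are, the shape of the result is the same: the cache grows 32x between 4K and 128K tokens while the recurrent state does not change, and the effect multiplies across concurrent sessions.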

Which languages does Granite-4.0-H-Small support?

It officially supports 12 languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Most training data is English, so non-English performance can lag, but few-shot examples help. Users may also finetune it for additional languages.