Overview
GLM-4.7-Flash is a 31.2B-parameter open-weight model from zai-org (Z.ai), the team behind the GLM-4.5 and GLM-4.7 series. It uses a Mixture-of-Experts design — the "glm4_moe_lite" architecture — that pairs Multi-head Latent Attention with a streamlined expert layout, so only a small fraction of the 31.2B parameters fire on any given token. That keeps it fast while holding a 128K-token context window.
Because the weights are public and the model is small enough for a single GPU, you can run GLM-4.7-Flash entirely on your own machine through Atomic Chat. Nothing leaves your hardware: prompts, code, and documents stay local, and the model works offline once downloaded.
What it is good at
GLM-4.7-Flash ships with tool use, reasoning, and a thinking mode, which lines up with a few practical jobs:
- Local coding assistant — its
codecapability handles writing functions, refactoring, and debugging across files, with reported speeds around 60-100 tokens per second on consumer GPUs. - Agentic workflows — the
toolscapability lets it call functions and chain steps, so you can wire it into local agents that read files or hit APIs without sending data to a cloud provider. - Step-by-step reasoning — the thinking mode works through math, logic, and multi-part questions before answering, and the 128K context lets it reason over long documents or large code repositories in one pass.
Running it locally
The model is 31.2B parameters total with a 128K context window. A 4-bit quantization runs on roughly 16GB of VRAM (an RTX 3090 or 4090, or an M-series Mac), while full BF16 precision wants around 64GB. Pull the weights with the Hugging Face CLI:
huggingface-cli download zai-org/GLM-4.7-Flash
From there you can serve it with Transformers or vLLM, run a quantized GGUF in llama.cpp, or load it one-click in Atomic Chat and start chatting on-device.
License
GLM-4.7-Flash is released under the MIT license. That permits commercial use, modification, and redistribution, as long as you keep the copyright and license notice — among the most permissive terms an open-weight model can carry.
