GLM-4.7-Flash

Updated
25.06.2026
Tools
Thinking
Reasoning
Code

huggingface-cli download zai-org/GLM-4.7-Flash
from transformers import AutoModel
model = AutoModel.from_pretrained("zai-org/GLM-4.7-Flash")

More models

NameSize / UsageContextInput
GLM-5.1
198KText
GLM-5.2
1MText

At a glance

  • License: Mit
  • Context length: 128K tokens
  • Languages: Multilingual
  • Minimum hardware: ~64 GB VRAM
  • Strengths: fast MoE chat, coding and tool use

Overview

GLM-4.7-Flash is a 31.2B-parameter open-weight model from zai-org (Z.ai), the team behind the GLM-4.5 and GLM-4.7 series. It uses a Mixture-of-Experts design — the "glm4_moe_lite" architecture — that pairs Multi-head Latent Attention with a streamlined expert layout, so only a small fraction of the 31.2B parameters fire on any given token. That keeps it fast while holding a 128K-token context window.

Because the weights are public and the model is small enough for a single GPU, you can run GLM-4.7-Flash entirely on your own machine through Atomic Chat. Nothing leaves your hardware: prompts, code, and documents stay local, and the model works offline once downloaded.

What it is good at

GLM-4.7-Flash ships with tool use, reasoning, and a thinking mode, which lines up with a few practical jobs:

  • Local coding assistant — its code capability handles writing functions, refactoring, and debugging across files, with reported speeds around 60-100 tokens per second on consumer GPUs.
  • Agentic workflows — the tools capability lets it call functions and chain steps, so you can wire it into local agents that read files or hit APIs without sending data to a cloud provider.
  • Step-by-step reasoning — the thinking mode works through math, logic, and multi-part questions before answering, and the 128K context lets it reason over long documents or large code repositories in one pass.

Running it locally

The model is 31.2B parameters total with a 128K context window. A 4-bit quantization runs on roughly 16GB of VRAM (an RTX 3090 or 4090, or an M-series Mac), while full BF16 precision wants around 64GB. Pull the weights with the Hugging Face CLI:

huggingface-cli download zai-org/GLM-4.7-Flash

From there you can serve it with Transformers or vLLM, run a quantized GGUF in llama.cpp, or load it one-click in Atomic Chat and start chatting on-device.

License

GLM-4.7-Flash is released under the MIT license. That permits commercial use, modification, and redistribution, as long as you keep the copyright and license notice — among the most permissive terms an open-weight model can carry.

Desktop
macOS
(M1 or better)
Download
Windows
(x64)
Download
Linux
(x86_64)
Download

Frequently asked questions

GLM-4.7-Flash is a 31.2B-parameter open-weight language model from zai-org (Z.ai), the smallest member of the GLM-4.7 family. It uses a lightweight Mixture-of-Experts architecture and supports a 128K-token context, with capabilities for code, reasoning, tool calling, and a thinking mode. It is positioned as a fast option for coding and agentic work that can run on a single GPU.

A 4-bit quantized version runs on about 16GB of VRAM, which covers cards like the RTX 3090 and 4090 or an M-series Mac with enough unified memory. Running it at full BF16 precision needs roughly 64GB. Thanks to the MoE design you can also offload to system RAM and trade speed for lower VRAM use.

Yes. The weights are released under the MIT license, so you can download and run them at no cost. Running the model locally in Atomic Chat means there are no API fees or per-token charges — you only pay for your own hardware and electricity.

Yes. Once you download the weights, GLM-4.7-Flash runs fully on-device with no internet connection required. Your prompts, code, and files never leave your machine, which makes it suitable for private or air-gapped work.

It is built for coding, reasoning, and agentic tasks. The code capability suits writing and debugging across files, the tools capability lets it call functions for local agent setups, and the thinking mode plus 128K context handle multi-step reasoning over long inputs. Reported throughput is around 60-100 tokens per second on consumer GPUs.