GLM-5.1

Updated
25.06.2026
Tools
Thinking
Reasoning
Code

huggingface-cli download zai-org/GLM-5.1
from transformers import AutoModel
model = AutoModel.from_pretrained("zai-org/GLM-5.1")

More models

NameSize / UsageContextInput
GLM-5.2
1MText
GLM-4.7-Flash
128KText

At a glance

  • License: Mit
  • Context length: 198K tokens
  • Languages: en, zh
  • Minimum hardware: ~422 GB VRAM
  • Strengths: reasoning and on-device inference

Overview

GLM-5.1 is a 753.9B-parameter Mixture-of-Experts model from zai-org (Z.AI). The MoE design (256 routed experts with top-8 routing plus one shared expert) activates roughly 40B parameters per token, so the model reasons with a large knowledge base while keeping each forward pass cheaper than a dense model of the same size. Thinking mode is on by default, and the model is tuned for agentic coding work — it posts a 58.4 on SWE-Bench Pro.

The local-AI angle is straightforward: the weights are public on Hugging Face, so you can pull GLM-5.1 onto your own hardware and run it privately, offline, with no prompts leaving your machine. Atomic Chat loads open-weight models like this on-device, so your conversations and code never touch a third-party server.

What it is good at

GLM-5.1's strengths line up with its stated capabilities — tools, thinking, reasoning, and code. The model holds up across long, multi-step tasks rather than single-shot answers.

  • Agentic coding — built for repo-level work: editing across files, running experiments, and finding blockers, which is what drives its SWE-Bench Pro and Terminal-Bench results.
  • Tool calling — native support for invoking functions and external tools, so it can drive an agent loop that reads files, runs commands, and acts on the output.
  • Long-context reasoning — the 198K window lets it keep a large codebase, spec, or document set in working memory while it thinks through a problem step by step.

Running it locally

This is a heavy model. At 753.9B total parameters, the FP8 checkpoint needs around 860GB of memory to serve, and full precision pushes past 1.5TB — multi-GPU or cluster territory. Quantized GGUF builds bring it down sharply: 4-bit lands near 476GB, and aggressive 2-bit dynamic quants fit in roughly 240GB, which a 256GB Mac Studio or a multi-card workstation can hold. Expect a few tokens per second on consumer hardware. The context window runs to 198K tokens.

huggingface-cli download zai-org/GLM-5.1

From there you can serve it with vLLM or SGLang (both have GLM-5.1 recipes), load a quantized GGUF through llama.cpp, or open it in Atomic Chat with one click and run it fully on-device.

License

GLM-5.1 is released under the MIT license, one of the most permissive options available. You can download, modify, fine-tune, and deploy it commercially with no royalty fees or usage restrictions, which is what makes self-hosting it in Atomic Chat possible.

Desktop
macOS
(M1 or better)
Download
Windows
(x64)
Download
Linux
(x86_64)
Download

Frequently asked questions

GLM-5.1 is an open-weight large language model from zai-org (Z.AI), built as a 753.9B-parameter Mixture-of-Experts with about 40B parameters active per token. It is tuned for agentic coding and long-horizon reasoning, and it scores 58.4 on SWE-Bench Pro. The weights are public on Hugging Face under the MIT license, so it can run locally in apps like Atomic Chat.

The full FP8 checkpoint needs roughly 860GB of memory, and full precision exceeds 1.5TB, which means a multi-GPU rig or cluster. Quantized GGUF builds cut this down a lot: 4-bit is around 476GB and a 2-bit dynamic quant fits in about 240GB, runnable on a 256GB Mac Studio or a multi-card workstation. On consumer hardware expect a few tokens per second.

Yes. GLM-5.1 is released under the MIT license, so you can download, fine-tune, and deploy it at no cost. The license also permits commercial use with no royalties or usage restrictions. The only practical cost is the hardware needed to run a model of this size.

Yes, once the weights are downloaded the model runs entirely on your own machine with no internet connection. Loading it through Atomic Chat keeps every prompt and response on-device, so nothing is sent to an external server. The main constraint is having enough memory for the model or a sufficiently quantized version.

It is strongest at agentic coding — working across a repository, calling tools, running experiments, and solving multi-step problems, which is reflected in its SWE-Bench Pro score. Native tool calling and a 198K context window also make it well suited to long-context reasoning over large codebases or document sets. Thinking mode is enabled by default for these tasks.