TurboQuant

Atomic Chat uses Google’s TurboQuant to compress the KV cache, the memory a model builds while it’s running. The cache takes about 6× less RAM on average, so larger models and longer sessions fit on your machine.

Free · Open-source · macOS, Windows & Linux
6× less KV cache memory
3 bits per KV value (was 16-bit)
0 accuracy loss on benchmarks
1,000+ models supported

Run 1000+ models locally

Llama · Qwen · Mistral · DeepSeek · Gemma · Ollama · Phi · Hugging Face · gpt-oss · Command R · Yi · LM Studio · Grok · Kimi · GLM · Nemotron · StableLM · Granite · MiniMax · InternLM · Falcon · DBRX

What is offline AI?

Offline AI is a language model that runs directly on your own device instead of a remote server. You download a model once — then it answers with the internet off, and nothing you type ever leaves your machine.

A bigger model, a better answer

Larger models give sharper answers on hard problems. TurboQuant cuts their runtime memory, so the session runs further before hitting RAM limits.

Long context without the crash

Sessions that would otherwise crash — long documents, hours of conversation — stay in RAM. TurboQuant compresses context as it builds.

Runs on CPU. No GPU needed.

Your data stays put. TurboQuant runs locally, in the same process as the model.

How TurboQuant works

What fills your RAM during a long session, and why 3-bit compression fixes it without losing accuracy.

Every conversation fills RAM

When a model generates a reply, it builds a KV cache: a record of key and value vectors for every token in the session. At 10,000 tokens that can add several GB on top of the model weights. Without compression, the cache keeps growing until the session runs out of RAM and crashes. TurboQuant compresses each entry as the cache builds, so the session keeps running.
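As a rough illustration of why the cache reaches several GB, the sketch below estimates KV cache size from a model's shape. The layer count, head count, and head dimension are hypothetical values picked for the example, not the specification of any model Atomic Chat ships.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Approximate KV cache size: one key and one value vector per token, per layer."""
    values_per_token = 2 * layers * kv_heads * head_dim  # 2 = key + value
    return tokens * values_per_token * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(layers=32, kv_heads=32, head_dim=128)

fp16_bytes = kv_cache_bytes(10_000, bits_per_value=16, **SHAPE)
print(f"16-bit cache at 10,000 tokens: {fp16_bytes / 2**30:.2f} GiB")  # 4.88 GiB
```

The cache grows linearly with the token count, so doubling the session length doubles this figure.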


Each KV value: from 16 bits to 3

The first stage converts each stored value from a 16-bit float to 3 bits, which is compact but imprecise on its own. The second stage adds a 1-bit correction pass that recovers most of what the first stage loses. KV cache memory falls by about 6×, with benchmark accuracy unchanged.
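To make the idea of a coarse code plus a 1-bit correction concrete, here is a toy sketch. It is not TurboQuant's actual algorithm: it uses a plain uniform quantizer and a residual sign bit, and every function name here is made up for illustration.

```python
def quantize_3bit(x, lo=-1.0, hi=1.0):
    """Stage 1 (toy): snap x to the centre of one of 8 equal cells (3 bits)."""
    step = (hi - lo) / 8
    code = min(7, max(0, int((x - lo) / step)))
    return code, lo + (code + 0.5) * step  # the code and its reconstructed value

def residual_bit(x, approx):
    """Stage 2 (toy): one extra bit, recording which side of approx x fell on."""
    return 1 if x >= approx else 0

def reconstruct(code, bit, lo=-1.0, hi=1.0):
    """Rebuild the value from the 3-bit code, nudged by the correction bit."""
    step = (hi - lo) / 8
    centre = lo + (code + 0.5) * step
    return centre + (step / 4 if bit else -step / 4)

x = 0.3
code, coarse = quantize_3bit(x)
bit = residual_bit(x, coarse)
print(f"stage 1 error: {abs(x - coarse):.4f}")
print(f"with correction bit: {abs(x - reconstruct(code, bit)):.4f}")
```

In this toy the correction bit halves the worst-case error. The real second stage works on whole vectors rather than single values, so treat this only as intuition.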


The full context window, in under 2GB

Every token you add to a conversation adds to the KV cache. At 100k tokens, the cache alone can exceed your available RAM. TurboQuant compresses it by up to 6×, making long-context sessions practical on standard hardware.
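Turning the question around, the sketch below estimates how many tokens fit in a fixed cache budget at 16 bits versus 3 bits per value. The model shape is a hypothetical example, and the 3-bit figure ignores any per-block metadata a real implementation may store.

```python
def max_tokens(budget_bytes, layers, kv_heads, head_dim, bits_per_value):
    """Number of whole tokens whose keys and values fit in budget_bytes."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bits_per_value / 8
    return int(budget_bytes // bytes_per_token)

# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(layers=32, kv_heads=8, head_dim=128)
BUDGET = 2 * 2**30  # 2 GiB reserved for the KV cache

print(max_tokens(BUDGET, bits_per_value=16, **SHAPE))  # 16384
print(max_tokens(BUDGET, bits_per_value=3, **SHAPE))   # 87381
```

Under these assumptions the same 2 GiB holds several times more context once each value shrinks from 16 bits to 3.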

Google Research paper →

TurboQuant vs no TurboQuant

LM Studio, Ollama, and Jan support GGUF quantization. None ship TurboQuant’s 3-bit KV cache compression. The gap shows when models are large or sessions run long.

Atomic Chat · TurboQuant
  • Longer sessions on any model that fits your RAM
  • 128k context with significantly less RAM used
  • Works on any CPU, no GPU needed
  • Zero accuracy loss on 5 standard benchmarks
LM Studio, Ollama, Jan · no TurboQuant
  • Larger models exceed available RAM
  • Sessions get slower the longer you chat
  • Session crashes when RAM runs out
  • Need more RAM or a smaller model — not both

TurboQuant takes three steps

Step 1

Download & install

Free for macOS, Windows and Linux. No account needed.

Step 2

Pick a model

Choose from 1000+ models — it downloads to your disk once.

Step 3

Chat. TurboQuant is already on.

No toggle, no settings. KV cache compression runs from the first message.

What bigger models unlock

Analyze a 50-page contract in one session

Paste the full document and ask questions throughout. The model doesn’t lose track of clause 3 by the time you reach clause 30.

Load an entire codebase as context

Point it at a full codebase — thousands of files, the whole project. TurboQuant keeps the growing context in RAM without slowing down.

Sensitive data, capable model

Some data can't go to a cloud API. The workaround used to be a smaller, weaker local model. TurboQuant keeps runtime memory in check, so a capable quantized model stays within your RAM limits.

Download to your device

Free, open-source, and fully native — running locally on your own hardware.

macOS
13+ · Apple Silicon
Windows
x64
Linux
x86_64
iOS
App Store
Android
Google Play

Desktop builds v1.1.99 · Free & open-source under Apache-2.0

FAQ

Everything about running larger models without changing hardware

What is TurboQuant?

A KV cache compression algorithm from Google Research, published at ICLR 2026. It reduces the working memory a language model uses during a conversation from 16 bits per value to approximately 3, roughly a 6× reduction in KV cache memory. Atomic Chat applies it automatically to every model you run.

What hardware do I need to run a 34B model?

A 34B model in Q4_K_M GGUF format uses roughly 20GB for its weights. With TurboQuant keeping the KV cache compressed, a MacBook Pro or MacBook Air with 24GB handles it well. 36GB gives more headroom. For 16GB machines, models up to 13B run comfortably.
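The weight figure can be checked with back-of-envelope arithmetic. The sketch below assumes an average of about 4.8 bits per weight for Q4_K_M, which is an approximation; the real average varies by model and is not stated on this page.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

ASSUMED_Q4_K_M_BITS = 4.8  # assumption: rough average, varies per model

size = weights_gb(34, ASSUMED_Q4_K_M_BITS)
print(f"34B weights: about {size:.1f} GB")
print(f"left on a 24 GB machine: about {24 - size:.1f} GB for the KV cache, OS and apps")
```

Whatever remains after the weights is shared between the KV cache and everything else running on the machine, which is why cache size decides how long a session can get.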

Does TurboQuant slow down responses?

No. TurboQuant compresses the KV cache, so less data moves through memory on each token. Memory bandwidth is the main bottleneck for local LLMs, which means responses stay the same speed or get slightly faster, especially in long chats. The compression itself costs almost nothing.
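A simplified way to see the bandwidth argument: if generating one token means reading all the weights and the whole KV cache once, memory bandwidth sets an upper bound on speed. The numbers below are hypothetical, and the model ignores compute time and decompression cost.

```python
def tokens_per_second(bandwidth_gb_s, weights_gb, kv_cache_gb):
    """Bandwidth-bound ceiling: each token reads weights + KV cache once."""
    return bandwidth_gb_s / (weights_gb + kv_cache_gb)

# Hypothetical machine and session, for illustration only.
BANDWIDTH = 100.0  # GB/s
WEIGHTS = 20.0     # GB

uncompressed = tokens_per_second(BANDWIDTH, WEIGHTS, kv_cache_gb=6.0)
compressed = tokens_per_second(BANDWIDTH, WEIGHTS, kv_cache_gb=1.0)
print(f"uncompressed cache: {uncompressed:.2f} tokens/s ceiling")
print(f"compressed cache:   {compressed:.2f} tokens/s ceiling")
```

The gain is small while the cache is tiny compared with the weights and grows as the session gets longer.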

Does TurboQuant reduce accuracy?

No. Google tested it on LongBench, Needle-in-Haystack, ZeroSCROLLS, RULER, and L-Eval using Gemma, Llama, and Mistral. Accuracy was indistinguishable from the uncompressed baseline across all tasks.

Is TurboQuant the same as GGUF quantization?

No. GGUF quantization compresses model weights and is applied once, when the file is created. TurboQuant compresses the KV cache at runtime while the model is generating. They solve different bottlenecks and both run simultaneously in Atomic Chat.

How do I turn TurboQuant on?

Nothing to set up. Atomic Chat applies TurboQuant automatically when you load any model.

Which models does TurboQuant work with?

Any model that runs in Atomic Chat: Qwen, Gemma, DeepSeek, Llama, Mistral and the rest. TurboQuant doesn’t modify model weights, so it applies to every GGUF model you download.

How does TurboQuant work under the hood?

TurboQuant runs in two stages. PolarQuant maps KV vectors into polar coordinates and quantizes to 3 bits. QJL adds 1-bit error correction using a randomized Hadamard transform. The combination achieves near-lossless compression where naive 3-bit quantization would degrade output noticeably.
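For readers unfamiliar with the term, the sketch below shows a generic randomized Hadamard transform followed by 1-bit sign quantization. It illustrates the building block only and is not the QJL or TurboQuant implementation; the function names are made up, and the input length must be a power of two.

```python
import math
import random

def fwht(v):
    """Fast Walsh-Hadamard transform, scaled so vector length is preserved."""
    out = list(v)
    h = 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(len(out))
    return [x * scale for x in out]

def randomized_hadamard(v, signs):
    """Flip each coordinate by a random sign, then apply the transform."""
    return fwht([s * x for s, x in zip(signs, v)])

def sign_bits(v):
    """Keep one bit per coordinate: 1 for non-negative, 0 for negative."""
    return [1 if x >= 0 else 0 for x in v]

rng = random.Random(0)
signs = [rng.choice((-1, 1)) for _ in range(8)]
spiky = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rotated = randomized_hadamard(spiky, signs)
print(rotated)
print(sign_bits(rotated))
```

The transform spreads a single large coordinate evenly across all eight, which is what makes very low-bit codes less lossy on outlier-heavy vectors.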

Can I run a 70B model?

A 70B model in Q4 GGUF needs around 38–40GB for its weights. TurboQuant handles the KV cache on top of that, but the base weights still need to fit in RAM. For 70B you need 48GB or more of unified memory. On 64GB machines it works well.

Built in the open

Follow the project, file issues, and chat with the people building Atomic Chat.

Stop paying for AI.
Own it.
