Self-Hosted LLM on macOS: Which Models Run Fast on Mac (2026)

Usually users point the finger at “long and hard setup” as the main problem that holds them back from using local LLM. But the setup got easier and there is a way to delegate the whole process (we're telling further in the text).

So now it seems that reasons to run a self-hosted LLM are clear: private and offline, even free in some way – sounds tempting, but not really convincing to make an informed decision to come down from the clouds.

Then why should you run self-hosting LLM? And if yes, which model to run? We chose our favourites and ran some tests to see find it out.

How Self-Hosted LLM works

Brief reminder on what is self-hosted LLM, if you’re familiar with the theory – you can skip this par to see models and our tests.

When you use ChatGPT or Claude, your message leaves your device, travels to a remote server, gets processed there, and the response comes back. The model, the computer, and your data are all on someone else's infrastructure.

With a self-hosted LLM, all of that happens on your machine:

You type a message
Your CPU or GPU processes it
You get a response

Nothing leaves your device and the model runs as a local process, the same way any other application does.

The software layer looks like this: a model file (usually GGUF format) sits on your disk. An inference engine (most commonly llama.cpp or Ollama) loads it into RAM or VRAM and runs it. Ollama exposes an OpenAI-compatible API at localhost:11434, which means any tool that already works with the OpenAI API (LangChain, n8n, Open WebUI, your own scripts) can point at your local model instead with one config change.

The main constraint is your hardware. The model still has to fit in memory and now it’s your device, and inference speed depends on how fast your hardware moves data. ‘

Is Self-Hosting an LLM worth it?

When it makes sense

Privacy and costs

The clearest case is data you can't send out: client contracts, medical records, internal code – the model runs as a local process, and no network requests leave your machine.

The second case is volume. Per-token API costs add up fast at scale:

Model	Output cost per 1M tokens
GPT-5.5	$30.00
Claude Opus 4.8	$25.00
Claude Sonnet 4.6	$15.00
Gemini Flash	$0.40
Qwen 3.6 35B (Apple Silicon)	~$0.025

A few less obvious reasons:

Model version stability

Cloud APIs update models without warning: a prompt that worked in March may behave differently in June – this is why we saw this heated wave about Opus 4.7 instability on X.com before Opus 4.8’s launch.

Self-hosting pins the version and it doesn't change unless you change it. For anything production-facing, silent behavior drift is a real problem.

Rate limits are a product problem

If you're building on top of an API and your users hit the ceiling, your product goes down. Local models have no rate limits beyond your hardware.

Fine-tuning. Cloud APIs don't let you modify weights. If you need the model to learn your domain's terminology or output format at the model level — not via system prompt — self-hosting is the only path.

Uncensored behavior. Some use cases need a model without safety filters: security research, legal document analysis, red-teaming. Open-weight models can be run without restrictions.

When it doesn't

Local 30B models are not GPT-5.5 or Claude Opus 4.8 and the gap shows up in large and tricky tasks:

Debugging a non-obvious concurrency issue across a large unfamiliar codebase.
Writing a prompt that needs to work across ten edge cases without iteration

Or any task where you've given it three tries and it still misses the constraint you stated in message one.

Technically, you can run models locally that approach frontier quality: DeepSeek R1 (full) is 671B parameters and genuinely competitive on reasoning benchmarks. It's also ~390GB at Q4, which means you need an M4 Ultra with 192GB RAM or a multi-machine setup. This is not a customer-friendly possibility – barely anyone either has that much, or longing to have it specifically for local AI.

Two other situations where self-hosting loses:

Early-stage prototyping, where you want to move fast and the API is just faster;
Teams with no one willing to own the maintenance: model updates, quantization format changes, inference engine upgrades.

Hardware reality check for macOS

Local LLMs are less hardware-constrained than most people assume. Here's what different hardware tiers realistically get you:

Model size	Minimum RAM	Comfortable	Example hardware	Good for
7B	8GB	16GB	M2 MacBook Air	Quick Q&A, short summaries: noticeably limited on multi-step tasks
14B	16GB	24GB	M2 Pro MacBook Pro	Everyday tasks: short code, document Q&A, summarization
27–31B	32GB	64GB	M3 Max, M4 Pro	Practical coding, long-form writing, document analysis – perfect for daily work
35B	48GB	64GB	M3 Max, M5 Max	Same as 27–31B but faster; best general-purpose tier if you have the RAM
70B	64GB	80GB+	M3 Ultra, M4 Max	Complex reasoning, long codebases, best local quality available

Keep in mind:

The underlying reason hardware matters: local models generate one token at a time, and for each token the engine has to read the entire model from memory. So speed depends on how fast your hardware can move data around, not on raw processing power.

16 GB of RAM or VRAM is a comfortable starting point – you can run 14B dense models or MoE models like Qwen 3.6-35B-A3B, which is enough for:

general chat and Q&A;
writing and editing;
code generation and review;
document processing on moderate-length files.

With 24 GB or more, you move into 32B territory and unlock tasks that actually require stronger reasoning:

complex multi-step analysis;
longer context without degradation (fitting more text in one request + how model remembers the context);
multimodal input (sending a screenshot or image);
heavier RAG pipelines – where you're uploading large document chunks alongside conversation history.

Best Self-Hosted models for macOS: June, 2026

Model	Disk (Q4_K_M)	Speed	RAM	Commercial use	Best for
Gemma 4 31B	~19GB	27 tok/sec	32GB	Yes	One-shot coding, structured tasks
Qwen 3.6 27B	~16GB	16–32 tok/sec	32GB	Yes	Writing, Q&A, long-form content
Qwen 3.6 35B	~22GB	72 tok/sec	48GB	Yes	General daily use, speed-sensitive work
Gemma 4 12B	~18GB	~70 tok/sec	16GB	Yes (Apache 2.0)	Multimodal tasks, 16GB machines
Llama 4 Scout	~55GB	~32 tok/sec	64GB	Yes*	Long-context reasoning, agentic workflows

Qwen 3.6 27B

Generates more than you asked for, which is sometimes exactly what you want. Good for tasks where you'd rather have too much than too little: writing, Q&A, explanations.

Hardware tier: 32GB minimum. Same tier as Gemma 4 31B, so the choice between them is about the task, not the hardware.
Best for: Writing, summarization, detailed Q&A, creative work. If you're drafting content, building a knowledge base, or want the model to explain something thoroughly, Qwen 27B is the better default.
Not good for: Structured coding tasks where token count affects correctness. More output is not more quality. In our test it generated 5x more code than Gemma on the same prompt and the game logic broke.

Qwen 3.6 35B

The fastest model in our tests at 72 tok/sec, despite being larger than the 27B. MoE architecture activates only a subset of parameters per token, so it moves less data per inference step than a dense 27B. It finished a car game prompt in 1m 52s vs 5m 12s for the 27B.

Hardware tier: 48GB minimum (M3 Max 48GB, M5 Max 64GB). Going from 32GB to 48GB is the only RAM upgrade that unlocks a meaningfully faster model class.
Best for: General daily use where speed matters. At 72 tok/sec you're not watching a progress bar. Better than the 27B at most tasks if you have the RAM.
Not good for: Very short tasks where the MoE routing overhead shows up. Also not worth it if you're on 32GB (the model simply won't load).

Qwen 3.6 27B vs Qwen 3.6 35B: what’s the difference

Both models below are almost the same: they are from Alibaba, both are part of the same Qwen 3.6 series, both are Apache 2.0, and both run well on Apple Silicon. 27B activates all its parameters on every token, 35B activates only a subset. That's why 35B is faster, but it skips steps the 27B would catch.

We ran them head-to-head to see when that matters.

Task: draw waves using HTML, one prompt, MacBook Pro M5 Max 64GB, TurboQuant enabled on both.

Compared Qwen3.6 35B and 27B in the same conditions with Google TurboQuant

Device: MacBook Pro M5Max 64GB RAM

Outputs characteristics:
Qwen3.6 35B: 6672 tokens, 2m 10s, 65 tok/s
Qwen3.6 27B: 7344 tokens, 5m 22s, 24 tok/s

Conclusion: Both models were asked to draw waves using… pic.twitter.com/RMXhR4EUFj
— atomic.chat (@atomic_chat_hq) April 22, 2026

Result:

The 35B finished in 2m 10s at 65 tok/sec vs the 27B took 5m 22s at 24 tok/sec.
Despite high speed, the 35B output was messier: jagged wave edges and jittery animation spoil the overall result. Meanwhile, the 27B output was cleaner and more consistent.

MoE activates a fraction of weights per token and commits quickly; the dense 27B reasons more thoroughly before generating. On tasks with visual or structural logic (layouts, animations, anything requiring a plan) that difference shows up in the output.

If you need the output to follow a structure, hold visual consistency, or work through a multi-step plan, choose 27B. And if you need a fast response that you can iterate several times with a model keeping its train of thought – choose 35B.

Gemma 4 12B

Google's newest small model, released June 3, 2026. The only model in this list that handles text, images, audio, and video natively. Google claims benchmark scores close to the Gemma 4 27B at less than half the memory footprint.

Hardware tier: 16GB RAM, the only practical option on a standard MacBook Air. At Q4 it fits in ~7GB VRAM, so it also runs on older NVIDIA cards (RTX 3080 and up).
Best for: Machines that can't run 30B+ models. Short code, document Q&A, multimodal tasks (image analysis, audio transcription). A 16GB Mac running Gemma 4 12B is more useful than a 16GB Mac running nothing.
Not good for: Multi-step reasoning or anything requiring sustained coherence across hundreds of lines. The gap with 30B+ models is noticeable once tasks get complex. If you can run a 30B model, you should.

We tested it on an RTX 4090 on a demanding prompt: write a self-contained HTML5 canvas animation with real physics, no libraries, three scenes: a Galton board, two blocks colliding off a wall, and a triple pendulum. It produced 8,900 tokens at 80 tok/s using 9GB VRAM. All three scenes rendered. For a 12B model on 9GB VRAM, that's a strong result.

New Google Gemma 4 12B claims near-26B performance - we tested both!

We ran both models locally on one RTX 4090 and gave each the same task: write a self-contained HTML5 canvas animation with real physics in one file without libraries. Three scenes - a Galton board, two blocks… pic.twitter.com/Zy04PD12GR
— atomic.chat (@atomic_chat_hq) June 3, 2026

If your machine tops out at 16GB RAM or VRAM, 12B is the right pick. It handles more than its size suggests.

Gemma 4 31B

The model stops when it has enough, not when it runs out of things to say. In our Pac-Man test it produced working game logic in 3m 51s using 6,209 tokens. Qwen took 18 minutes and 33,946 tokens for the same prompt, and the output broke.

Hardware tier: 32GB minimum (M3 Max, M4 Pro, M5 Max). Works at 32GB but 64GB gives headroom if you're running other apps alongside.
Best for: One-shot coding, structured output, tasks where correctness matters more than length. Wall collisions, click handlers, API response parsing. Anything where "it works on the first run" is the bar.
Not good for: Long-form writing or anything requiring elaboration. The conciseness that makes it good at code makes it thin on explanations. Ask it for a 2,000-word article and you'll get 800 words.

Llama 4 Scout

109B total parameters, 17B active per token. The MoE design means inference speed is closer to a 17B dense model, but the quality benefit comes from having access to a much larger pool of expert weights. Context window: 10 million tokens.

Hardware tier: 64GB RAM minimum (~55GB on disk at Q4_K_M). Tight on a 64GB Mac but works. Comfortable on M3 Ultra or M4 Max with 80GB+.
Best for: Complex reasoning, long documents, agentic workflows where 30B models lose the thread. If Qwen 35B or Gemma 4 31B keeps breaking on multi-step tasks, Scout is where to go next.
Not good for: 32GB or 48GB machines: it won't load. Also not necessary for straightforward coding or writing where Gemma 4 31B or Qwen 3.6 35B do the same job at a fraction of the RAM cost.

How fast are Local LLMs: our tests

Benchmark leaderboards usually score models on fixed test sets: multiple choice, short function completions, summarization with known answers. Speed isn't one of the metrics, and almost none of the tasks require holding a plan together past a few hundred tokens.

We tested tasks closer to real use on Atomic Chat: generate working code from a single prompt, no follow-up and no corrections. The reason is that a game prompt forces the model to:

Understand a full spec;
Hold logic together across hundreds of lines;
Produce code that actually runs.

Those are the same things you need for any practical task: a script, a component, a working feature.

Test 1: One-shot Pac-Man game

/1 Gemma 4 31B just crushed Qwen 3.6 27B in a local LLM gamedev contest inside @atomic_chat_hq (prompt is below)

Device: MacBook Pro M5 Max, 64GB RAM

Results:
Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens
Gemma 4 31B: 27 tokens/sec · 3m 51s · 6,209 tokens

So what is… pic.twitter.com/wqyWyjXX2u
— Chubby♨️ (@kimmonismus) April 30, 2026

Hardware: MacBook M5 Max, 64GB RAM.
Task: to write a complete, playable Pac-Man game in a single prompt.
What we measured: gfeneration speed, total tokens, and whether the output actually worked.

Model	Speed	Time	Tokens
Qwen 3.6 27B	32 tok/sec	18m 04s	33,946
Gemma 4 31B	27 tok/sec	3m 51s	6,209

Gemma finished in under 4 minutes. Qwen took 18. Gemma also won on output quality, more on that below.

Test 2 — HTML/Canvas car game with physics and parallax scrolling

HOLY MOLY running a 35B model locally on a MacBook shouldn’t be THIS FAST 🤯

Spent my weekend in @atomic_chat_hq testing Qwen 35B vs. Qwen 27B on my local machine.

I had them generate a fully animated HTML/Canvas car mini-game (demo below),

... and both models breezed through… pic.twitter.com/WXboJCSDZn
— Charly Wargnier (@DataChaz) April 28, 2026

Hardware: MacBook (Apple Silicon).
Task: to generate a fully animated HTML/Canvas car mini-game including vehicle physics and parallax background scrolling, one prompt.
What we measured: whether the larger MoE model was worth the extra RAM requirement.

Model	Speed	Time	Tokens
Qwen 3.6 27B	16 tok/sec	5m 12s	5,623
Qwen 3.6 35B	72 tok/sec	1m 52s	8,070

The 35B ran at 72 tok/sec while the 27B ran at 16: 4.5x faster, despite having more total parameters. That's the MoE architecture: only a subset of weights activate per token. Atomic Chat also runs TurboQuant under the hood, which compresses the KV cache specifically for Apple Silicon.

Atomic Chat vs standard llama.cpp (MacBook M5 Max, Gemma 4 26B)

Multi-Token Prediction (MTP) for LLaMA.cpp!

Running Gemma4 local model 1.5x faster.

We patched LLaMA.cpp. Quantized Gemma 4 assistant models into GGUF format. We ran tests on a MacBook Pro M5Max. Gemma 26B with MTP drafts tokens 40% faster. Benchmarks, source code and models 👇 pic.twitter.com/hHH1cu1jLi
— atomic.chat (@atomic_chat_hq) May 7, 2026

The two tests above ran in Atomic Chat, not unpatched llama.cpp. To get MTP working, Atomic Chat patched llama.cpp directly and quantized Gemma 4 assistant models into GGUF format. On a single Fibonacci generation task, Gemma 26B went from 73 tok/sec (unpatched) to 121 tok/sec. Across longer generation tasks on M5 Max:

Multi-Token Prediction gives a 40-62% speed increase on the same hardware.

Tokens Per Second vs Output Quality: which matters more?

In Test 1, Gemma 4 31B was slower than Qwen 27B (27 vs 32 tok/sec) and generated far fewer tokens (6,209 vs 33,946) – but still won. Wall collisions worked, ghost interactions were correct, click responses were smooth. Meanwhile, Qwen produced nearly 5x more output, more creative and visually ambitious, but the game logic broke.

Token/sec measures how fast the model runs – but when it comes to the quality, it can’t say for sure.

For coding, what matters is whether the output runs on the first try. Our test: 6,000 correct tokens beat 34,000 broken ones.

What can you do with a Self-Hosted LLM?

The clearest use cases: code generation and document analysis. Gemma 4 31B on a 32GB Mac produced working game code from a single prompt: no iteration, no cleanup.

For document work: contracts, internal docs, PDFs – use Qwen 3.6 27B. It handles long-form Q&A well. And Llama 4 Scout's 10M token context means you can load an entire document set without chunking. Nothing leaves the machine either way. For batch work (running the same extraction or summarization task across hundreds of files), any of the 30B+ models will run overnight on a Mac Mini without hitting rate limits or accumulating token costs.

RAG pipelines are another strong fit: query a local knowledge base with Qwen 3.6 35B on 48GB, get answers from your own data, no API involved.

Three cases where frontier APIs still win:

The task needs information from after the model's training cutoff;
You're tracing logic through a large codebase you didn't write;
The reasoning chain runs long enough that a 30B model loses track of what it established a few steps back.

Self-hosting works when the task is repetitive, the data is sensitive, or you've tested it and the output is good enough.

How to run Self-Hosted LLMs: multiple models, no Terminal

Getting one model running takes 10 minutes – even Ollama has its own CLI now. But when it comes to managing several (which is the best strategy): Gemma for code, Qwen for writing, Gemma 4 12B for quick tasks – this is where the overhead adds up when you're doing it manually through Ollama or llama.cpp.

Atomic Chat handles the switching, model management, and quantization configuration. It runs natively on Mac, uses TurboQuant and MTP for inference speed, and gives access to 1,000+ models with no CLI setup. Both tests above were run in Atomic Chat: no manual quantization configuration, fully offline, no API limits.

Run it on your Mac, connect it to your iPhone (or Android) – and get a private AI assistant without any cloud involved.

→ Try Atomic Chat for running multiple models in one app

FAQ

How much RAM do I need to run a self-hosted LLM?

8GB gets you a 7B model. For anything practical (27B to 35B range), you need 32-48GB. On Apple Silicon, unified memory means there's no separate GPU VRAM to worry about.

Is a self-hosted LLM actually private?

After the initial model download, nothing leaves your machine. No API calls, no vendor logging, no data used for training. It runs as a local process.

Is self-hosting an LLM free?

The models are free, the cost is hardware and electricity. On an M-series Mac you already own, the marginal cost is electricity, roughly $5-20/month depending on usage. The upfront cost is the machine itself.

Can I use a self-hosted LLM on my phone?

Yes, if your Mac is running the model. Atomic Chat lets your phone connect to the model over your local network: the inference happens on the Mac, and your phone sends and receives text through it. You need both devices on the same network.
Try Atomic Bot on your desktop:
→ on macOS

And connect your models to your phone:
→ Android
→ iOS

What's the best self-hosted LLM right now?

For coding: Gemma 4 31B. For writing and general use: Qwen 3.6 35B if you have 48GB RAM, Qwen 3.6 27B if you don't. For 16GB machines: Gemma 4 12B. For best local quality on 64GB: Llama 4 Scout. There's no single best model – it depends on the task.

Check our detailed breakdown on best LLMs for coding, both Cloud and Local, in 2026.

Can a bigger model be faster?

Yes, in our test, Qwen 3.6 35B ran at 72 tok/sec while Qwen 3.6 27B ran at 16 tok/sec. The 35B uses Mixture of Experts architecture, which activates fewer parameters per token. Size and speed don't map linearly.

Conclusion

It’s enough for you to have 16GB Apple Silicon to delegate your regular coding or data work to self-hosting. Especially, when the setup is just 10 minutes and the hardware pays for itself in under three months at just 50M tokens/month compared to GPT-5.5 at $30/M. For sensitive data, the cloud isn't an option anyway.

Model selection is the harder part, and the tests made that less obvious: speed can’t predict output quality, as our tests have shown. It also still breaks on current information, long reasoning chains, and at the point when there is no one to maintain the setup.

Best Open Source LLM in 2026: 10 Models Ranked

The 10 best open source LLMs of 2026, ranked and tested — Qwen3.6, GLM-5.2, Kimi K3, DeepSeek-V4 and more, with benchmarks, licenses, and the hardware to run each.

LLM updates

7/21/26

16 min

6 Offline AI Apps for iPhone and Android (2026)

Which offline AI app actually works on your phone? Seven apps compared by speed, RAM, and privacy — with device benchmarks and honest recommendations.

LLM updates

6/9/26

8 min read