TL;DR
Token-based AI pricing made sense when high-quality models only lived in the cloud. That’s no longer the case.
Today’s local LLMs (like Gemma, Qwen, and Mistral) are powerful enough to handle a large percentage of real-world workloads—especially things like RAG, coding assistants, document analysis, and internal tools. And once you move those workloads local, the economics change completely.
Instead of paying per prompt, per retry, and per experiment, you pay once in compute and iterate freely.
Cloud models still win for frontier reasoning and scale. But for repeatable, private, and high-volume workflows, paying a token tax on every interaction is increasingly hard to justify.
For a long time, I treated token billing as the cost of being in the room. If you wanted good AI, you paid per prompt, per response, per retrieval chunk, per long context, per tool call, and then you tried not to think too hard about the invoice at the end of the month.
I don’t really think that model makes sense anymore.
After spending more time building local AI workflows, I’ve started to see token-based APIs differently. They still absolutely have a place. But for a growing class of real work, they no longer feel like the obvious default. They feel like rented intelligence. And once you notice that, it gets hard to unsee.
Why I Looked at This
A lot of my recent thinking came from practical work, not theory.
I kept coming back to local RAG pipelines: Ollama running a local model, document ingestion, embeddings, vector search, and a simple application layer on top. In my case, that meant experimenting with Gemma 3:27B locally, wiring it into retrieval, and building workflows that could continuously ingest and query new documents without retraining the base model. I also looked at hybrid patterns where classic RAG does most of the heavy lifting and a lightweight graph layer helps when the problem becomes more structured or multi-hop.
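To make that concrete, here is a stripped-down sketch of that kind of loop in Python, assuming the Ollama Python client with a pulled gemma3:27b model and a local embedding model (the nomic-embed-text tag, the tiny in-memory corpus, and the brute-force retrieval are illustrative assumptions, not a prescription):

```python
# Bare-bones local RAG: embed documents once, retrieve by cosine similarity,
# and let the local model answer grounded in the best match. Model tags and
# the in-memory "corpus" are placeholders for whatever you actually run.
import ollama

documents = [
    "Invoices are processed within 30 days of receipt.",
    "Refund requests must include the original order number.",
]

# Embed each document once; in a real pipeline these vectors would live in a
# local vector store and be refreshed as the documents change.
doc_vectors = [
    ollama.embeddings(model="nomic-embed-text", prompt=d)["embedding"]
    for d in documents
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

def ask(question: str) -> str:
    # Retrieve the closest document, then answer grounded in it. Nothing in
    # this loop is metered; re-running it costs only local compute.
    q_vec = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    best = max(range(len(documents)), key=lambda i: cosine(q_vec, doc_vectors[i]))
    reply = ollama.chat(
        model="gemma3:27b",
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{documents[best]}\n\nQuestion: {question}",
        }],
    )
    return reply["message"]["content"]

print(ask("How long does invoice processing take?"))
```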
That changed the question for me.
Instead of asking, “Can a local model do enough?”, I started asking, “Why am I paying a meter every time I want to think with my own data?”
The Real Cost of Token-Based Work
The biggest issue is not that API pricing is always outrageous in absolute terms. The issue is that the pricing model turns iteration into a billable event.
Every retry costs something. Every evaluation run costs something. Every chunk you stuff into context costs something. Tools bring overhead too: Anthropic explicitly documents an additional tool-use system prompt measured in the hundreds of tokens before you even count your own payload. And across OpenAI, Anthropic, and Google, the unit you are fundamentally buying is still usage.
At current list prices, GPT-5.4 is $2.50 per million input tokens and $15 per million output tokens; GPT-5.4 mini is $0.75 and $4.50. Claude Sonnet 4.6 is $3 and $15. Gemini 2.5 Pro is $1.25 input and $10 output for prompts up to 200k tokens, while Gemini 2.5 Flash is $0.30 input and $2.50 output. Those prices are not irrational for frontier capability, but they add up quickly once a workflow becomes repetitive instead of occasional.
Take a very ordinary internal workflow: a document assistant, codebase helper, or research pipeline that processes 50 million input tokens and 10 million output tokens in a month. At current rates, that works out to about $82.50 on GPT-5.4 mini, $300 on Claude Sonnet 4.6, and $40 on Gemini 2.5 Flash. Push the same workload to GPT-5.4 and you are at $275. That is one workflow, at list price, before you add search, storage, evaluation, or multiple teammates hammering it every day.
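Those figures are nothing more than the unit prices multiplied through, and they are easy to sanity-check. A quick sketch of the arithmetic, using the list prices quoted above:

```python
# Monthly cost for a workflow that consumes 50M input and 10M output tokens,
# at the list prices quoted in the text (dollars per million tokens).
PRICES_PER_MTOK = {
    "GPT-5.4":           (2.50, 15.00),
    "GPT-5.4 mini":      (0.75, 4.50),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Flash":  (0.30, 2.50),
}

input_mtok, output_mtok = 50, 10  # millions of tokens per month

for model, (p_in, p_out) in PRICES_PER_MTOK.items():
    cost = input_mtok * p_in + output_mtok * p_out
    print(f"{model}: ${cost:,.2f}/month")
# GPT-5.4: $275.00 | GPT-5.4 mini: $82.50 | Claude Sonnet 4.6: $300.00 | Gemini 2.5 Flash: $40.00
```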
Yes, vendors offer discounts. OpenAI discounts cached input heavily, Anthropic offers lower batch pricing, and Google does the same for Gemini batch workloads. But that doesn’t change the core dynamic. You are still spending time optimizing around a meter. You are still designing your system to avoid using it too much.
That is the part I’ve become less willing to accept.
What Changed: Local Models Grew Up
A year or two ago, the local argument was mostly philosophical: privacy, control, independence. For many practical tasks, the quality gap was simply too large.
That gap has narrowed.
Google’s Gemma 3 line is a good example of why this conversation has changed. Gemma 3 supports multimodal input, up to a 128k context window, over 140 languages, structured outputs, and function calling, and it ships in 1B, 4B, 12B, and 27B sizes. Google also highlighted that Gemma 3 27B ranked highly in Chatbot Arena while, in their comparison chart, requiring only a single GPU where some competing models required many more.
The ecosystem around it has matured too. Google’s own docs note that Gemma can be run with Ollama and llama.cpp on a laptop or other small device without a dedicated GPU. llama.cpp is optimized for Apple Silicon and supports aggressive integer quantization from 1.5-bit through 8-bit, which is one of the big reasons local inference has become viable on everyday hardware. LM Studio can expose OpenAI-like endpoints on localhost or a local network, and once the model is on the machine, it can operate offline.
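In practice, pointing existing tooling at one of those local servers is often just a base URL change, because they speak the OpenAI wire format. A minimal sketch, assuming Ollama's default port of 11434 (LM Studio commonly listens on 1234) and a locally pulled model tag:

```python
# The standard OpenAI client, pointed at a local OpenAI-compatible server.
# Port and model tag are assumptions; use whatever your local runtime reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama default; LM Studio commonly uses :1234
    api_key="not-needed-locally",          # ignored by local servers, but the client requires a value
)

reply = client.chat.completions.create(
    model="gemma3:27b",
    messages=[{"role": "user", "content": "Summarize this document in two sentences."}],
)
print(reply.choices[0].message.content)
```

Once the model is downloaded, that call works with the network cable unplugged.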
And it is not just Gemma.
Mistral announced Mistral 3 as an open multimodal and multilingual family with dense 3B, 8B, and 14B models under Apache 2.0, plus a larger MoE model. For code specifically, Mistral’s Devstral Small 2 is a 24B model that the company says is deployable locally on consumer hardware, including consumer GPUs and even CPU-only configurations without a dedicated GPU. Qwen’s official Qwen3 materials also include explicit “run locally” guidance with llama.cpp, Ollama, and LM Studio.
That matters because it means the local story is no longer just “you can run a toy model on your laptop.” It is now “there are serious, current model families being designed with local deployment as a first-class path.”
RAG Is Where Local Economics Start Looking Very Good
This became especially obvious to me with retrieval.
For many business and personal workflows, the base model does not need to know everything. It needs to reason over the right documents, code, notes, transcripts, manuals, or research at the right time. That is a retrieval problem much more than a pretraining problem.
Once you realize that, local begins to look much stronger economically.
You download a model once. You keep your embeddings local. You re-index as your documents change. You run the same assistant fifty times in a day without watching a usage counter in the background. The cost model changes from variable per interaction to fixed local compute.
And the local embedding story has improved too. Google’s EmbeddingGemma is a 308M parameter multilingual embedding model built for on-device use. Google says it supports on-device RAG and semantic search, can run on everyday devices, and can fit in under 200MB of RAM with quantization. That is not a side detail. That is one of the missing pieces that makes a fully local retrieval stack feel plausible instead of aspirational.
That lines up almost perfectly with what I’ve been working on: local retrieval, local indexing, local querying, and automated refresh pipelines that simulate “continuous learning” without pretending the base model is retraining itself in the background.
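The refresh half of that is mostly bookkeeping. Here is a minimal sketch of the pattern, assuming an embedding model served through Ollama (the embeddinggemma tag, the directory layout, and the JSON index file are all illustrative assumptions): only documents that changed since the last run get re-embedded, so the index tracks new material without any retraining.

```python
# Incremental re-indexing: hash each document, re-embed only the ones whose
# content changed, and persist the vectors locally. Paths, the model tag, and
# the JSON store are placeholders for a real pipeline.
import hashlib
import json
import pathlib

import ollama

DOCS_DIR = pathlib.Path("docs")
INDEX_PATH = pathlib.Path("index.json")

index = json.loads(INDEX_PATH.read_text()) if INDEX_PATH.exists() else {}

for path in sorted(DOCS_DIR.glob("*.txt")):
    text = path.read_text()
    digest = hashlib.sha256(text.encode()).hexdigest()
    entry = index.get(str(path))
    if entry and entry["digest"] == digest:
        continue  # unchanged since the last refresh; keep the existing vector
    vector = ollama.embeddings(model="embeddinggemma", prompt=text)["embedding"]
    index[str(path)] = {"digest": digest, "embedding": vector}

INDEX_PATH.write_text(json.dumps(index))
print(f"index now covers {len(index)} documents")
```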
What Worked for Me
What worked best was not trying to force a local model to be a magic universal intelligence.
What worked was using it for the jobs it is structurally good at:
- private document Q&A
- internal research assistants
- coding copilots tied to a real codebase
- summarization and extraction
- classification and transformation pipelines
- repeated evaluation loops
- workflows where retrieval matters more than raw frontier reasoning
That is where local starts to feel less like a compromise and more like the right architecture.
In that setup, I’m not paying a provider every time I want to re-run a prompt, compare outputs, test a new chunking strategy, or let an agent explore my own files for ten extra minutes. I’m using hardware I already control to repeatedly extract value from a model I already have.
That difference sounds subtle, but it changes behavior. Metered systems make you conservative. Local systems make you iterative.
And iteration is where a lot of actual product value gets created.
What Still Falls Short
This is not a “cloud is dead” argument, and I don’t think it should be framed that way.
There are still obvious cases where paid APIs win:
- when you need the absolute best frontier reasoning available
- when you need elastic scale for a public-facing product
- when you want managed infrastructure, support, and enterprise controls
- when the quality gap is material enough that the economics stop mattering
I also wouldn’t pretend local is zero-cost. It is not. You pay in setup time, model selection, quantization choices, hardware constraints, prompt work, and evaluation discipline. And local models do not magically “learn” from new information unless you build retrieval, fine-tuning, or adapter workflows around them.
But those are engineering costs. They are not rent on every thought.
That distinction matters.
The Default I’d Use Now
My default has changed.
If I am building something around my own documents, my own notes, my own code, or a repeatable internal workflow, I would now start local first and justify the API second.
Not because local is always better.
Because for a surprising amount of real work, it is now good enough, private enough, flexible enough, and economically better aligned.
I would reach for paid APIs when I can clearly articulate why I need them: better reasoning, better reliability at scale, better multimodal performance, or faster time to production.
What I would no longer do is assume that every assistant, every RAG workflow, every coding helper, and every internal automation deserves an open-ended token bill by default.
Takeaway
The strongest argument for local LLMs is no longer ideology. It is economics plus adequacy.
Today’s local model ecosystem gives us capable open models, practical local runtimes, quantization that works on consumer hardware, and even on-device embedding models for full local RAG pipelines. At the same time, the major API vendors are still charging by token, by output, by context size, and in some cases by auxiliary tool usage.
That means the real question is no longer whether local LLMs can replace the cloud everywhere.
They can’t.
The real question is whether the cloud still deserves to be the default for work you control, data you own, and workflows you run every day.
I don’t think it does.