The Cost-Efficient AI Stack: Ship AI Features Without the Runaway Bill
Over the past few months we have written about the individual pieces of running AI in production. This is the post that connects them — the single idea beneath all of them, and the one our work is built on.
The idea, in a sentence: most teams overpay for AI because they send every request to a frontier model, when they could send each unit of work to the cheapest model that can do it well. Everything else we build (hybrid routing, self-hosting, orchestration, observability) exists to serve that one principle.
The Runaway Bill Is a Design Choice
The first AI feature a team ships almost always works the same way: a call to Claude or GPT, wrapped in a prompt, returning a result. It works beautifully. So the second feature works the same way. And the tenth. Within a quarter, every product surface that touches AI is a frontier API call, and the monthly invoice has become a line item nobody can explain — growing with usage, with no ceiling in sight.
This is not a pricing problem. Frontier models are reasonably priced for what they are. It is an allocation problem. The team is paying frontier prices for work that does not need frontier intelligence: reformatting JSON, drafting a commit message, summarizing a document, extracting fields, generating boilerplate. As we put it in the hybrid AI playbook, running Claude Opus to reformat JSON is like hiring a senior architect to paint walls. The work gets done; the economics make no sense.
The insight that unlocks everything is that a typical AI workload splits cleanly:
- A small fraction is genuine reasoning — planning, judgment, multi-step problem solving, anything ambiguous. This needs a frontier model, and it is worth every cent.
- The large majority is well-defined execution — high-volume, low-judgment work where a good open model lands within a few percent of frontier quality.
Once you see that split, the whole architecture follows. You stop asking "which AI vendor?" and start asking "which model is the cheapest one that can do this specific task well?"
Four Layers, One Principle
The cost-efficient AI stack has four layers. We have written about each one already; together, they answer the question of how you actually do this in production.
1. Hybrid routing — cloud for thinking, local for doing
The routing layer is where the principle becomes mechanism. Frontier models (Claude, GPT) handle the reasoning; local models (Qwen, Llama) handle the volume. The router's only job is to send each request to the right lane.
The key design decision, which we make the same way every time: explicit routing beats clever classification. You almost always know at call time whether a request is "plan this migration" or "generate a unit test from this spec." Tag it and route accordingly; don't build an ML classifier to rediscover what you already know. This is the full argument in the hybrid AI playbook, and it is where teams see their first large drop in spend, on the order of 60–80%.
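A minimal sketch of explicit routing, under stated assumptions: the task tags, the `Lane` names, and the two stub backends are hypothetical and stand in for whatever frontier API and local inference server you actually run. The point it illustrates is that the caller supplies the tag and the router only looks it up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Lane(Enum):
    REASONING = "reasoning"   # frontier model: planning, judgment, ambiguity
    EXECUTION = "execution"   # local open model: well-defined, high-volume work


# Hypothetical task tags mapped to lanes. The caller picks the tag at call
# time; nothing here tries to infer it from the prompt.
TASK_LANES: Dict[str, Lane] = {
    "plan_migration": Lane.REASONING,
    "generate_unit_test": Lane.EXECUTION,
    "reformat_json": Lane.EXECUTION,
    "summarize_document": Lane.EXECUTION,
}


@dataclass
class Router:
    backends: Dict[Lane, Callable[[str], str]]

    def route(self, task: str, prompt: str) -> str:
        # Unknown tags fail loudly instead of silently defaulting to the
        # expensive lane.
        if task not in TASK_LANES:
            raise KeyError(f"unknown task tag: {task!r}")
        return self.backends[TASK_LANES[task]](prompt)


# Stand-in backends; real ones would call a frontier API and a local server.
def frontier_stub(prompt: str) -> str:
    return f"[frontier] {prompt}"


def local_stub(prompt: str) -> str:
    return f"[local] {prompt}"


router = Router({Lane.REASONING: frontier_stub, Lane.EXECUTION: local_stub})
print(router.route("reformat_json", '{"a":1}'))
print(router.route("plan_migration", "move service off deprecated API"))
```

Raising on an unknown tag is a deliberate choice in this sketch: a silent fallback to the frontier lane is how untagged work quietly becomes frontier spend again.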
2. Self-hosted inference — own the marginal cost
Routing volume work to "local models" only pays off if you can actually run them economically. That is the self-hosting LLMs on Kubernetes layer: open models served on your own hardware or cloud GPUs, with the operational discipline — GPU scheduling, autoscaling, batching — that keeps utilization high.
The economic point is the one worth internalizing: once you own the hardware, the marginal cost of a token approaches the cost of electricity. On rented cloud GPUs the floor is higher, since you pay per GPU-hour, but the shape is the same: cost per token falls as utilization rises. Your AI spend stops scaling linearly with usage and starts behaving like infrastructure. As a bonus that increasingly matters, your proprietary data never leaves your environment.
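The arithmetic behind that claim can be sketched as a simple two-line cost model. Everything numeric here is an illustrative assumption, not a measured price: the API price, the fixed monthly cost (hardware amortization plus operations), and the marginal cost per million tokens are placeholders to replace with your own figures.

```python
def api_cost(tokens: float, price_per_million: float) -> float:
    """Pay-per-token spend: scales linearly with usage."""
    return tokens / 1_000_000 * price_per_million


def self_hosted_cost(tokens: float, fixed_monthly: float,
                     marginal_per_million: float) -> float:
    """Fixed monthly cost (hardware amortization, operations) plus a small
    per-token marginal cost (mostly power)."""
    return fixed_monthly + tokens / 1_000_000 * marginal_per_million


def break_even_tokens(price_per_million: float, fixed_monthly: float,
                      marginal_per_million: float) -> float:
    """Monthly token volume above which self-hosting is cheaper."""
    saving_per_million = price_per_million - marginal_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly / saving_per_million * 1_000_000


# Illustrative inputs only, not measured prices.
PRICE = 10.0       # API price, dollars per million tokens
FIXED = 4_750.0    # self-hosting fixed cost, dollars per month
MARGINAL = 0.5     # self-hosting marginal cost, dollars per million tokens

threshold = break_even_tokens(PRICE, FIXED, MARGINAL)
print(f"break-even: {threshold / 1e6:.0f}M tokens per month")
for volume in (1e8, 1e9):
    print(volume, api_cost(volume, PRICE), self_hosted_cost(volume, FIXED, MARGINAL))
```

With these placeholder inputs the break-even sits at 500 million tokens per month. Below it the API is cheaper, which is the same point the "When Not to Do This" section makes about small spend.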
3. Orchestration — for goals, not just requests
Single-request routing is enough when you know which lane each call belongs to. But the most valuable AI work is open-ended: "migrate this service off the deprecated API," "audit these 400 documents." You cannot pre-route a goal, because you do not yet know which tasks it decomposes into.
That is the agent control plane: a frontier model decomposes the goal and plans, a durable task ledger holds the work, and a fleet of workers — some calling cloud APIs, most running local models — pulls tasks and runs them, spawning new tasks as it goes. The same principle scales up: the expensive model plans and aggregates; the cheap fleet does the volume. The discipline is keeping the frontier model off the hot path so you are not paying it to supervise the cheap work.
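The control loop described above can be sketched in a few dozen lines. This is a toy under stated assumptions, not a description of any particular product: the ledger is an in-memory queue where a real one would be durable storage, the planner, worker, and aggregator are stubs, and all names (`Task`, `Ledger`, `run_goal`) are hypothetical. It shows the shape that matters: the frontier model is called once to plan and once to aggregate, and never inside the worker loop.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Tuple


@dataclass
class Task:
    kind: str             # hypothetical task type, e.g. "audit_document"
    payload: str
    lane: str = "local"   # which worker pool runs it: "local" or "cloud"


@dataclass
class Ledger:
    """In-memory stand-in for a durable task ledger; a real one would
    persist tasks so work survives restarts."""
    pending: Deque[Task] = field(default_factory=deque)
    results: List[str] = field(default_factory=list)


Worker = Callable[[Task], Tuple[str, List[Task]]]


def run_goal(goal: str,
             plan: Callable[[str], List[Task]],
             workers: Dict[str, Worker],
             aggregate: Callable[[str, List[str]], str]) -> str:
    ledger = Ledger()
    ledger.pending.extend(plan(goal))          # frontier call 1: decompose
    while ledger.pending:                      # hot path: no frontier calls
        task = ledger.pending.popleft()
        result, spawned = workers[task.lane](task)
        ledger.results.append(result)
        ledger.pending.extend(spawned)         # workers may spawn new tasks
    return aggregate(goal, ledger.results)     # frontier call 2: aggregate


frontier_calls = 0  # counts stubbed frontier invocations


def plan_stub(goal: str) -> List[Task]:
    global frontier_calls
    frontier_calls += 1
    return [Task("audit_document", f"doc-{i}") for i in range(3)]


def local_worker(task: Task) -> Tuple[str, List[Task]]:
    # One document turns out to need a follow-up task.
    spawned = ([Task("follow_up", task.payload + "-appendix")]
               if task.payload == "doc-0" else [])
    return f"done:{task.payload}", spawned


def aggregate_stub(goal: str, results: List[str]) -> str:
    global frontier_calls
    frontier_calls += 1
    return f"{goal}: {len(results)} tasks"


summary = run_goal("audit", plan_stub, {"local": local_worker}, aggregate_stub)
print(summary, "| frontier calls:", frontier_calls)
```

The counter makes the discipline checkable: however many tasks the workers spawn, the number of frontier calls stays fixed in this sketch.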
4. Observability — you cannot optimize what you cannot see
A cost-efficient stack that you cannot measure will drift back toward expensive. The observability layer instruments every request with the things that actually matter: tokens, latency, time-to-first-token, GPU utilization, and cost per request — attributable down to the feature that incurred it.
This is what turns the architecture from a one-time win into a system you can tune. When frontier spend creeps above its budget, the trace data tells you exactly which calls to re-route. It is FinOps for inference, and it is the layer that keeps the other three honest.
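A minimal sketch of per-feature cost attribution, assuming each request already emits a trace record. The `RequestTrace` fields, the feature names, the budgets, and the per-token prices are all hypothetical placeholders; a real deployment would read these from its tracing backend and its own price sheet.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class RequestTrace:
    feature: str        # product feature that issued the call
    model: str          # key into PRICES
    input_tokens: int
    output_tokens: int
    latency_s: float


# Hypothetical dollars per million tokens as (input, output); not real prices.
PRICES = {"frontier": (3.0, 15.0), "local": (0.1, 0.1)}


def request_cost(t: RequestTrace) -> float:
    in_price, out_price = PRICES[t.model]
    return (t.input_tokens * in_price + t.output_tokens * out_price) / 1_000_000


def cost_by_feature(traces: Iterable[RequestTrace]) -> Dict[str, float]:
    totals: Dict[str, float] = defaultdict(float)
    for t in traces:
        totals[t.feature] += request_cost(t)
    return dict(totals)


def over_budget(traces: Iterable[RequestTrace],
                budgets: Dict[str, float]) -> List[str]:
    """Features whose frontier spend exceeds budget: candidates to re-route."""
    spend = cost_by_feature(t for t in traces if t.model == "frontier")
    return sorted(f for f, cost in spend.items()
                  if cost > budgets.get(f, float("inf")))


traces = [
    RequestTrace("commit_messages", "frontier", 1_000_000, 200_000, 1.2),
    RequestTrace("migration_planner", "frontier", 500_000, 100_000, 4.0),
    RequestTrace("commit_messages", "local", 1_000_000, 1_000_000, 0.4),
]
budgets = {"commit_messages": 5.0, "migration_planner": 10.0}

print(cost_by_feature(traces))
print(over_budget(traces, budgets))
```

In this toy data the commit-message feature is the one flagged, which matches the argument: execution-style work on the frontier lane is the first thing the traces should surface.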
The Unifying Principle, Restated
Strip away the technology and every layer is doing the same thing: matching the cost of the model to the value of the work.
- Hybrid routing matches it per request.
- Self-hosting drives the floor of that cost toward electricity.
- Orchestration applies the match to open-ended goals that decompose into many requests.
- Observability lets you verify the match is holding and correct it when it drifts.
The result is a system where the marginal cost of doing more AI trends toward the cost of compute — which is exactly where you want it. You can ship AI into every product surface without the bill becoming the thing that decides what you're allowed to build.
Where It Starts: Your Own Machine
You do not adopt this all at once, and you do not start with a GPU cluster. The cheapest place to internalize the whole philosophy is your own workstation — which is why we wrote the personal AI dev environment guide. Install a local model, route your autocomplete and commit messages and summaries to it, and keep a frontier model in your pocket for the genuinely hard problems. Within a day you will see, viscerally, how much of your daily AI usage never needed the cloud. That realization is the whole thesis, in miniature.
When Not to Do This
Intellectual honesty matters more than a clean pitch, so: this architecture is not free, and it is not always worth it.
- If your AI spend is small, the engineering effort to self-host will not pay back. Stay on frontier APIs until the bill is large enough to fund the work — then audit.
- If every one of your requests genuinely needs frontier reasoning, there is no volume tier to offload. Hybrid routing only helps when a meaningful fraction of your work is well-defined execution. Most workloads have that fraction; some genuinely don't.
- If you have no one to operate inference, self-hosting becomes a liability, not a saving. The hidden cost of local models is operational, not computational — which is exactly the gap we exist to fill.
The right first step is almost never "self-host everything." It is to measure — find out what fraction of your spend is execution work that could move — and then move the obvious wins first.
The Bottom Line
The cost-efficient AI stack is not exotic. It is a router that tags lanes, open models served on your own hardware, an orchestrator that keeps the expensive model off the hot path, and tracing that attributes cost to the request that caused it. The hard part is not any single piece — it is the discipline to keep matching the cost of the model to the value of the work, and to measure that you're still doing it.
Get that right and AI stops being a runaway line item and becomes what it should be: infrastructure you own, whose cost per unit falls the more you use it. That is the architecture we build for teams at Entuit, and it is the through-line in everything we write.
Start by measuring. Take one workflow you currently run entirely through a frontier API, instrument what it costs, and move the obvious execution work to a local model. Watch the bill, then do it again.