The Hybrid AI Playbook: Cloud Models for Thinking, Local Models for Doing
The biggest mistake teams make with AI integration is using the same model for everything. A developer running Claude Opus to reformat JSON files is like hiring a senior architect to paint walls. The work gets done, but the economics make no sense.
The smarter approach is a hybrid architecture: use frontier cloud models (Claude, GPT) for tasks that require deep reasoning, multi-step planning, and complex judgment — then route the execution work to local models running on your own hardware. This is not theoretical. Teams we work with at Entuit have cut their AI spend by 60-80% using this pattern, without meaningful quality loss on the tasks that matter.
This post covers the architecture, the model recommendations, and the hardware you need to run it.
Why Hybrid Works
Frontier models like Claude Opus and GPT-4o are exceptional at tasks that require broad knowledge, nuanced reasoning, and long-context understanding. But they are expensive — Claude Opus runs $15 per million input tokens and $75 per million output tokens, and GPT-4o runs $2.50 and $10 respectively. For a team making thousands of API calls per day, that compounds into serious money fast.
The key insight is that most of those calls do not need frontier-level intelligence. A typical AI-assisted workflow looks something like this:
1. Planning — Analyze a codebase, understand requirements, design a solution approach
2. Execution — Write the code, generate tests, format output, extract data
3. Review — Evaluate the result, check for correctness, suggest improvements
Steps 1 and 3 genuinely benefit from a model that can reason across a large context and make nuanced judgments. Step 2 — which accounts for 60-80% of the total tokens — often does not. Code generation from a clear spec, text summarization, data extraction, JSON transformation, boilerplate scaffolding — these are well-defined tasks where a good 8B-32B parameter local model performs within 5-10% of a frontier model.
The cost difference is dramatic:
| Task Type | Cloud Model (Claude Opus) | Local Model (Qwen 2.5 32B) | Savings |
|---|---|---|---|
| 1M tokens of code generation | $75.00 | ~$0.12 (electricity) | 99.8% |
| 1M tokens of summarization | $75.00 | ~$0.12 (electricity) | 99.8% |
| 1M tokens of complex reasoning | $75.00 | Poor quality — use cloud | 0% |
Local model costs assume amortized hardware running at reasonable utilization. The point is not that local inference is free — it is that the marginal cost per token approaches zero once you own the hardware.
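To see how that translates into the 60-80% savings figure, here is a back-of-the-envelope sketch. The 80/20 split between execution and reasoning tokens is an illustrative assumption, not a measurement:
# Rough blended cost per 1M output tokens under an assumed token split.
CLOUD_PER_M = 75.00      # Claude Opus output pricing, $/1M tokens
LOCAL_PER_M = 0.12       # amortized electricity estimate from the table above

execution_share = 0.80   # assumed share of tokens that are well-defined execution work

all_cloud = CLOUD_PER_M
hybrid = execution_share * LOCAL_PER_M + (1 - execution_share) * CLOUD_PER_M

savings = 1 - hybrid / all_cloud
print(f"All-cloud: ${all_cloud:.2f}/1M  Hybrid: ${hybrid:.2f}/1M  Savings: {savings:.0%}")
# All-cloud: $75.00/1M  Hybrid: $15.10/1M  Savings: 80%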
The Architecture
A hybrid setup has three components: a routing layer that decides which model handles each task, a cloud API client for frontier models, and a local inference server for execution tasks.
┌─────────────────────────────────────────────┐
│ Your Application │
├─────────────────────────────────────────────┤
│ Router / Orchestrator │
│ ┌─────────────┐ ┌─────────────────┐ │
│ │ Cloud API │ │ Local Inference │ │
│ │ (Claude / │ │ (Ollama / vLLM) │ │
│ │ OpenAI) │ │ │ │
│ │ │ │ Qwen 2.5 32B │ │
│ │ Planning │ │ Llama 3.1 8B │ │
│ │ Review │ │ DeepSeek-Coder │ │
│ │ Complex │ │ Codestral │ │
│ │ reasoning │ │ │ │
│ └─────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────┘
The router does not need to be sophisticated. In most cases, explicit task-type routing is better than trying to dynamically classify requests. You know at development time whether a call is "plan this feature" or "generate a unit test from this spec." Route accordingly.
import anthropic
import openai
# Cloud client for planning and reasoning
cloud = anthropic.Anthropic()
# Local client — Ollama and vLLM both expose OpenAI-compatible APIs
local = openai.OpenAI(
base_url="http://localhost:11434/v1", # Ollama
api_key="ollama", # Ollama doesn't need a real key
)
def plan_feature(requirements: str, codebase_context: str) -> str:
"""Use Claude for complex planning that needs deep reasoning."""
response = cloud.messages.create(
model="claude-sonnet-4-6-20250514",
max_tokens=4096,
messages=[{
"role": "user",
"content": f"""Analyze these requirements and design an implementation plan.
Requirements: {requirements}
Existing codebase context:
{codebase_context}
Provide a detailed plan including file changes, function signatures,
and edge cases to handle."""
}]
)
return response.content[0].text
def generate_code(spec: str) -> str:
"""Use a local model for well-defined code generation."""
response = local.chat.completions.create(
model="qwen2.5-coder:32b",
messages=[{
"role": "user",
"content": f"""Generate the implementation based on this spec.
Output only the code, no explanation.
{spec}"""
}],
temperature=0.1,
)
return response.choices[0].message.content
def extract_data(document: str, schema: str) -> str:
"""Use a local model for structured data extraction."""
response = local.chat.completions.create(
model="llama3.1:8b",
messages=[{
"role": "user",
"content": f"""Extract data from this document into the given JSON schema.
Output valid JSON only.
Document: {document}
Schema: {schema}"""
}],
temperature=0.0,
)
return response.choices[0].message.content
The key detail here is that both Ollama and vLLM expose OpenAI-compatible APIs. This means your local inference server is a drop-in replacement — you can swap between cloud and local by changing a base URL, not rewriting your application logic.
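Because both backends speak the same protocol, the backend choice can live in configuration rather than code. Here is a minimal sketch; the INFERENCE_BACKEND and LOCAL_LLM_URL environment variable names are our own convention for illustration, not anything Ollama or OpenAI defines:
import os
import openai

def make_client(backend: str | None = None) -> openai.OpenAI:
    """Return an OpenAI-compatible client for either the local or cloud backend."""
    backend = backend or os.getenv("INFERENCE_BACKEND", "local")
    if backend == "local":
        return openai.OpenAI(
            base_url=os.getenv("LOCAL_LLM_URL", "http://localhost:11434/v1"),
            api_key="ollama",  # Ollama ignores the key, but the client requires one
        )
    # Cloud path: uses OPENAI_API_KEY and the default api.openai.com base URL
    return openai.OpenAI()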
Which Tasks Go Where
Not every task benefits equally from a local model. Here is how we break it down based on real-world deployments:
Send to Cloud (Claude / GPT)
- Architectural planning — Designing system components, evaluating tradeoffs, multi-file refactoring plans
- Complex code review — Catching subtle bugs, security vulnerabilities, race conditions
- Long-context analysis — Understanding large codebases, multi-document reasoning
- Ambiguous requirements — Interpreting vague specs, asking clarifying questions, making judgment calls
- Novel problem-solving — Tasks where the model needs broad world knowledge or creative approaches
Run Locally
- Code generation from specs — When the plan is clear and the task is well-defined
- Unit test generation — Given a function signature and behavior description, generate tests
- Text summarization — Condensing documents, generating abstracts, creating changelogs
- Data extraction — Pulling structured data from unstructured text
- Format conversion — JSON to CSV, markdown to HTML, log parsing
- Boilerplate generation — CRUD endpoints, database models, API clients
- Commit messages and documentation — Generating descriptions from diffs
- Translation and rewriting — Converting between languages, adjusting tone
The rule of thumb: if you can write a clear, specific prompt that a junior developer could follow without asking questions, a local model can handle it.
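One way to make this rule executable is a static routing table keyed by task type: no classifier at runtime, just an explicit decision at each call site. A minimal sketch, where the task names and the Route shape are illustrative choices rather than a fixed convention:
from dataclasses import dataclass

@dataclass
class Route:
    backend: str   # "cloud" or "local"
    model: str

# Explicit task-type routing: decided at development time, not inferred at runtime.
ROUTES = {
    "plan_feature":        Route("cloud", "claude-sonnet-4-6-20250514"),
    "review_architecture": Route("cloud", "claude-sonnet-4-6-20250514"),
    "generate_code":       Route("local", "qwen2.5-coder:32b"),
    "generate_tests":      Route("local", "qwen2.5-coder:32b"),
    "summarize":           Route("local", "llama3.1:8b"),
    "extract_data":        Route("local", "llama3.1:8b"),
}

def route(task_type: str) -> Route:
    # Unknown task types default to cloud so quality never silently degrades.
    return ROUTES.get(task_type, Route("cloud", "claude-sonnet-4-6-20250514"))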
Recommended Local Models (May 2026)
Model selection depends on your hardware and use case. Here is what we recommend today:
For Code Generation
Qwen 2.5 Coder 32B — The best local coding model available. It matches or exceeds GPT-4o on most coding benchmarks while running comfortably on consumer hardware with quantization. Handles Python, TypeScript, Go, Rust, Java, and most mainstream languages with strong quality. This is our default recommendation for any team running a hybrid setup.
ollama pull qwen2.5-coder:32b
DeepSeek-Coder-V2 16B — A strong alternative if you are VRAM-constrained. Slightly behind Qwen 2.5 Coder on benchmarks but still very capable, and runs well on 16GB GPUs with quantization.
Codestral 22B (Mistral) — Excellent at multi-file code generation and understanding project structure. A good choice when your code generation tasks involve coordinating across multiple files.
For General Tasks (Summarization, Extraction, Writing)
Llama 3.1 8B — The workhorse model. Fast, efficient, and surprisingly capable for well-defined tasks. Runs on almost any modern GPU (or even CPU-only for light workloads). Use this as your default for simple extraction, formatting, and summarization tasks.
ollama pull llama3.1:8b
Qwen 2.5 32B — When you need more capability than the 8B class but do not want to pay cloud prices. Excellent at following complex instructions, structured output, and multi-step tasks.
Mistral Small 24B — Strong at European languages and structured reasoning. A good choice if your workload involves multilingual content.
For Embeddings and RAG
nomic-embed-text — A solid local embedding model for retrieval-augmented generation pipelines. Run your entire RAG pipeline locally if data sensitivity requires it.
ollama pull nomic-embed-text
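The embedding side plugs into the same client pattern as the chat calls above, since recent Ollama versions also expose embeddings through the OpenAI-compatible endpoint. A quick sketch, assuming your Ollama version supports the /v1/embeddings route:
import openai

local = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def embed(texts: list[str]) -> list[list[float]]:
    """Embed documents locally for a RAG index."""
    response = local.embeddings.create(model="nomic-embed-text", input=texts)
    return [item.embedding for item in response.data]

vectors = embed(["Hybrid routing cuts inference costs.", "Qwen 2.5 Coder runs locally."])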
Hardware Recommendations
Your hardware choice depends on how many concurrent requests you need to serve and which models you want to run.
Developer Workstation (1-2 Users)
This is the entry point — a single developer or small team running local models alongside their development environment.
Minimum viable setup:
- GPU: NVIDIA RTX 4070 Ti Super (16GB VRAM) — ~$800
- RAM: 32GB DDR5
- Storage: 500GB NVMe (models are 4-20GB each)
- What it runs: Llama 3.1 8B at full speed, Qwen 2.5 Coder 32B with Q4 quantization at acceptable speed (~15 tokens/sec)
Recommended setup:
- GPU: NVIDIA RTX 4090 (24GB VRAM) — ~$1,600
- RAM: 64GB DDR5
- Storage: 1TB NVMe
- What it runs: Qwen 2.5 Coder 32B at Q5 quantization with good speed (~25 tokens/sec), Llama 3.1 8B at full precision with fast inference (~80 tokens/sec)
The RTX 4090 remains the best value for local inference in 2026. Its 24GB of VRAM handles quantized 32B models comfortably, and its consumer price point makes the economics work for even small teams. A single 4090 replaces roughly $500-2,000/month in API costs depending on usage.
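A rough way to sanity-check whether a model fits in VRAM: the weights take roughly parameter count times bits per weight divided by 8, plus a few gigabytes for the KV cache and runtime overhead. The sketch below is a loose estimate, and the flat overhead figure is an assumption rather than a measurement:
def approx_vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 3.0) -> float:
    """Very rough VRAM estimate: weight memory plus a flat allowance for KV cache/overhead."""
    weights_gb = params_b * bits_per_weight / 8  # e.g. 32B at ~4.5 bits (Q4_K_M) is ~18 GB
    return weights_gb + overhead_gb

print(f"Qwen 2.5 Coder 32B @ Q4_K_M: ~{approx_vram_gb(32, 4.5):.0f} GB")  # ~21 GB, fits a 24GB 4090
print(f"Qwen 2.5 Coder 32B @ FP16:   ~{approx_vram_gb(32, 16):.0f} GB")   # ~67 GB, out of reach for one consumer GPU
print(f"Llama 3.1 8B @ Q4_K_M:       ~{approx_vram_gb(8, 4.5):.0f} GB")   # ~8 GB, fits a 16GB card easily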
Team Server (5-15 Users)
For a small team sharing a local inference server, you need more VRAM and the ability to handle concurrent requests.
Recommended setup:
- GPU: 2x NVIDIA RTX 4090 or 1x NVIDIA A6000 (48GB VRAM) — $3,200-4,500
- CPU: AMD Ryzen 9 / Intel i9 (for CPU-offload on larger models)
- RAM: 128GB DDR5
- Storage: 2TB NVMe
- What it runs: Multiple models simultaneously, Qwen 2.5 Coder 32B at higher-precision quantization (Q6/Q8) with fast inference, 70B models with quantization
Run this as a dedicated server on your network, or as a VM in your cloud environment. Ollama handles multiple concurrent requests and queues them when the GPU is busy. For higher concurrency, switch to vLLM, which handles batching more efficiently.
Production Deployment (15+ Users)
At this scale, you should be running vLLM on Kubernetes with proper autoscaling. See our guide to self-hosting LLMs on Kubernetes for the full architecture. The hybrid approach still applies — your Kubernetes-hosted models handle the execution tasks while cloud APIs handle planning and reasoning.
Setting Up Ollama (The 10-Minute Path)
Ollama is the fastest way to get local inference running. It handles model downloading, quantization, and serving behind an OpenAI-compatible API.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull your models
ollama pull qwen2.5-coder:32b # Primary code generation model
ollama pull llama3.1:8b # Fast general-purpose model
ollama pull nomic-embed-text # Embeddings for RAG
# Verify everything works
ollama run qwen2.5-coder:32b "Write a Python function that merges two sorted lists"
Ollama automatically starts a server on localhost:11434 that exposes an OpenAI-compatible API. Your application code talks to it the same way it would talk to the OpenAI API — just change the base URL.
For team deployments, configure Ollama to listen on all interfaces:
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
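For a shared team server, it can also be worth raising Ollama's concurrency limits in the same override file. OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS are documented Ollama settings, but the values below are starting points to tune for your hardware, not universal recommendations:
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Concurrent requests served per loaded model
Environment="OLLAMA_NUM_PARALLEL=4"
# Keep e.g. qwen2.5-coder:32b and llama3.1:8b resident at the same time
Environment="OLLAMA_MAX_LOADED_MODELS=2"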
Pi: Hybrid Routing for Developer Workflows
If you are using AI coding agents in your daily workflow, Pi is worth a look. It is an open-source terminal coding agent that natively supports the hybrid model pattern through its subagent system — you can assign a different model to each subagent based on the task.
The idea maps directly to the hybrid architecture described above. You configure a planning agent that runs on Claude for architectural decisions and complex reasoning, a code generation agent that runs on local Qwen 2.5 Coder for writing implementation code, and a review agent that switches back to Claude for quality checks. Each subagent gets its own model, system prompt, and tools, so the routing is explicit and predictable.
A typical Pi setup for hybrid development looks like this:
- Main agent (Claude Sonnet/Opus) — You interact with this directly. It handles planning, architectural decisions, and task delegation.
- Code generation subagent (Qwen 2.5 Coder 32B via Ollama) — The main agent delegates well-defined implementation tasks here. Runs locally, no API costs.
- Test generation subagent (Qwen 2.5 Coder 32B via Ollama) — Generates unit tests from specs. Another high-volume task that runs locally.
- Reconnaissance subagent (Claude Haiku) — Fast, cheap codebase searches when you need cloud-level understanding but not full reasoning power.
Pi also supports fallback models per agent — if your local Ollama server is overloaded or a cloud provider hits a rate limit, the agent falls back to an alternative without interrupting your workflow.
The key difference from the application-level routing described earlier in this post is scope. Pi is for your development workflow — you sitting at a terminal, building software with AI assistance. For production applications that need to route API calls between cloud and local models at scale, you still want a programmatic routing layer (LiteLLM, custom code, or similar). But for the day-to-day work of writing, reviewing, and shipping code, Pi gives you hybrid routing without writing any infrastructure.
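If you would rather not hand-roll that programmatic layer, LiteLLM gives you one uniform interface across both backends. A minimal sketch, using LiteLLM's provider-prefix model identifiers (check the LiteLLM docs for the exact names your version supports):
from litellm import completion

# Cloud call: the anthropic/ prefix routes to Anthropic, using ANTHROPIC_API_KEY.
plan = completion(
    model="anthropic/claude-sonnet-4-6-20250514",
    messages=[{"role": "user", "content": "Design a migration plan for the auth service."}],
)

# Local call: the ollama/ prefix routes to the local Ollama server.
code = completion(
    model="ollama/qwen2.5-coder:32b",
    messages=[{"role": "user", "content": "Implement the first step of the plan as a Python module."}],
    api_base="http://localhost:11434",
)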
Real-World Example: AI-Assisted Code Review Pipeline
Here is a complete example of a hybrid pipeline we built for a client. It runs as a GitHub Action on every pull request:
- Cloud (Claude Sonnet) — Reads the full PR diff with repository context, identifies which changes are architecturally significant vs. routine
- Local (Qwen 2.5 Coder 32B) — Generates inline comments for routine issues (style, naming, missing error handling, test coverage gaps)
- Cloud (Claude Sonnet) — Reviews the architecturally significant changes for design issues, security concerns, and correctness
- Local (Llama 3.1 8B) — Formats all comments into a structured review and posts to GitHub
The cloud model touches roughly 20% of the total tokens. The other 80% — the inline comments, the formatting, the boilerplate review text — runs locally. The total cost per PR review dropped from $0.45 to $0.11, and the quality of architectural feedback actually improved because we could afford to send more context to the cloud model for the parts that mattered.
# cloud_call, local_call, parse_triage, format_changes, and post_to_github are
# thin wrappers around the API clients and GitHub API, defined elsewhere in the pipeline.
async def review_pull_request(pr_diff: str, repo_context: str):
# Step 1: Cloud model triages the changes
triage = await cloud_call(
model="claude-sonnet-4-6-20250514",
prompt=f"""Analyze this PR diff and categorize each change as either
'architectural' (design decisions, security implications, complex logic)
or 'routine' (style, naming, simple bugs, boilerplate).
Return JSON with two arrays of file:line_range pairs.
Diff: {pr_diff}
Context: {repo_context}"""
)
architectural_changes, routine_changes = parse_triage(triage)
# Step 2: Local model handles routine review (bulk of the work)
routine_comments = await local_call(
model="qwen2.5-coder:32b",
prompt=f"""Review these code changes and generate inline comments
for any issues you find. Focus on: style consistency, error handling,
test coverage, naming conventions.
{format_changes(routine_changes)}"""
)
# Step 3: Cloud model does deep review of architectural changes
arch_comments = await cloud_call(
model="claude-sonnet-4-6-20250514",
prompt=f"""Deep review these architectural changes. Look for:
design issues, security vulnerabilities, race conditions,
performance problems, and correctness issues.
{format_changes(architectural_changes)}
Full repo context: {repo_context}"""
)
# Step 4: Local model formats and posts
formatted = await local_call(
model="llama3.1:8b",
prompt=f"""Format these review comments into GitHub PR review format.
Group by file, include line numbers.
{routine_comments}
{arch_comments}"""
)
await post_to_github(formatted)
Common Pitfalls
Do not try to use local models for everything. The whole point of the hybrid approach is playing to each model's strengths. We have seen teams try to save money by running a local 70B model for complex reasoning tasks, only to end up with worse results and higher total cost (because they had to re-run failed generations multiple times).
Do not skip quantization. Running a 32B model at full FP16 precision takes roughly three times the VRAM of Q5 quantization while being only marginally better at most tasks. The sweet spot for most use cases is Q4_K_M or Q5_K_M — these retain 95-98% of the model's capability at roughly a third of the memory footprint.
Do monitor quality. Set up a simple evaluation pipeline that samples outputs from your local models and compares them against cloud model outputs for the same prompts. If quality drifts below your threshold on a task category, move that category back to cloud. The optimal split is not static — it shifts as local models improve.
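A quality monitor does not need to be elaborate. The sketch below samples a fraction of local calls, replays the prompt against the cloud model as a grader, and flags low scores; the sampling rate and threshold are assumptions to tune, cloud_call and local_call are the same helpers used in the pipeline above, and log_quality_sample and alert_routing_owner are hypothetical hooks into whatever metrics and alerting you already run.
import random

SAMPLE_RATE = 0.05     # audit ~5% of local calls (assumption, tune per task)
FAIL_THRESHOLD = 7     # scores below this trigger a review of the routing split

async def audited_local_call(task_type: str, model: str, prompt: str) -> str:
    """Run a local call, occasionally grading the output against the cloud model."""
    output = await local_call(model=model, prompt=prompt)

    if random.random() < SAMPLE_RATE:
        grade = await cloud_call(
            model="claude-sonnet-4-6-20250514",
            prompt=f"""Score this response to the prompt on a 1-10 scale for
correctness and completeness. Reply with only the number.
Prompt: {prompt}
Response: {output}""",
        )
        score = int(grade.strip())
        log_quality_sample(task_type, model, score)   # hypothetical metrics sink
        if score < FAIL_THRESHOLD:
            alert_routing_owner(task_type, score)     # hypothetical alert hook

    return output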
Do keep your models updated. The local model landscape moves fast. Qwen 2.5 Coder was not available a year ago, and it outperforms models that were state-of-the-art then. Check quarterly for new releases and run your evaluation pipeline against them.
The Bottom Line
The hybrid approach is not about replacing cloud AI — it is about using it where it matters. Claude and GPT are remarkable tools for complex reasoning, and trying to replicate that locally is a waste of time. But paying frontier prices for tasks that a quantized 8B model handles perfectly is a waste of money.
Start with Ollama and a single GPU. Route your simplest, highest-volume tasks to a local Llama 3.1 8B. Measure the quality. Then gradually expand — move code generation to Qwen 2.5 Coder, move summarization and extraction tasks local, and keep your cloud budget focused on the work that actually needs frontier intelligence.
The teams that get the best results from AI are not the ones spending the most — they are the ones spending in the right places.