The Agent Control Plane: Frontier Models Plan, Your Kubernetes Fleet Executes

Our hybrid AI playbook made the case for routing individual requests — frontier models for reasoning, local models for execution. That works beautifully when you know at development time which lane a call belongs to. But the most valuable AI workloads are not single calls. They are open-ended goals: "migrate this service off the deprecated API," "audit these 400 documents for PII," "build the CRUD layer for these twelve entities." You cannot pre-route a goal, because at the start you do not yet know which tasks it decomposes into.

That is the orchestration problem. You need something that takes a fuzzy goal, breaks it into concrete tasks, decides which of those tasks need frontier intelligence and which can run on cheap local hardware, dispatches them to the right place, and — critically — lets the agents working those tasks discover and create new tasks as they go. A code-generation task might discover it needs a database schema that does not exist yet. An audit task might find a document that needs human review. The work graph grows as the work happens.

This post is the architecture we deploy for teams at Entuit who have outgrown single-request routing. The core idea is a task-queue control plane: frontier models do the thinking and planning, a durable task ledger holds the work, and a fleet of agents — some calling cloud APIs, most running on your own Kubernetes-hosted models — pulls tasks and executes them.

Why a Task Queue Beats Agents Talking to Each Other

The instinct when building multi-agent systems is to wire agents together directly: the planner calls the coder, the coder calls the reviewer, the reviewer calls back. This is the "conversation" model, and it falls apart in production for the same reasons that chained synchronous microservice calls do.

No durability. If the coder agent crashes mid-task, the whole chain is lost. There is no record of what was in flight.
No backpressure. A planner that emits twenty code-generation tasks at once will hammer your inference layer with twenty concurrent requests, regardless of how many GPUs you have.
No observability. When a goal takes forty minutes and touches thirty agent invocations, "the conversation" is not a debuggable artifact. You cannot see queue depth, retry counts, or cost per task.
No dynamic fan-out. If an agent needs to spawn three follow-up tasks, who does it call? The topology is hardcoded.

The alternative is the blackboard pattern: agents never talk to each other directly. They read from and write to a shared, durable task ledger. The orchestrator appends tasks; workers claim tasks, execute them, write results back, and optionally append new tasks. Every unit of work is a row you can inspect, retry, and replay.

┌──────────────────────────────────────────────────────────────┐
│                          Goal Input                            │
└───────────────────────────────┬────────────────────────────────┘
                                 ▼
                  ┌──────────────────────────────┐
                  │   Orchestrator (Claude Opus)  │
                  │   • Decompose goal → tasks    │
                  │   • Tag each task's lane      │
                  │   • Aggregate results         │
                  │   • Decide when goal is done  │
                  └───────────────┬───────────────┘
                                  │ append / read
                                  ▼
        ┌─────────────────────────────────────────────────┐
        │              Task Ledger (Postgres)              │
        │   id │ type │ lane │ deps │ status │ result      │
        │  ────┼──────┼──────┼──────┼────────┼─────────    │
        │   ▲                                       ▲       │
        └───┼───────────────────────────────────────┼──────┘
            │ claim ready tasks                      │ write results
            ▼                                        │  + spawn tasks
   ┌─────────────────┐                     ┌─────────────────────┐
   │  Dispatcher      │  route by lane     │   Worker Fleet       │
   │  (Redis Streams) │ ─────────────────► │                      │
   └─────────────────┘                     │  ┌───────────────┐   │
                                           │  │ Frontier pool  │   │
                                           │  │ Claude / GPT   │   │
                                           │  └───────────────┘   │
                                           │  ┌───────────────┐   │
                                           │  │ Local pool     │   │
                                           │  │ vLLM + Qwen/   │   │
                                           │  │ Llama (KEDA)   │   │
                                           │  └───────────────┘   │
                                           └─────────────────────┘

The ledger is the source of truth. Redis Streams is the dispatch mechanism that lets workers block-and-wait for ready work without polling Postgres in a tight loop. You could collapse both into one system (Temporal, NATS JetStream, even a single Postgres table with SELECT ... FOR UPDATE SKIP LOCKED), but separating durable state from dispatch keeps each piece simple.

The Task as the Unit of Work

Everything hinges on a well-designed task. A task is a self-contained, idempotent unit that any worker in the right lane can pick up and execute. Here is the schema we use:

import { randomUUID } from "node:crypto";

export type Lane = "frontier" | "local";  // frontier → cloud API, local → k8s model

export type Status =
  | "pending"          // created, dependencies not yet met
  | "ready"            // dependencies satisfied, can be claimed
  | "claimed"          // a worker holds a lease on it
  | "running"
  | "done"
  | "failed"
  | "needs_followup";  // spawned children, waiting on them

export interface Task {
  id: string;                          // uuid
  goalId: string;                      // which top-level goal this belongs to
  type: string;                        // "plan", "generate_code", "extract", ...
  lane: Lane;
  payload: Record<string, unknown>;    // everything the worker needs
  dependsOn: string[];                 // task ids
  status: Status;
  result: Record<string, unknown> | null;
  parentId: string | null;             // task that spawned this one
  depth: number;                       // how deep in the spawn tree
  attempts: number;
  leaseExpiresAt: number | null;
}

// Factory with sane defaults — caller supplies the essentials.
export function newTask(
  init: Pick<Task, "goalId" | "type" | "lane" | "payload"> & Partial<Task>,
): Task {
  return {
    id: randomUUID(),
    dependsOn: [],
    status: "pending",
    result: null,
    parentId: null,
    depth: 0,
    attempts: 0,
    leaseExpiresAt: null,
    ...init,
  };
}

Three fields do the heavy lifting for orchestration:

lane — set at creation time. This is the explicit routing decision. As we argued in the hybrid post, explicit routing beats dynamic classification. The agent creating the task knows whether it is "design the migration strategy" (frontier) or "rewrite this function to the new API signature" (local). Encode that knowledge when you have it.
dependsOn — turns a flat queue into a DAG. A task becomes ready only when every task in dependsOn is done. This is how you express "review can't start until codegen finishes."
depth and parentId — the guardrail against runaway recursion. Every spawned task carries its parent's depth + 1. When you let agents create tasks, you must cap how deep the tree can go.
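The readiness rule behind dependsOn can be sketched as a small pure function. This is an illustrative in-memory version, not the production ledger query; the names TaskNode and findNewlyReady are made up for the example, and a real ledger would run the equivalent check inside Postgres.

```typescript
type Status =
  | "pending" | "ready" | "claimed" | "running"
  | "done" | "failed" | "needs_followup";

// Trimmed-down task shape (hypothetical): only the fields the rule needs.
interface TaskNode {
  id: string;
  dependsOn: string[];
  status: Status;
}

// Returns the ids of pending tasks whose dependencies are all done.
// A dependency id that is missing from the list counts as "not done".
function findNewlyReady(tasks: TaskNode[]): string[] {
  const done = new Set(
    tasks.filter((t) => t.status === "done").map((t) => t.id),
  );
  return tasks
    .filter((t) => t.status === "pending" && t.dependsOn.every((d) => done.has(d)))
    .map((t) => t.id);
}

const graph: TaskNode[] = [
  { id: "schema", dependsOn: [], status: "done" },
  { id: "codegen", dependsOn: ["schema"], status: "pending" },
  { id: "review", dependsOn: ["codegen"], status: "pending" },
];

console.log(findNewlyReady(graph)); // only "codegen" qualifies
```

Running a check like this each time a task completes is what lets review wait on codegen without either worker knowing about the other.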

The Orchestrator: Decompose, Then Get Out of the Way

The orchestrator is the only component that always runs on a frontier model, and it touches the smallest fraction of total tokens. Its job is decomposition and aggregation — the parts that genuinely need broad reasoning — not execution.

import Anthropic from "@anthropic-ai/sdk";

const cloud = new Anthropic();

const ORCHESTRATOR_SYSTEM = `You are an orchestrator. Given a goal, decompose it
into concrete, independently-executable tasks. For each task, decide its lane:

- "frontier": needs deep reasoning, ambiguity resolution, architectural
  judgment, or broad world knowledge. Expensive. Use sparingly.
- "local": a well-defined task a junior engineer could do from a clear spec —
  code generation, extraction, summarization, format conversion, boilerplate.

Express dependencies between tasks. Output JSON: a list of tasks, each with
{ "type", "lane", "payload", "dependsOn" (indices into this list) }.

Prefer many small local tasks over few large frontier ones. The goal is to keep
frontier usage to planning and judgment; push volume to local.`;

async function decompose(goal: string, context: string): Promise<Task[]> {
  const response = await cloud.messages.create({
    model: "claude-opus-4-7",
    max_tokens: 4096,
    system: ORCHESTRATOR_SYSTEM,
    messages: [{ role: "user", content: `Goal: ${goal}\n\nContext: ${context}` }],
  });
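  // Take the first text block rather than assuming it sits at index 0.
  const block = response.content.find((b) => b.type === "text");
  const text = block?.type === "text" ? block.text : "";
  // parseJson, buildDag, and newGoalId are helpers not shown in this post.
  return buildDag(parseJson(text), newGoalId());
}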

After decomposition, the orchestrator does not babysit execution. It subscribes to goal completion events and wakes up only when (a) all leaf tasks for a goal are done, or (b) a task lands in needs_followup and requires a frontier-level decision about what to do next. This is the key cost lever: the orchestrator is invoked a handful of times per goal, while the worker fleet runs hundreds of tasks. Your most expensive model is on the critical path the least.
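The decompose() call above leans on a buildDag helper that the post does not show. One plausible shape for it is sketched below, under two assumptions that are mine rather than the post's: the model lists tasks in dependency order (so a task may only reference earlier indices), and every task starts as pending. The PlannedTask and DagTask types are illustrative.

```typescript
import { randomUUID } from "node:crypto";

type Lane = "frontier" | "local";

// What the orchestrator prompt asks the model to emit:
// dependsOn holds indices into the same list.
interface PlannedTask {
  type: string;
  lane: Lane;
  payload: Record<string, unknown>;
  dependsOn: number[];
}

// Ledger-ready shape (hypothetical subset): dependsOn holds task ids.
interface DagTask {
  id: string;
  goalId: string;
  type: string;
  lane: Lane;
  payload: Record<string, unknown>;
  dependsOn: string[];
  status: "pending";
}

function buildDag(plan: PlannedTask[], goalId: string): DagTask[] {
  // Assign every planned task an id up front so indices can be translated.
  const ids = plan.map(() => randomUUID());
  return plan.map((p, i): DagTask => {
    for (const dep of p.dependsOn) {
      // Allowing only earlier indices rejects out-of-range references,
      // self-references, and cycles with a single check.
      if (!Number.isInteger(dep) || dep < 0 || dep >= i) {
        throw new Error(`task ${i} (${p.type}) has invalid dependency index ${dep}`);
      }
    }
    return {
      id: ids[i],
      goalId,
      type: p.type,
      lane: p.lane,
      payload: p.payload,
      dependsOn: p.dependsOn.map((d) => ids[d]),
      status: "pending",
    };
  });
}

const plan: PlannedTask[] = [
  { type: "design_schema", lane: "frontier", payload: {}, dependsOn: [] },
  { type: "generate_code", lane: "local", payload: { entity: "user" }, dependsOn: [0] },
];
const dag = buildDag(plan, "goal-1");
console.log(dag[1].dependsOn[0] === dag[0].id); // true
```

Rejecting a malformed plan loudly is deliberate: model output is untrusted input, and a bad index caught here is cheaper than a task that silently never becomes ready.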

Dynamic Task Creation: Letting the Fleet Grow the Graph

This is the part that makes the system feel like a team of agents rather than a static pipeline. A worker, mid-task, can append new tasks to the ledger. The classic case: a code-generation task discovers it needs something that does not exist yet.

interface TaskResult {
  status: Status;
  result?: Record<string, unknown>;
  spawned?: Task[];
  requeueWithDeps?: string[];
}

async function runGenerateCode(task: Task): Promise<TaskResult> {
  const spec = task.payload.spec as string;
  // The local model attempts the task and reports what it's missing.
  const output = await localCall({
    model: "qwen2.5-coder:32b",
    prompt: `Implement this spec. If you need a type, schema, or interface
that is not provided, do NOT invent it. Instead respond with:
NEED: <one-line description of the missing dependency>

Spec: ${spec}
Available context: ${task.payload.context ?? "none"}`,
  });

  const trimmed = output.trim();   // local models often emit leading whitespace
  if (trimmed.startsWith("NEED:")) {
    // Spawn a follow-up task and pause this one.
    const missing = trimmed.slice("NEED:".length).trim();
    const child = newTask({
      goalId: task.goalId,
      type: "design_schema",
      lane: "frontier",            // schema design needs judgment
      payload: { requirement: missing },
      parentId: task.id,
      depth: task.depth + 1,
    });
    return {
      status: "needs_followup",
      spawned: [child],
      // re-queue self, now depending on the child
      requeueWithDeps: [child.id],
    };
  }

  return { status: "done", result: { code: output } };
}

Notice what happened: a local worker, hitting a wall, escalated by creating a frontier task. The local model is cheap enough that "try it, and tell me what you're missing" is a reasonable first move. When it succeeds, you paid pennies. When it needs help, it routes the hard sub-problem to the expensive lane — and only that sub-problem.
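How the ledger applies requeueWithDeps is not shown in the post. One possible interpretation, sketched in memory: the paused task keeps its needs_followup status, gains the child ids as extra dependencies, and becomes eligible to resume once all of them are done. The names LedgerRow, applyFollowup, and canResume are hypothetical.

```typescript
type Status =
  | "pending" | "ready" | "claimed" | "running"
  | "done" | "failed" | "needs_followup";

// Hypothetical subset of a ledger row.
interface LedgerRow {
  id: string;
  dependsOn: string[];
  status: Status;
}

interface FollowupOutcome {
  status: Status;
  requeueWithDeps?: string[];
}

// Record the handler's outcome on the task; returns a copy, never mutates.
function applyFollowup(row: LedgerRow, outcome: FollowupOutcome): LedgerRow {
  if (outcome.status !== "needs_followup") {
    return { ...row, status: outcome.status };
  }
  // Merge old and new dependencies, dropping duplicates.
  const deps = Array.from(
    new Set([...row.dependsOn, ...(outcome.requeueWithDeps ?? [])]),
  );
  return { ...row, dependsOn: deps, status: "needs_followup" };
}

// A paused task may resume once every dependency, old and new, is done.
function canResume(row: LedgerRow, doneIds: Set<string>): boolean {
  return row.status === "needs_followup" && row.dependsOn.every((d) => doneIds.has(d));
}

const paused = applyFollowup(
  { id: "codegen-7", dependsOn: ["schema"], status: "running" },
  { status: "needs_followup", requeueWithDeps: ["child-1"] },
);
console.log(canResume(paused, new Set(["schema"])));            // false
console.log(canResume(paused, new Set(["schema", "child-1"]))); // true
```

When the task resumes, its handler would presumably read the child's result from the ledger and pass it in as context, so the second attempt has the schema the first one was missing.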

This is also where you must be disciplined about guardrails, because dynamic task creation is how multi-agent systems rack up surprise bills and infinite loops:

const MAX_DEPTH = 5;             // how deep the spawn tree can go
const MAX_TASKS_PER_GOAL = 200;  // total task budget for one goal
const MAX_TOKENS_PER_GOAL = 5_000_000;

function admitTask(task: Task, goalState: GoalState): boolean {
  if (task.depth > MAX_DEPTH) {
    throw new GuardrailBreach(`task ${task.id} exceeds max depth`);
  }
  if (goalState.taskCount >= MAX_TASKS_PER_GOAL) {
    throw new GuardrailBreach(`goal ${task.goalId} hit task ceiling`);
  }
  if (goalState.tokensSpent >= MAX_TOKENS_PER_GOAL) {
    throw new GuardrailBreach(`goal ${task.goalId} hit token budget`);
  }
  return true;
}

A goal that breaches a guardrail does not silently spin — it halts and escalates to the orchestrator (or a human) with the partial work intact. The durable ledger means nothing is lost; you can inspect the tree, raise the ceiling, and resume.
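A minimal sketch of the halt-and-escalate behaviour, assuming an in-memory goal state. GuardrailBreach is referenced but never defined in the snippet above, so the class here is a guess at its shape; Limits, Child, and spawnOrHalt are likewise illustrative names.

```typescript
// Hypothetical error type for a breached ceiling.
class GuardrailBreach extends Error {
  constructor(message: string) {
    super(message);
    this.name = "GuardrailBreach";
  }
}

interface Limits { maxDepth: number; maxTasks: number; maxTokens: number; }
interface Child { id: string; goalId: string; depth: number; }
interface GoalState { taskCount: number; tokensSpent: number; haltedReason: string | null; }

function checkLimits(child: Child, state: GoalState, limits: Limits): void {
  if (child.depth > limits.maxDepth) {
    throw new GuardrailBreach(`task ${child.id} exceeds max depth`);
  }
  if (state.taskCount >= limits.maxTasks) {
    throw new GuardrailBreach(`goal ${child.goalId} hit task ceiling`);
  }
  if (state.tokensSpent >= limits.maxTokens) {
    throw new GuardrailBreach(`goal ${child.goalId} hit token budget`);
  }
}

// Admit children one at a time. On the first breach, stop spawning, keep
// everything admitted so far, and record why the goal halted.
function spawnOrHalt(
  children: Child[],
  state: GoalState,
  limits: Limits,
): { admitted: Child[]; state: GoalState } {
  let next: GoalState = { ...state };
  const admitted: Child[] = [];
  for (const child of children) {
    try {
      checkLimits(child, next, limits);
    } catch (err) {
      if (!(err instanceof GuardrailBreach)) throw err; // unrelated failure
      next = { ...next, haltedReason: err.message };
      break;
    }
    admitted.push(child);
    next = { ...next, taskCount: next.taskCount + 1 };
  }
  return { admitted, state: next };
}

const outcome = spawnOrHalt(
  [
    { id: "a", goalId: "g1", depth: 1 },
    { id: "b", goalId: "g1", depth: 1 },
    { id: "c", goalId: "g1", depth: 1 },
  ],
  { taskCount: 1, tokensSpent: 0, haltedReason: null },
  { maxDepth: 5, maxTasks: 2, maxTokens: 5_000_000 },
);
console.log(outcome.admitted.length, outcome.state.haltedReason);
// 1 "goal g1 hit task ceiling"
```

Counting each admitted child before checking the next one matters: a single worker that spawns a large batch cannot slip past the ceiling just because the count was read once at the start. With several workers spawning concurrently, the same check-and-increment would need to happen in one database transaction.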

The Dispatcher: Routing by Lane

The dispatcher's job is mechanical: find tasks that are READY (all dependencies DONE), and hand them to the correct worker pool based on lane. No intelligence required — the routing decision was already made when the task was created.

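async function dispatchLoop(): Promise<void> {
  while (true) {
    const ready = await ledger.claimReadyTasks({
      limit: BATCH_SIZE,
      leaseSeconds: 300,   // worker must finish or renew within 5 min
    });
    if (ready.length === 0) {
      // Nothing to dispatch: back off briefly instead of hammering Postgres.
      // 500 ms is an arbitrary starting point; tune it for your latency needs.
      await new Promise((resolve) => setTimeout(resolve, 500));
      continue;
    }
    for (const task of ready) {
      const stream = task.lane === "frontier" ? "tasks.frontier" : "tasks.local";
      await redis.xAdd(stream, "*", { taskId: task.id });
    }
  }
}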

Two Redis Streams, one per lane, give you independent backpressure. The frontier pool is a fixed, small set of workers bounded by your cloud API rate limits and budget. The local pool is where the volume lives — and where Kubernetes earns its keep.

claimReadyTasks uses a lease (visibility timeout). A worker that crashes without acking has its lease expire, and the task returns to ready for another worker. Combine that with idempotent task design and at-least-once delivery becomes safe.
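The lease mechanics can be sketched in memory. This is a stand-in for what claimReadyTasks might do, not its actual implementation; in Postgres the same logic would be a single UPDATE over rows selected with FOR UPDATE SKIP LOCKED. The Leased type and claimReady name are illustrative.

```typescript
type Status =
  | "pending" | "ready" | "claimed" | "running"
  | "done" | "failed" | "needs_followup";

// Hypothetical subset of a task row, with the lease stored as epoch millis.
interface Leased {
  id: string;
  status: Status;
  attempts: number;
  leaseExpiresAt: number | null;
}

// Reclaims tasks whose lease has lapsed, then claims up to `limit` ready
// tasks. Mutates the rows in place, the way an UPDATE would.
function claimReady(
  tasks: Leased[],
  nowMs: number,
  leaseSeconds: number,
  limit: number,
): Leased[] {
  const claimed: Leased[] = [];
  for (const t of tasks) {
    const inFlight = t.status === "claimed" || t.status === "running";
    const expired = inFlight && t.leaseExpiresAt !== null && t.leaseExpiresAt <= nowMs;
    if (expired) t.status = "ready"; // holder is presumed dead
    if (t.status === "ready" && claimed.length < limit) {
      t.status = "claimed";
      t.attempts += 1;
      t.leaseExpiresAt = nowMs + leaseSeconds * 1000;
      claimed.push(t);
    }
  }
  return claimed;
}

const now = 1_000_000;
const rows: Leased[] = [
  { id: "fresh", status: "ready", attempts: 0, leaseExpiresAt: null },
  { id: "abandoned", status: "running", attempts: 1, leaseExpiresAt: now - 1 },
  { id: "active", status: "running", attempts: 1, leaseExpiresAt: now + 60_000 },
  { id: "blocked", status: "pending", attempts: 0, leaseExpiresAt: null },
];
console.log(claimReady(rows, now, 300, 10).map((t) => t.id)); // ["fresh", "abandoned"]
```

The attempts counter is what lets you stop retrying a task that keeps killing its worker: once it crosses a threshold, mark the task failed instead of returning it to ready.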

The Local Worker Fleet on Kubernetes

The local pool is a Deployment of generic "agent runner" pods. Each pod blocks on the tasks.local Redis Stream, claims a task, executes it against the in-cluster vLLM Service, writes the result, and acks. The container is model-agnostic — it loads the right prompt template and local model name from the task type.

The reason this lives on Kubernetes rather than a static pool is that task volume is bursty. A goal sits idle, then the orchestrator decomposes it into 80 code-generation tasks all at once. You want the worker fleet to scale with queue depth, not pay for idle GPUs. KEDA does this natively by scaling on Redis Stream length:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: local-agent-workers
  namespace: agents
spec:
  scaleTargetRef:
    name: local-agent-worker        # the Deployment
  minReplicaCount: 1                 # keep one warm to avoid cold-start
  maxReplicaCount: 12                # bounded by GPU node pool capacity
  cooldownPeriod: 120
  triggers:
    - type: redis-streams
      metadata:
        address: redis.agents.svc.cluster.local:6379
        stream: tasks.local
        consumerGroup: local-workers
        lagCount: "5"                # scale up when >5 undelivered tasks per replica (Redis 7+);
                                     # pendingEntriesCount only counts delivered-but-unacked entries

Each worker pod requests a GPU (or a slice of one via time-slicing / MIG), and the pods land on your GPU node pool. When the queue drains, KEDA scales back to one warm replica. This is the same autoscaling philosophy from our self-hosting LLMs guide, applied to the agent layer rather than the raw inference layer.
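To get a feel for how the manifest's numbers interact, here is a rough approximation of the replica count the autoscaler would aim for. It treats the target as backlog divided by the per-replica threshold, clamped to the min and max replica counts; the real Horizontal Pod Autoscaler also applies tolerances and stabilization windows, so read this as back-of-envelope arithmetic rather than KEDA's exact behaviour.

```typescript
// Approximate target: one replica per `targetPerReplica` backlog items,
// never below `min` or above `max`.
function desiredReplicas(
  backlog: number,
  targetPerReplica: number,
  min: number,
  max: number,
): number {
  const raw = Math.ceil(backlog / targetPerReplica);
  return Math.min(max, Math.max(min, raw));
}

// Threshold 5, minReplicaCount 1, maxReplicaCount 12, as in the manifest.
console.log(desiredReplicas(80, 5, 1, 12)); // 12  (ceil(80/5) = 16, capped at max)
console.log(desiredReplicas(23, 5, 1, 12)); // 5
console.log(desiredReplicas(0, 5, 1, 12));  // 1   (the warm replica)
```

The 80-task burst is the interesting case: the fleet tops out at 12 pods, so the remaining backlog simply waits in the stream. That is the backpressure the direct-call model lacks.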

// The local worker pod's main loop (simplified)
async function workerLoop(consumerName: string): Promise<void> {
  while (true) {
    const streams = await redis.xReadGroup(
      "local-workers", consumerName,
      [{ key: "tasks.local", id: ">" }],
      { COUNT: 1, BLOCK: 30_000 },
    );
    if (!streams?.length) continue;

    const { id: msgId, message } = streams[0].messages[0];
    const task = await ledger.load(message.taskId);
    await ledger.mark(task.id, "running");
    try {
      const result = await HANDLERS[task.type](task);    // dispatch by type
      for (const child of result.spawned ?? []) {         // dynamic task creation
        if (admitTask(child, await goalState(task.goalId))) {
          await ledger.append(child);
        }
      }
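      await ledger.complete(task, result);
    } catch (err) {
      await ledger.fail(task, err);                       // lease expiry → retry
    } finally {
      // Ack either way: the ledger, not the stream, drives retries, so an
      // unacked message would only linger in the consumer group's pending list.
      await redis.xAck("tasks.local", "local-workers", msgId);
    }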
  }
}

Putting It Together: A Real Goal

Take a concrete goal: "Add structured audit logging to every write endpoint in this Go service." Watch where each lane gets used.

Orchestrator (Claude Opus) — reads the service, identifies 23 write endpoints, and decomposes into: one frontier task to design the audit log schema and middleware contract, then 23 local tasks (one per endpoint) to wire in the logging, plus one local task to generate a migration, plus one frontier task to review the whole diff for consistency.
Frontier worker (Claude Sonnet) — designs the AuditEntry schema and middleware interface. Result lands in the ledger; the 23 endpoint tasks depend on it, so they were PENDING until now and flip to READY.
Local pool (Qwen 2.5 Coder 32B, KEDA scales to 8 pods) — 23 endpoint tasks fan out across the worker fleet. Three of them hit unusual patterns and emit NEED: follow-ups; those spawn small frontier tasks that resolve the ambiguity, then the local tasks resume.
Local worker — generates the database migration from the finalized schema.
Frontier worker (Claude Sonnet) — reviews the assembled diff once all 23 are DONE, flags two inconsistencies, and spawns two local fix-up tasks.
Orchestrator — wakes on goal completion, summarizes the change set, and marks the goal DONE.

Frontier models handled 4 planning and review steps (decomposition, schema design, diff review, and the final summary) plus the three small escalations. The local fleet ran 26 tasks: 23 endpoints, one migration, and two fix-ups. Roughly 15% of tokens went to the cloud; 85% ran on hardware we already owned. The total cloud cost for the goal was under a dollar, versus an estimated $6-8 had every task run on a frontier model. Wall-clock time was also shorter, because the 23 endpoint tasks ran in parallel across the worker fleet instead of sequentially through one expensive agent.

Running It on Your Laptop

You do not need a GPU node pool or a Kubernetes cluster to get a feel for this. The entire control plane runs on a laptop: Ollama for the local model, Postgres and Redis in Docker, and the orchestrator and worker as plain Node.js processes. The frontier lane calls the Anthropic API. This is the minimal version of the architecture above — same task ledger, same Redis Streams, same lane routing — just without KEDA doing the scaling for you.

Prerequisites

Docker (for Postgres and Redis)
Node.js 20+
Ollama for the local model
An Anthropic API key for the frontier lane

Step 1 — Pull a local model

# Install Ollama (Linux), then pull a coding model for the local lane
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5-coder:7b   # 7B fits comfortably on a laptop;
                                # bump to :32b if you have a 24GB GPU

# Sanity check that the model loads and answers
# (Ollama also serves an OpenAI-compatible API on :11434)
ollama run qwen2.5-coder:7b "write a hello world in go"

On a laptop without a discrete GPU, the 7B model runs on CPU/integrated graphics at a usable speed for testing. Quality is lower than the 32B class from the examples above, but the orchestration mechanics are identical.

Step 2 — Start Postgres and Redis

Drop this docker-compose.yml in a working directory:

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: agents
      POSTGRES_PASSWORD: agents
      POSTGRES_DB: ledger
    ports: ["5432:5432"]
  redis:
    image: redis:7
    ports: ["6379:6379"]

docker compose up -d

Step 3 — Create the task ledger

docker compose exec -T postgres psql -U agents -d ledger <<'SQL'
CREATE TABLE tasks (
  id              UUID PRIMARY KEY,
  goal_id         UUID NOT NULL,
  type            TEXT NOT NULL,
  lane            TEXT NOT NULL,           -- 'frontier' | 'local'
  payload         JSONB NOT NULL,
  depends_on      UUID[] DEFAULT '{}',
  status          TEXT NOT NULL DEFAULT 'pending',
  result          JSONB,
  parent_id       UUID,
  depth           INT NOT NULL DEFAULT 0,
  attempts        INT NOT NULL DEFAULT 0,
  lease_expires_at TIMESTAMPTZ
);
CREATE INDEX idx_tasks_status ON tasks (status);
CREATE INDEX idx_tasks_goal ON tasks (goal_id);

CREATE TABLE goals (
  id           UUID PRIMARY KEY,
  task_count   INT NOT NULL DEFAULT 0,
  tokens_spent BIGINT NOT NULL DEFAULT 0
);
SQL

Step 4 — Set up the Node project

npm init -y
npm install @anthropic-ai/sdk openai redis pg
npm install -D tsx typescript @types/node @types/pg

export ANTHROPIC_API_KEY="sk-ant-..."        # frontier lane
export OLLAMA_BASE_URL="http://localhost:11434/v1"
export DATABASE_URL="postgresql://agents:agents@localhost:5432/ledger"
export REDIS_URL="redis://localhost:6379"

The frontier client is new Anthropic(), as in the orchestrator snippet, and the local client is an OpenAI client pointed at OLLAMA_BASE_URL (Ollama exposes an OpenAI-compatible API). The dispatcher publishes to two streams (tasks.frontier, tasks.local); on a laptop a single worker process can drain both lanes, standing in for the KEDA-scaled Deployment.
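The localCall helper used in the code-generation handler is never defined in the post. A dependency-free sketch is below; it swaps the openai package for Node's built-in fetch against Ollama's OpenAI-compatible chat completions endpoint, so the request and response shapes follow that API. The buildChatRequest split is my own, there only to keep the request logic inspectable without a running model.

```typescript
interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Pure helper: turns a prompt into an OpenAI-style chat completion request.
function buildChatRequest(baseUrl: string, model: string, prompt: string): ChatRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        stream: false,
      }),
    },
  };
}

// Hypothetical stand-in for the post's localCall. The default base URL
// matches the OLLAMA_BASE_URL value exported in Step 4.
async function localCall(
  args: { model: string; prompt: string },
  baseUrl: string = "http://localhost:11434/v1",
): Promise<string> {
  const { url, init } = buildChatRequest(baseUrl, args.model, args.prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`local model returned HTTP ${res.status}`);
  const data = (await res.json()) as {
    choices?: { message?: { content?: string } }[];
  };
  // An empty string signals "no usable output"; callers decide how to react.
  return data.choices?.[0]?.message?.content ?? "";
}

const example = buildChatRequest("http://localhost:11434/v1/", "qwen2.5-coder:7b", "hi");
console.log(example.url); // http://localhost:11434/v1/chat/completions
```

Because the endpoint shape is the same, pointing baseUrl at an in-cluster vLLM Service instead of Ollama should need no code changes beyond the model name.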

Step 5 — Run the worker and the orchestrator

Open two terminals (both with the env vars exported):

# Terminal 1 — a worker that drains both lanes (one process is fine locally)
npx tsx worker.ts --lanes frontier,local --consumer laptop-1

# Terminal 2 — submit a goal; the orchestrator decomposes and seeds the ledger
npx tsx orchestrate.ts --goal "Generate a TypeScript CLI that converts CSV to JSON, \
  with argument parsing, error handling, and a vitest suite"

orchestrate.ts calls decompose() (the orchestrator from earlier), writes the resulting tasks to the ledger, and exits. The worker picks up READY tasks, routes each to Ollama or the Anthropic API by its lane, writes results back, and spawns follow-ups as needed. Watch the work happen:

# Live view of the task graph filling in
watch -n1 'docker compose exec -T postgres psql -U agents -d ledger \
  -c "SELECT type, lane, status FROM tasks ORDER BY depth, type;"'

When every leaf task is done, the orchestrator's completion handler assembles the final output. For this goal you end up with the CLI scaffold and the planning/review steps done by Claude, and the bulk of the code generated locally by Qwen — the same lane split as the production system, running entirely on your machine for the cost of a few cents of Anthropic tokens.

Step 6 (optional) — Mirror production with a local cluster

To exercise the KEDA autoscaling path without leaving your laptop, run a local cluster with k3d or kind, install KEDA, and deploy the worker as a Deployment with the ScaledObject from earlier pointed at your in-cluster Redis. You will not see meaningful scaling on a single machine, but it validates the manifests before you push them to a real GPU cluster. For the full cluster-side setup, see our self-hosting LLMs on Kubernetes guide.

Observability Is Not Optional Here

A multi-agent system that you cannot see into is a liability. Every task should emit a trace span, and every span should carry goal_id, task_type, lane, model, token counts, and cost. With the task ledger as your backbone, you get this almost for free — each task row is already a structured record. Wire it into the stack from our LLM observability post and you can answer the questions that matter:

Cost per goal, broken down by lane. If frontier spend creeps above ~20% of the total, your orchestrator is over-decomposing into frontier tasks — tighten the system prompt.
Queue depth and KEDA replica count over time. Confirms the local fleet scales with load and drains afterward.
Spawn-tree depth distribution. A long tail of deep trees means agents are escalating too often — usually a sign your local model lacks context, not capability.
Task retry and failure rates by type. Tells you which handlers need better prompts or belong in a different lane.
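The first of those questions, cost per goal by lane, reduces to a small aggregation over span records. The sketch below assumes each span carries the fields listed above; the TaskSpan shape and the sample numbers are made up for illustration, not measurements.

```typescript
type Lane = "frontier" | "local";

// Hypothetical span record with the attributes named in the text.
interface TaskSpan {
  goalId: string;
  taskType: string;
  lane: Lane;
  model: string;
  tokens: number;
  costUsd: number;
}

// Sum cost per lane for one goal and report the frontier share of spend.
function costByLane(spans: TaskSpan[], goalId: string) {
  const totals: Record<Lane, number> = { frontier: 0, local: 0 };
  for (const s of spans) {
    if (s.goalId === goalId) totals[s.lane] += s.costUsd;
  }
  const total = totals.frontier + totals.local;
  return { ...totals, frontierShare: total === 0 ? 0 : totals.frontier / total };
}

const spans: TaskSpan[] = [
  { goalId: "g1", taskType: "plan", lane: "frontier", model: "opus", tokens: 4000, costUsd: 0.25 },
  { goalId: "g1", taskType: "generate_code", lane: "local", model: "qwen", tokens: 9000, costUsd: 0.5 },
  { goalId: "g1", taskType: "generate_code", lane: "local", model: "qwen", tokens: 5000, costUsd: 0.25 },
  { goalId: "g2", taskType: "plan", lane: "frontier", model: "opus", tokens: 3000, costUsd: 0.5 },
];
console.log(costByLane(spans, "g1")); // { frontier: 0.25, local: 0.75, frontierShare: 0.25 }
```

A frontierShare of 0.25 here would already be above the ~20% alert line, which is the cue to look at how the orchestrator is tagging lanes.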

Common Pitfalls

Do not let the orchestrator stay on the critical path. The most common mistake is building an orchestrator that supervises every task synchronously. That puts your most expensive model in the loop for the cheapest work and serializes everything. The orchestrator plans, then sleeps, then aggregates. Workers run independently.

Do not skip idempotency. At-least-once delivery means a task can run twice (lease expiry, retry after a crash). If a task has side effects — writing a file, posting a comment, calling an external API — make it idempotent or guard it with a dedupe key. The ledger's task id is a natural key.

Do not route ambiguity to local models and hope. The NEED: escalation pattern works because the local model is instructed to ask rather than invent. A local model that hallucinates a missing schema instead of escalating will quietly corrupt the whole goal. Bake the "escalate when uncertain" instruction into every local handler prompt.

Do cap everything. Depth, task count per goal, token budget, lease duration, max replicas. Dynamic task creation without ceilings is how a single malformed goal spawns ten thousand tasks overnight. Every guardrail should halt-and-escalate, never silently drop work.

Do start with explicit lanes before reaching for a classifier. It is tempting to build an ML model that decides each task's lane. Resist it until you have data. Explicit lane tagging by the creating agent is deterministic, debuggable, and good enough for the vast majority of workloads.
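The idempotency pitfall is the easiest of these to show in code. A minimal in-memory sketch of a dedupe-key guard follows: the side effect runs at most once per key, and a redelivered task gets the recorded result instead. The runOnce name and the key format are illustrative; across multiple workers the Map would have to be a shared store with a uniqueness guarantee, such as a Postgres table with a unique constraint on the key.

```typescript
// Run `effect` only if `dedupeKey` has not been seen; otherwise return the
// result recorded the first time.
function runOnce<T>(seen: Map<string, T>, dedupeKey: string, effect: () => T): T {
  if (seen.has(dedupeKey)) return seen.get(dedupeKey) as T;
  const value = effect();
  seen.set(dedupeKey, value);
  return value;
}

let commentsPosted = 0;
const seen = new Map<string, string>();

// Stand-in for a real side effect such as posting a review comment.
const postComment = (): string => {
  commentsPosted += 1;
  return `comment-${commentsPosted}`;
};

// The task id plus the action name makes a natural dedupe key.
const key = "task-42:post_comment";
const first = runOnce(seen, key, postComment);
const second = runOnce(seen, key, postComment); // simulated redelivery

console.log(first === second, commentsPosted); // true 1
```

Note the gap this sketch leaves open: if the worker dies after the effect runs but before the key is recorded, a retry will repeat it. Closing that gap needs either an effect that is idempotent on its own or a key the external system enforces.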

The Bottom Line

Single-request hybrid routing gets you cost savings on individual calls. An agent control plane gets you something bigger: the ability to throw a fuzzy, open-ended goal at the system and have it decompose, fan out, self-correct, and finish — using frontier intelligence only where judgment is genuinely required, and your own Kubernetes fleet for everything else.

The architecture is not exotic. A durable task ledger, a couple of Redis Streams, a frontier orchestrator that plans and aggregates, and a KEDA-scaled pool of local worker pods. The hard part is discipline: keep the expensive model off the hot path, make tasks idempotent, tag lanes explicitly, and cap the recursion. Get those right and you have a system where the marginal cost of "do more work" trends toward the cost of electricity — which is exactly where you want it.

Start small. Take one multi-step workflow you already run through a single frontier agent, model it as a goal with a handful of tasks, and split the obvious execution work onto a local worker. Measure the cost and the wall-clock time. Then let the graph grow.