Building a Hybrid LLM Platform on EKS, Part 8: Testing, Load, and Examples
In Part 7 we wired the observability stack: OTel traces through the TypeScript router, GPU and vLLM metrics in Grafana, and per-request cost data in Langfuse. The platform is fully instrumented. This final part validates it.
Part 8 does three things. It writes a Vitest integration suite that exercises every routing path — explicit local, explicit cloud, auto heuristics, streaming, and the cold-start fallback. It runs a k6 load test that ramps the platform to realistic concurrency so you can watch KEDA scale vLLM replicas, Karpenter provision GPU nodes, and the Grafana dashboards from Part 7 light up with real traffic. And it builds three TypeScript workload scripts that demonstrate the practical value of the hybrid approach: batch classification on the local model, long-document summarization with automatic backend selection, and multi-step planning on Claude.
Test Directory Layout
All test and workload code lives in a tests/ directory at the repo root, separate from the CDK infrastructure code.
tests/
├── integration/
│ ├── router.test.ts
│ └── vitest.config.ts
├── load/
│ └── k6-load.ts
├── workloads/
│ ├── classify.ts
│ ├── summarize.ts
│ └── plan.ts
└── package.json
tests/package.json
{
"name": "hybrid-llm-tests",
"type": "module",
"scripts": {
"test": "vitest run",
"test:integration": "vitest run tests/integration",
"test:watch": "vitest",
"load": "k6 run --out json=results.json tests/load/k6-load.ts"
},
"dependencies": {
"openai": "^4.67.0"
},
"devDependencies": {
"@types/k6": "^0.54.0",
"@types/node": "^22.0.0",
"typescript": "^5.6.0",
"vitest": "^2.1.0"
}
}
The openai package is a runtime dependency, not dev-only — the workload scripts import it at runtime to talk to the router. Because our router speaks the OpenAI wire format, the same SDK that would talk to api.openai.com can target the platform with only a baseURL change.
tests/integration/vitest.config.ts
import { defineConfig } from "vitest/config";
export default defineConfig({
test: {
globals: true,
testTimeout: 30_000, // inference requests can be slow — allow 30s per test
hookTimeout: 10_000,
},
});
Integration Tests
The integration suite runs against the live router. Point it at the local port-forward or the ALB via the ROUTER_URL environment variable.
// tests/integration/router.test.ts
import { describe, it, expect } from "vitest";
import OpenAI from "openai";
const ROUTER_URL = process.env.ROUTER_URL ?? "http://localhost:8080";
// All requests go through our router, not directly to OpenAI.
// The router accepts the same API surface and routes to vLLM or Claude.
const client = new OpenAI({
baseURL: `${ROUTER_URL}/v1`,
apiKey: "unused",
});
// ── Health ────────────────────────────────────────────────────────────────────
describe("health", () => {
it("returns 200", async () => {
const res = await fetch(`${ROUTER_URL}/health`);
expect(res.status).toBe(200);
const body = await res.json();
expect(body).toEqual({ status: "ok" });
});
});
// ── Explicit routing ──────────────────────────────────────────────────────────
describe("explicit routing", () => {
it("routes model='local' to vLLM and returns a completion", async () => {
const res = await fetch(`${ROUTER_URL}/v1/chat/completions`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "local",
messages: [{ role: "user", content: "Reply with the single word: pong" }],
max_tokens: 10,
}),
});
expect(res.ok).toBe(true);
expect(res.headers.get("x-router-backend")).toBe("local");
const body = await res.json<{ choices: Array<{ message: { content: string } }> }>();
expect(body.choices[0].message.content.trim().length).toBeGreaterThan(0);
});
it("routes model='cloud' to Anthropic and returns a completion", async () => {
const res = await fetch(`${ROUTER_URL}/v1/chat/completions`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "cloud",
messages: [{ role: "user", content: "Reply with the single word: pong" }],
max_tokens: 10,
}),
});
expect(res.ok).toBe(true);
expect(res.headers.get("x-router-backend")).toBe("cloud");
});
it("accepts named Claude models", async () => {
const completion = await client.chat.completions.create({
model: "claude-sonnet-4-6",
messages: [{ role: "user", content: "What is 1+1?" }],
max_tokens: 10,
});
// The OpenAI SDK reads the response body — if this throws, the shape is wrong.
expect(completion.choices[0].message.content).toBeTruthy();
});
});
// ── Auto routing heuristics ───────────────────────────────────────────────────
describe("auto routing heuristics", () => {
it("routes a short prompt to local", async () => {
const res = await fetch(`${ROUTER_URL}/v1/chat/completions`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "auto",
messages: [{ role: "user", content: "What colour is the sky?" }],
max_tokens: 20,
}),
});
expect(res.headers.get("x-router-backend")).toBe("local");
});
it("routes a long prompt with large max_tokens to cloud", async () => {
// ~2 000 estimated tokens — above LOCAL_TOKEN_THRESHOLD
const longContent = "Please elaborate on the following concept: ".repeat(100);
const res = await fetch(`${ROUTER_URL}/v1/chat/completions`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "auto",
messages: [{ role: "user", content: longContent }],
max_tokens: 2000,
}),
});
expect(res.headers.get("x-router-backend")).toBe("cloud");
});
});
// ── Streaming ─────────────────────────────────────────────────────────────────
describe("streaming", () => {
it("streams a local response as SSE chunks", async () => {
const res = await fetch(`${ROUTER_URL}/v1/chat/completions`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "local",
messages: [{ role: "user", content: "Count from one to five, one word per line." }],
max_tokens: 40,
stream: true,
}),
});
expect(res.ok).toBe(true);
expect(res.headers.get("content-type")).toContain("text/event-stream");
const reader = res.body!.getReader();
const dec = new TextDecoder();
const chunks: string[] = [];
while (true) {
const { done, value } = await reader.read();
if (done) break;
chunks.push(dec.decode(value, { stream: true }));
}
const payload = chunks.join("");
expect(payload).toContain("data:");
expect(payload).toContain("[DONE]");
});
it("streams a cloud response in OpenAI SSE format", async () => {
const stream = await client.chat.completions.create({
model: "cloud",
messages: [{ role: "user", content: "Say hello." }],
max_tokens: 20,
stream: true,
});
const parts: string[] = [];
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta.content ?? "";
parts.push(delta);
}
expect(parts.join("").trim().length).toBeGreaterThan(0);
});
});
// ── Cold-start fallback ───────────────────────────────────────────────────────
describe("cold-start fallback", () => {
it("falls back to cloud when vLLM is unavailable", async () => {
// Gate behind an env var so this only runs in the environment where vLLM
// has been deliberately scaled to 0 (e.g. in CI with kubectl scale).
if (!process.env.TEST_COLD_START) {
console.log("Skipping cold-start test — set TEST_COLD_START=1 to enable");
return;
}
const res = await fetch(`${ROUTER_URL}/v1/chat/completions`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "local",
messages: [{ role: "user", content: "Hello." }],
max_tokens: 10,
}),
});
// Request succeeded — caller never sees a 503.
expect(res.ok).toBe(true);
// But the backend was cloud, not local.
expect(res.headers.get("x-router-backend")).toBe("cloud");
});
});
Load Test
The k6 load test ramps to 30 concurrent users, holds for three minutes, then ramps down. Eighty percent of requests are short prompts that should route local; twenty percent are long prompts that should route to cloud. The split approximates a realistic mixed workload.
// tests/load/k6-load.ts
import http from "k6/http";
import { check, sleep } from "k6";
import { Counter, Trend } from "k6/metrics";
import type { Options } from "k6/options";
// Per-backend latency trends let Grafana's k6 dashboard break down
// local vs cloud performance independently.
const localCount = new Counter("router_local_requests");
const cloudCount = new Counter("router_cloud_requests");
const localLatency = new Trend("router_local_latency_ms", true);
const cloudLatency = new Trend("router_cloud_latency_ms", true);
export const options: Options = {
stages: [
{ duration: "1m", target: 10 }, // warm up
{ duration: "3m", target: 30 }, // sustained load
{ duration: "1m", target: 0 }, // ramp down
],
thresholds: {
http_req_failed: ["rate<0.01"], // < 1 % errors
http_req_duration: ["p(95)<10000"], // < 10 s overall
router_local_latency_ms: ["p(95)<4000"], // local p95 < 4 s
router_cloud_latency_ms: ["p(95)<12000"], // cloud p95 < 12 s
},
};
const ROUTER_URL = __ENV.ROUTER_URL ?? "http://localhost:8080";
const SHORT = [
"Classify the sentiment: 'This product is fantastic!'",
"Extract the country from: 'Meeting in Berlin, Germany on Friday.'",
"Is this a question? 'The weather is nice today.'",
"Summarize in five words: 'The quarterly earnings exceeded analyst expectations.'",
"What language is this? 'Bonjour, comment allez-vous?'",
];
const LONG = [
"You are a principal engineer reviewing a system design proposal. The proposal describes a globally distributed event-sourcing platform for a fintech company that processes 500,000 transactions per second. The system must guarantee exactly-once delivery, support time-travel queries across the full event history, and operate across four cloud regions with active-active replication. Evaluate the proposal's approach to conflict resolution, identify the three most significant risks, and recommend a phased rollout strategy that minimises customer-facing risk.",
"Analyse the trade-offs between PostgreSQL row-level security, a dedicated authorisation service, and application-layer permission checks for a multi-tenant SaaS product. The product has 10,000 tenants, each with up to 500 users and a complex permission model based on roles, resource tags, and time-bound access grants. Provide a recommendation with justification, addressing operational overhead, query performance, and auditability.",
];
function pick<T>(arr: T[]): T {
return arr[Math.floor(Math.random() * arr.length)];
}
export default function (): void {
const useShort = Math.random() < 0.8;
const prompt = useShort ? pick(SHORT) : pick(LONG);
const maxTokens = useShort ? 60 : 600;
const start = Date.now();
const res = http.post(
`${ROUTER_URL}/v1/chat/completions`,
JSON.stringify({
model: "auto",
messages: [{ role: "user", content: prompt }],
max_tokens: maxTokens,
}),
{
headers: { "Content-Type": "application/json" },
timeout: "30s",
},
);
const latency = Date.now() - start;
const ok = check(res, {
"status 200": (r) => r.status === 200,
"has choices": (r) => {
try {
const body = JSON.parse(r.body as string);
return (body.choices?.[0]?.message?.content?.length ?? 0) > 0;
} catch {
return false;
}
},
});
if (ok) {
const backend = res.headers["X-Router-Backend"];
if (backend === "local") {
localCount.add(1);
localLatency.add(latency);
} else {
cloudCount.add(1);
cloudLatency.add(latency);
}
}
sleep(1);
}
Compile and run with:
# k6 requires a bundled JS file — use esbuild to compile the TypeScript.
npm install -g esbuild
esbuild tests/load/k6-load.ts --bundle --platform=browser --outfile=tests/load/k6-load.js
ROUTER_URL=https://<your-alb-hostname> k6 run tests/load/k6-load.js
Sample Workloads
The three workload scripts below demonstrate the hybrid approach applied to real tasks. Each imports the OpenAI SDK pointed at the router — the same client code that would target api.openai.com.
workloads/classify.ts — Batch classification on the local model
Short, deterministic tasks. All requests use model: "local" so they never touch the Anthropic API regardless of prompt length. Token cost for this workload is the GPU amortised over running time, not per-token API spend.
// tests/workloads/classify.ts
import OpenAI from "openai";
const client = new OpenAI({
baseURL: `${process.env.ROUTER_URL ?? "http://localhost:8080"}/v1`,
apiKey: "unused",
});
const reviews = [
"Absolutely love this! Fast delivery and perfect quality.",
"Completely broken on arrival. Waste of money.",
"It's decent. Does what it says, nothing more.",
"Best purchase I've made this year. Highly recommend.",
"Mediocre at best. Expected much better for the price.",
"Packaging was damaged but the product itself is fine.",
"Terrible customer service when I tried to return it.",
"Works great, very happy with this.",
];
const SYSTEM = `Classify the sentiment of the review as exactly one of:
POSITIVE, NEGATIVE, or NEUTRAL.
Reply with only the classification — no explanation.`;
console.log("Classifying", reviews.length, "reviews on the local model...\n");
for (const review of reviews) {
const result = await client.chat.completions.create({
model: "local",
messages: [
{ role: "system", content: SYSTEM },
{ role: "user", content: review },
],
max_tokens: 5,
temperature: 0,
});
const label = result.choices[0].message.content?.trim() ?? "UNKNOWN";
console.log(`${label.padEnd(10)} | ${review.slice(0, 60)}`);
}
Run with:
cd tests && ROUTER_URL=https://<alb> npx tsx workloads/classify.ts
Expected output:
Classifying 8 reviews on the local model...
POSITIVE | Absolutely love this! Fast delivery and perfect quality.
NEGATIVE | Completely broken on arrival. Waste of money.
NEUTRAL | It's decent. Does what it says, nothing more.
...
workloads/summarize.ts — Adaptive summarization with auto routing
Short articles stay local; long articles route to cloud automatically. The workload script does not need to know which backend handled each request — the X-Router-Backend header records it, and Langfuse aggregates the split.
// tests/workloads/summarize.ts
import OpenAI from "openai";
const client = new OpenAI({
baseURL: `${process.env.ROUTER_URL ?? "http://localhost:8080"}/v1`,
apiKey: "unused",
});
interface Article {
title: string;
body: string;
}
// In production, load these from a database or S3.
const articles: Article[] = [
{
title: "Short market note",
body: "Tech stocks rose sharply on Tuesday following better-than-expected earnings from major semiconductor firms.",
},
{
title: "Quarterly earnings deep-dive",
body: `${"The company reported Q3 revenue of $4.2 billion, representing 18% year-over-year growth driven by strong demand in its cloud infrastructure division. ".repeat(20)}
Operating margins expanded by 340 basis points to 24.1%, reflecting operational leverage and a favourable mix shift toward higher-margin software subscriptions.
${"The CFO highlighted that recurring revenue now represents 73% of total sales, up from 61% in the prior year, reducing volatility in the revenue base. ".repeat(20)}`,
},
];
const SYSTEM = "Summarise the article in two to three concise sentences.";
for (const article of articles) {
const res = await fetch(
`${process.env.ROUTER_URL ?? "http://localhost:8080"}/v1/chat/completions`,
{
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "auto",
messages: [
{ role: "system", content: SYSTEM },
{ role: "user", content: article.body },
],
max_tokens: 150,
}),
},
);
const backend = res.headers.get("x-router-backend") ?? "unknown";
const body = await res.json<{ choices: Array<{ message: { content: string } }> }>();
const summary = body.choices[0].message.content.trim();
console.log(`\n[${backend.toUpperCase()}] ${article.title}`);
console.log(summary);
}
workloads/plan.ts — Multi-step planning on the cloud model
Tasks that require reasoning across multiple constraints, maintaining consistency over a long output, or synthesising information that benefits from a frontier model's broader training. Explicit model: "cloud" ensures these never fall through to the local model regardless of the auto-routing thresholds.
// tests/workloads/plan.ts
import OpenAI from "openai";
const client = new OpenAI({
baseURL: `${process.env.ROUTER_URL ?? "http://localhost:8080"}/v1`,
apiKey: "unused",
});
const SYSTEM = `You are a senior engineering manager.
Given a technical goal, produce a phased delivery plan with:
- Phase name and duration
- Key deliverables
- Primary risks and mitigations
Be concise. Use markdown headers.`;
const goal = `
Migrate a monolithic Node.js application (300k LOC, 50 engineers, 99.9% SLA)
to a microservices architecture on Kubernetes over the next 12 months.
The application handles financial transactions and has strict audit requirements.
The team has no prior Kubernetes experience.
`;
const stream = await client.chat.completions.create({
model: "cloud",
messages: [
{ role: "system", content: SYSTEM },
{ role: "user", content: goal },
],
max_tokens: 800,
stream: true,
});
process.stdout.write("\nMigration plan:\n\n");
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta.content ?? "";
process.stdout.write(delta);
}
process.stdout.write("\n");
Walking Through the Decisions
Vitest over Jest for the integration suite
Vitest is a drop-in Jest replacement with native TypeScript support, ESM-first module resolution, and substantially faster startup time. For integration tests that hit a live HTTP endpoint, test runner startup overhead does not dominate — but the native ESM support matters: the openai SDK ships as ESM and requires no Babel transform or extra Jest configuration to work.
The testTimeout: 30_000 override acknowledges that inference requests can take several seconds. Vitest's default of 5 seconds will kill tests against a loaded vLLM instance waiting for a batch slot. The integration suite is not a unit test — the timeouts should reflect real network and inference latency, not the speed of in-memory logic.
The OpenAI SDK as the integration test client
Using the openai SDK to call the router rather than raw fetch for most tests proves a meaningful property: any existing code that uses the OpenAI SDK can switch to the hybrid platform by changing one environment variable. The SDK's response type definitions also give the test assertions precise type checking without manual response type definitions.
Raw fetch is still the right choice when testing properties of the HTTP layer itself — status codes, headers like x-router-backend, and SSE framing — because the SDK abstracts those away. The suite uses both: SDK where the content of the response is what matters, fetch where the HTTP mechanics are what matters.
k6 for load testing
k6 runs load tests as JavaScript (compiled from TypeScript via esbuild). It handles connection pooling, concurrency, and metric collection natively. The custom metrics router_local_requests, router_cloud_requests, and the latency trends let k6's summary report and the optional InfluxDB or Prometheus remote-write output break down performance by backend. Without these, the k6 report aggregates all requests together and you cannot see that local p95 is 2 seconds while cloud p95 is 8 seconds — a difference that is invisible in the combined http_req_duration.
The 80/20 short/long split is calibrated to the routing thresholds from Part 6: 80% of prompts should route local (and be fast and free), 20% should route cloud (slower, paid). If the k6 summary shows router_local_requests close to 80% of total, the heuristics are working correctly. If local is lower than expected, check whether KEDA scaled vLLM to zero during the test and the health-check fallback kicked in more than anticipated.
What the dashboards should show under load
With the k6 test running at 30 concurrent users:
Grafana — Inference namespace:
vllm:num_requests_runningshould rise from 0 toward the vLLM batch size limitvllm:num_requests_waitingwill spike briefly as KEDA triggers a replica scale-upDCGM_FI_DEV_GPU_UTILshould climb above 50% — if it stays near zero while requests are queued, the pod is not correctly bound to the GPU (revisit the node affinity and toleration from Part 3)
Grafana — KEDA and Karpenter:
- The
kube_horizontalpodautoscaler_status_current_replicasmetric for the vLLM HPA shows KEDA scaling up replicas as queue depth exceeds the threshold - Karpenter node provisioning events appear in the karpenter namespace logs and in any Loki/logging stack you have attached
Langfuse — Analytics:
- Cost per hour split by backend (local vs cloud) shows the economic benefit of the hybrid approach in real numbers
- Latency distributions confirm local is faster for short completions and cloud handles the long generations without timeouts
- The generation list shows cold-start fallbacks as
local-requested-but-cloud-served generations — if you see many of these during the test warmup, increasing the KEDA cron trigger's baseline replica count (from Part 5) to 2 would eliminate them
The cold-start test requires environment setup
The cold-start integration test is gated behind TEST_COLD_START=1 because it requires scaling vLLM to zero before running — something that should only happen in a dedicated test environment, not production. The recommended CI pattern is:
# In CI, before running the cold-start test:
kubectl -n inference scale deployment/vllm --replicas=0
kubectl -n inference wait deployment/vllm --for=condition=Available=False --timeout=30s
TEST_COLD_START=1 ROUTER_URL=$ROUTER_URL npx vitest run tests/integration/router.test.ts
# Restore afterward.
kubectl -n inference scale deployment/vllm --replicas=1
Keeping this out of the default test run means the integration suite can run against production without disrupting running workloads.
Running the Full Test Suite
cd tests
# Install dependencies.
npm install
# Run integration tests against the live cluster.
# Port-forward for local testing, or use the ALB hostname directly.
kubectl -n router port-forward svc/router 8080:80 &
ROUTER_URL=http://localhost:8080 npm test
# Run the load test against the ALB.
npm run load -- --env ROUTER_URL=https://<alb-hostname>
Integration test output with all routing paths working:
✓ health > returns 200
✓ explicit routing > routes model='local' to vLLM and returns a completion
✓ explicit routing > routes model='cloud' to Anthropic and returns a completion
✓ explicit routing > accepts named Claude models
✓ auto routing heuristics > routes a short prompt to local
✓ auto routing heuristics > routes a long prompt with large max_tokens to cloud
✓ streaming > streams a local response as SSE chunks
✓ streaming > streams a cloud response in OpenAI SSE format
cold-start fallback > Skipping cold-start test — set TEST_COLD_START=1 to enable
Test Files 1 passed (1)
Tests 8 passed | 1 skipped (9)
The Platform We Built
Over eight parts we took an empty AWS account to a working hybrid inference service, one CDK stack at a time. Here is what the final infrastructure looks like:
| Stack | What it contains |
|---|---|
HybridLlmNetwork |
VPC, public/private subnets across 3 AZs, NAT, S3/ECR/STS endpoints |
HybridLlmCluster |
EKS control plane, OIDC provider, KMS-encrypted Secrets, node IAM role |
HybridLlmNodeGroups |
CPU system pool (on-demand m7i), GPU pool (Spot g5), NVIDIA device plugin |
HybridLlmAddons |
AWS Load Balancer Controller, Karpenter with GPU NodePool |
HybridLlmInference |
KEDA, vLLM on GPU pool, Qwen2.5-7B from S3, cron scale-to-zero |
HybridLlmRouter |
TypeScript/Hono router, ALB Ingress, OTel + Langfuse instrumented |
HybridLlmObservability |
kube-prometheus-stack, DCGM Exporter, Grafana Tempo, OTel Collector, Langfuse |
A single cdk deploy --all from a clean account stands up the entire platform in roughly 30 minutes — the EKS control plane is the slowest step. A cdk destroy --all (in reverse order, with the Ingress deleted first) tears it all down.
Rough cost at idle (US East 1, on-demand pricing): two m7i.xlarge system nodes ($0.19/hr combined), NAT gateway ($0.045/hr), interface VPC endpoints (~$0.07/hr). GPU nodes are zero when idle — Karpenter terminates them when vLLM scales to zero. Under active inference the cost depends entirely on how much traffic routes local (GPU Spot time) vs cloud (Anthropic token pricing), which is exactly what the Langfuse dashboard makes visible.
Where to Go From Here
The series companion posts that informed the architectural choices throughout:
- The Hybrid AI Playbook — the strategic case for routing between frontier and local models
- Self-Hosting LLMs on Kubernetes — the operational considerations behind vLLM and the GPU node configuration
- LLM Observability on Kubernetes — the observability patterns Part 7 implements on this platform
The natural next extensions to the platform:
Multi-model serving. Add a second vLLM Deployment running a code-focused model (DeepSeek-Coder, CodeQwen) alongside the general-purpose Qwen model. Extend the router's CLOUD_MODELS and LOCAL_MODELS maps and update the heuristics to steer code-related prompts to the code model specifically.
Production hardening. Replace the bundled Langfuse PostgreSQL with an RDS instance. Add the External Secrets Operator to eliminate the manual kubectl create secret step. Enable VPC CNI prefix delegation to increase pod density. Move the router's Langfuse and OTel secrets into AWS Secrets Manager with an IRSA role.
Larger models. The GPU pool's g5.xlarge (24 GB VRAM) handles 7–13B models at float16. For 70B models, replace the GPU NodePool with p4d.24xlarge (8× A100, 320 GB aggregate VRAM) and set --tensor-parallel-size 8 in the vLLM args. The CDK pattern — node class, node pool, Deployment, KEDA ScaledObject — is identical.