Building a Hybrid LLM Platform on EKS
Across this blog we keep referring to a hybrid LLM platform — frontier models for the hard reasoning, self-hosted open-source models for the high-volume work, all on Kubernetes. This series builds it from an empty AWS account to a working inference service, one layer at a time, as reproducible AWS CDK infrastructure you can deploy and tear down yourself.
The Target Architecture
                     ┌─────────────────────────┐
client requests ───► │  ALB (public subnets)   │
                     └────────────┬────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │ hybrid router / gateway │  ← cloud vs. local
                     │     (CPU node pool)     │
                     └──────┬─────────────┬────┘
                            │             │
             frontier API   │             │  local inference
             (egress via    │             ▼
             NAT)           │  ┌──────────────────────┐
                            ▼  │  vLLM model servers  │
                    ┌──────────┤   (GPU node pool)    │
                    │ Claude / ├──────────────────────┘
                    │ GPT      │
                    └──────────┘
    all of it on EKS, in private subnets, observed + autoscaled

The Parts
Each part deploys cleanly on its own, with downloadable source. Published parts link to the full walkthrough; the rest are on the way.
The EKS Control Plane
Coming soon
Dropping the cluster into the VPC: the EKS control plane, the OIDC provider, IAM roles, and IRSA, ending with a working kubectl connection.
Node Groups: CPU System Pool & GPU Pool
Coming soon
Managed node groups for the system workloads and a GPU pool for inference: GPU AMIs, the NVIDIA device plugin, and the taints and labels that keep model servers on the right nodes.
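To make the taints-and-labels idea concrete, here is a minimal Python sketch of the two checks involved: a pod's node selector must match the node's labels, and the pod must tolerate every taint on the node. The label `workload` and the taint `nvidia.com/gpu=present` are hypothetical values chosen for illustration, not necessarily the ones this series will use, and the real Kubernetes scheduler handles far more cases (operators, effects, affinity) than this simplified model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass
class Node:
    name: str
    labels: dict[str, str] = field(default_factory=dict)
    taints: list[Taint] = field(default_factory=list)


@dataclass
class Pod:
    name: str
    node_selector: dict[str, str] = field(default_factory=dict)
    # Simplification: a toleration is modelled as an exact copy of the taint.
    tolerations: list[Taint] = field(default_factory=list)


def can_schedule(pod: Pod, node: Node) -> bool:
    """Simplified model: exact-match selector and exact-match tolerations."""
    selector_ok = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    taints_ok = all(t in pod.tolerations for t in node.taints)
    return selector_ok and taints_ok


# Hypothetical label and taint values, for illustration only.
gpu_taint = Taint("nvidia.com/gpu", "present")
gpu_node = Node("gpu-1", labels={"workload": "gpu"}, taints=[gpu_taint])
cpu_node = Node("cpu-1", labels={"workload": "system"})

vllm_pod = Pod("vllm", node_selector={"workload": "gpu"}, tolerations=[gpu_taint])
router_pod = Pod("router", node_selector={"workload": "system"})
```

In this model the taint keeps general workloads off the GPU nodes, while the label and selector pull the model servers onto them. Both halves are needed: a toleration alone only permits a pod to land on a tainted node, it does not send it there.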
Platform Add-ons
Coming soon
The cluster services everything else depends on: the AWS Load Balancer Controller, ingress, and Karpenter for fast, cost-aware autoscaling of GPU capacity.
Serving Local Models with vLLM
Coming soon
Deploying the self-hosted inference layer: vLLM model servers, loading weights, and request-based autoscaling so GPU capacity follows demand.
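As a rough sketch of what request-based autoscaling means, the replica count can be derived from the number of in-flight requests and a per-replica target, then clamped to a floor and a ceiling. The function name, the target of 16 concurrent requests per replica, and the bounds below are assumptions for illustration, not settings from this series.

```python
import math


def desired_replicas(
    in_flight: int,
    target_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Replicas needed so each one handles at most target_per_replica requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(in_flight / target_per_replica)
    # Clamp so the pool never scales to zero or past the GPU budget.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 40 in-flight requests with a target of 16 per replica gives `ceil(2.5) = 3` replicas. The upper bound matters more than usual here because every extra replica is a GPU.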
The Hybrid Router
Coming soon
The gateway that makes it hybrid: routing each request to a frontier model for hard reasoning or to a local model for high-volume execution work.
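The routing decision itself can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the router this series will build: the `needs_reasoning` flag (set by the caller or by some upstream classifier) and the fallback to the frontier API when the local pool is unhealthy are both hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    FRONTIER = "frontier"  # cloud API, reached via NAT egress
    LOCAL = "local"        # self-hosted vLLM on the GPU pool


@dataclass
class Request:
    prompt: str
    # Hypothetical flag: set by the caller or by an upstream classifier.
    needs_reasoning: bool = False


def route(request: Request, local_healthy: bool = True) -> Backend:
    """Send hard reasoning to the frontier model, everything else to local."""
    if request.needs_reasoning:
        return Backend.FRONTIER
    if not local_healthy:
        # Assumed policy: fall back to the cloud when the local pool is down.
        return Backend.FRONTIER
    return Backend.LOCAL
```

Keeping the decision in one pure function makes the policy easy to test and to tune later against cost telemetry.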
Observability & Cost Telemetry
Coming soon
Wiring observability into the platform: OpenTelemetry traces through the router, Prometheus and Grafana for GPU and vLLM metrics, and Langfuse for per-request token and cost telemetry, so you can see cloud-vs-local spend and tune the routing.
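The arithmetic behind per-request cost telemetry is simple enough to sketch in Python: multiply token counts by a per-token price, then sum by backend. Every price below is a placeholder, not a real rate, and the local price stands in for an amortized GPU cost that you would have to estimate yourself. The names `Usage`, `request_cost`, and `spend_by_backend` are hypothetical and unrelated to any Langfuse API.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder prices in USD per 1,000 tokens; not real rates.
PRICE_PER_1K = {
    "frontier": {"input": 0.003, "output": 0.015},
    "local": {"input": 0.0002, "output": 0.0002},  # assumed amortized GPU cost
}


@dataclass
class Usage:
    backend: str
    input_tokens: int
    output_tokens: int


def request_cost(u: Usage) -> float:
    """Cost of one request in USD, from its token counts."""
    price = PRICE_PER_1K[u.backend]
    return (u.input_tokens * price["input"] + u.output_tokens * price["output"]) / 1000


def spend_by_backend(usages: list[Usage]) -> dict[str, float]:
    """Total spend per backend, for a cloud-vs-local comparison."""
    totals: dict[str, float] = defaultdict(float)
    for u in usages:
        totals[u.backend] += request_cost(u)
    return dict(totals)
```

With these placeholder prices, a frontier request with 1,000 input and 1,000 output tokens costs (3 + 15) / 1000 = 0.018 USD. Comparing the per-backend totals is what lets you judge whether the routing policy is paying off.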
Testing, Load & Examples
Coming soon
Validating the platform end-to-end: load testing the inference layer, sample workloads, and proving the routing economics under real traffic.
Prefer the high-level version? The companion Hybrid AI Playbook and Self-Hosting LLMs on Kubernetes cover the why behind this build.
Want This Built for Your Team?
We build hybrid LLM platforms like this one for clients — reproducible, cost-aware, and documented so your team can own it. Book a free call and we'll map the fastest path.