
Building a Hybrid LLM Platform on EKS, Part 1: Architecture and the Network Foundation

Across this blog we keep referring to a "hybrid LLM platform" — frontier cloud models for reasoning and planning, self-hosted open-source models for execution, all orchestrated on Kubernetes. We have written about why the hybrid pattern works, self-hosting LLMs on Kubernetes, and keeping the whole thing observable. What we have not done is show you how to build the cluster those posts assume you already have.

This series fixes that. We are going to stand up the entire platform from an empty AWS account to a working hybrid inference service, one layer at a time, using AWS CDK so every piece is real, versioned, reproducible infrastructure-as-code rather than a pile of eksctl flags nobody remembers. This is a tutorial, not a screenshot tour — you can follow along, deploy each part, and tear it down again without surprises on your next bill.

This first part does two things. It lays out the architecture for the whole series so you know where every later part fits. Then it builds the foundation that has to come first and is easy to get wrong: the network. By the end you will have a VPC, public and private subnets across multiple availability zones, NAT egress, and the VPC endpoints that keep your GPU image pulls fast and cheap, all defined in TypeScript.

What We Are Building

The platform is a hybrid inference system. Requests arrive at a router. The router decides, per request, whether the work needs a frontier model (Claude, GPT — complex reasoning, planning, ambiguous judgment) or whether a local open-source model (Llama, Qwen, served by vLLM) can handle it (code generation, summarization, extraction, classification). Frontier calls go out to a vendor API. Local calls hit model servers running on GPU nodes inside the cluster. Everything is observable, autoscaled, and cost-controlled.
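
To make the routing rule concrete, here is a minimal sketch of the per-request decision. The task categories come from the paragraph above; the TaskKind type, the chooseBackend function, and the default-to-frontier fallback are illustrative assumptions, not the router we build in Part 6.

```typescript
// Hypothetical sketch of the cloud-vs-local decision; all names are illustrative.
type TaskKind =
  | "reasoning" | "planning" | "judgment"                           // frontier-model work
  | "codegen" | "summarization" | "extraction" | "classification"; // local-model work

type Backend = "frontier" | "local";

// Task kinds the self-hosted vLLM pool is expected to handle.
const LOCAL_TASKS: ReadonlySet<TaskKind> = new Set<TaskKind>([
  "codegen", "summarization", "extraction", "classification",
]);

/** Pick a backend for one request based on its task kind. */
function chooseBackend(task: TaskKind): Backend {
  // Assumption: anything not explicitly marked local goes to the frontier API.
  return LOCAL_TASKS.has(task) ? "local" : "frontier";
}

console.log(chooseBackend("summarization")); // local
console.log(chooseBackend("planning"));      // frontier
```

A real router would likely weigh more than a task label (prompt size, latency budget, cost ceilings), but the shape is the same: one decision per request, two possible destinations.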

That is the destination. Here is the full stack we will build to get there:

                          ┌──────────────────────────┐
   client requests  ───►  │   ALB (public subnets)   │
                          └─────────────┬────────────┘
                                        │
                          ┌─────────────▼────────────┐
                          │  hybrid router / gateway │   ← cloud vs. local routing
                          │     (CPU node pool)      │
                          └──────┬────────────┬──────┘
                                 │            │
                   frontier API  │            │  local inference
                   (egress via   │            │
                    NAT)         ▼            ▼
                      ┌────────────────┐  ┌──────────────────────┐
                      │  Claude / GPT  │  │  vLLM model servers  │
                      │  (vendor API)  │  │   (GPU node pool)    │
                      └────────────────┘  └──────────────────────┘
        all of it on EKS, in private subnets, observed + autoscaled

The Series Roadmap

We are deliberately breaking this into focused parts. Each one deploys cleanly on its own and the source for each lands in a companion GitHub repository (published alongside the series — links will be added here as each part ships). The plan:

  1. Architecture and the network foundation (this part) — the VPC, subnets, NAT, and VPC endpoints in CDK.
  2. The EKS control plane — the cluster itself, IAM, OIDC, and IRSA, with kubectl access.
  3. Node groups: CPU system pool and GPU pool — managed node groups, GPU AMIs, the NVIDIA device plugin, and node taints/labels.
  4. Platform add-ons — AWS Load Balancer Controller, ingress, and autoscaling (Karpenter).
  5. Serving local models — deploying vLLM, loading model weights, and request-based autoscaling.
  6. The hybrid router — the gateway that routes between cloud and local models.
  7. Observability and cost telemetry — wiring OpenTelemetry traces through the router, Prometheus and Grafana for GPU and vLLM metrics, and Langfuse for per-request token and cost telemetry, so you can see cloud-vs-local spend and tune the routing. (This is where the platform connects to the patterns in LLM Observability on Kubernetes.)
  8. Testing, load, and examples — validating the platform end-to-end and sample workloads.

If you only care about one layer, you can skip to it once it is published. But the network comes first because everything else lives inside it, and changing a VPC's address plan after you have workloads in it is genuinely painful.

Why CDK, and Why a Separate Network Stack

You can create a cluster-ready VPC in a single eksctl command. We are not going to, for the same reasons we rebuilt our CI/CD pipeline in Dagger instead of YAML: the convenient one-liner hides decisions you will need to change later, and it is not the thing you check into a repository and review in a pull request.

AWS CDK lets us define the network as a TypeScript program. We get type checking, IDE autocomplete, the ability to factor common values into constants, and — critically — a cdk diff that shows exactly what will change before it changes. The network becomes a reviewable, versioned artifact.

We also split the network into its own CDK stack, separate from the cluster stack we build in Part 2. The network changes rarely and is expensive to recreate; the cluster and its add-ons change often. Keeping them in separate stacks means a routine cluster change cannot accidentally propose replacing your VPC. CDK may still include the network stack as a dependency when you deploy the cluster, but an unchanged stack deploys as a no-op. The cluster stack will consume the VPC as an input.

Project Setup

If you have not used CDK before, you need Node.js 20+, the AWS CLI configured with credentials, and the CDK toolkit. Bootstrap the account once per region — CDK uses a small managed stack for asset storage and deployment roles:

npm install -g aws-cdk
mkdir eks-hybrid-llm && cd eks-hybrid-llm
cdk init app --language typescript
cdk bootstrap aws://<ACCOUNT_ID>/us-east-1

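The cdk init template already declares aws-cdk-lib and constructs, the only two libraries this part needs. Re-running the install is harmless and confirms both are present: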

npm install aws-cdk-lib constructs

We will keep stacks in a lib/ directory and shared configuration in one place so values like the cluster name and AZ count are defined once and reused by every stack in the series.

// lib/config.ts
export const config = {
  /** Reused as a resource-name prefix and in EKS subnet discovery tags. */
  clusterName: "hybrid-llm",
  region: "us-east-1",
  /** VPC address space. /16 = 65,536 addresses — we will explain why that matters. */
  cidr: "10.0.0.0/16",
  /** Availability zones to spread across. 3 for production resilience. */
  maxAzs: 3,
  /**
   * NAT gateways. 1 is cheaper (~$32/mo + data) but a single point of egress
   * failure; set this to maxAzs for production HA. We default to 1 for the
   * tutorial and call out the tradeoff below.
   */
  natGateways: 1,
} as const;

Designing the Address Plan Before Writing Code

This is the step people skip, and it is the one that bites hardest. On EKS the Amazon VPC CNI assigns every pod a real IP address from the VPC subnet — not an overlay address from a private range invisible to the VPC. That is great for network performance and security-group integration, and it means subnet sizing is pod-capacity planning.

Do the arithmetic. Add up a GPU node running several vLLM replicas, daemonset pods (CNI, monitoring, log shipping) on every node, the system node pool, and headroom for rolling deployments that briefly double pod counts, and pod IP demand climbs fast. A /24 subnet (256 addresses, 251 usable after AWS reserves five) sounds generous until a busy node pool and a deploy surge exhaust it, and "subnet has no free IP addresses" is a miserable failure to debug because pods simply stay Pending with a CNI error.
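
As a rough illustration of that arithmetic, the sketch below counts the IPs one subnet would need. Every input (node counts, pods per node, the surge factor) is a hypothetical example, and the PoolDemand and estimateIpDemand names are made up for this post; substitute your own numbers.

```typescript
// Rough pod-IP demand for a single subnet. All inputs are hypothetical examples.
interface PoolDemand {
  nodes: number;                // nodes from this pool placed in the subnet
  workloadPodsPerNode: number;  // e.g. vLLM replicas or router pods
  daemonsetPodsPerNode: number; // CNI, monitoring, log shipping, ...
}

/** One IP per node plus one per pod; workload pods are scaled by a deploy-surge factor. */
function estimateIpDemand(pools: PoolDemand[], surgeFactor: number): number {
  let total = 0;
  for (const p of pools) {
    const workloadPods = Math.ceil(p.nodes * p.workloadPodsPerNode * surgeFactor);
    const daemonPods = p.nodes * p.daemonsetPodsPerNode;
    total += p.nodes + workloadPods + daemonPods; // node primary IPs + pod IPs
  }
  return total;
}

// A /24 has 256 addresses; AWS reserves 5 in every subnet.
const usableInSlash24 = Math.pow(2, 32 - 24) - 5;

const demand = estimateIpDemand(
  [
    { nodes: 10, workloadPodsPerNode: 4, daemonsetPodsPerNode: 4 },  // GPU pool
    { nodes: 6, workloadPodsPerNode: 12, daemonsetPodsPerNode: 4 },  // CPU system pool
  ],
  2, // rolling deploys briefly double workload pods
);

console.log({ demand, usableInSlash24, fits: demand <= usableInSlash24 });
```

With these made-up inputs the estimate already exceeds what a /24 can hold. Real consumption tends to run higher still, because the VPC CNI keeps a pool of warm IPs attached to each node beyond the ones pods are actively using.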

So we plan for density. We give each private subnet a /20 (4,096 addresses) and each public subnet a /24 — public subnets only hold load balancers and NAT gateways, which need few addresses. With a 10.0.0.0/16 VPC and three AZs that fits comfortably with room to grow. (In a later part we will also enable VPC CNI prefix delegation, which assigns /28 IP prefixes to nodes instead of individual IPs and dramatically raises pods-per-node limits — but generous subnets are the prerequisite either way.)
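
The plan above is easy to check with a few lines of arithmetic. This sketch only does the sums for the prefix lengths named in the text; it does not predict which exact CIDR ranges CDK will carve out, and the SubnetTier and blockSize names are illustrative.

```typescript
// Address-plan arithmetic for the layout described above.
const AWS_RESERVED_PER_SUBNET = 5;

/** Total addresses in an IPv4 CIDR block with the given prefix length. */
function blockSize(prefixLength: number): number {
  return Math.pow(2, 32 - prefixLength);
}

interface SubnetTier {
  label: string;
  prefixLength: number;
}

const vpcPrefix = 16;
const azCount = 3;
const tiers: SubnetTier[] = [
  { label: "public", prefixLength: 24 },
  { label: "private", prefixLength: 20 },
];

// Sum the space consumed by one subnet of each tier in every AZ.
let allocated = 0;
for (const tier of tiers) {
  const size = blockSize(tier.prefixLength);
  allocated += size * azCount;
  console.log(
    `${tier.label}: /${tier.prefixLength} = ${size} addresses, ` +
      `${size - AWS_RESERVED_PER_SUBNET} usable, x${azCount} AZs`,
  );
}

const vpcSize = blockSize(vpcPrefix);
console.log(`allocated ${allocated} of ${vpcSize} (${((allocated / vpcSize) * 100).toFixed(1)}%)`);
```

Three /24s and three /20s consume 13,056 of the VPC's 65,536 addresses, roughly a fifth, which leaves most of the /16 free for additional subnet tiers later.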

The Network Stack

Here is the complete network stack. We will walk through each decision below it.

// lib/network-stack.ts
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import { config } from "./config";

export class NetworkStack extends cdk.Stack {
  /** Exposed so the cluster stack (Part 2) can place the EKS cluster in it. */
  public readonly vpc: ec2.Vpc;

  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    this.vpc = new ec2.Vpc(this, "Vpc", {
      ipAddresses: ec2.IpAddresses.cidr(config.cidr),
      maxAzs: config.maxAzs,
      natGateways: config.natGateways,
      subnetConfiguration: [
        {
          name: "public",
          subnetType: ec2.SubnetType.PUBLIC,
          cidrMask: 24, // LBs + NAT only — small is fine
          mapPublicIpOnLaunch: false,
        },
        {
          name: "private",
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
          cidrMask: 20, // pods get real VPC IPs — size for density
        },
      ],
    });

    this.tagSubnetsForEks();
    this.addVpcEndpoints();

    new cdk.CfnOutput(this, "VpcId", { value: this.vpc.vpcId });
  }

  /**
   * EKS and the AWS Load Balancer Controller auto-discover subnets by tag.
   * Public subnets get role/elb (internet-facing LBs); private subnets get
   * role/internal-elb (internal LBs) and host the worker nodes.
   */
  private tagSubnetsForEks() {
    const clusterTag = `kubernetes.io/cluster/${config.clusterName}`;
    cdk.Tags.of(this.vpc).add(clusterTag, "shared");

    for (const subnet of this.vpc.publicSubnets) {
      cdk.Tags.of(subnet).add("kubernetes.io/role/elb", "1");
    }
    for (const subnet of this.vpc.privateSubnets) {
      cdk.Tags.of(subnet).add("kubernetes.io/role/internal-elb", "1");
    }
  }

  /**
   * VPC endpoints keep high-volume AWS traffic off the NAT gateway.
   * Pulling multi-gigabyte GPU/model container images through NAT is slow
   * and billed per GB; the S3 + ECR endpoints route it privately instead.
   */
  private addVpcEndpoints() {
    // Gateway endpoints are free and cover S3 (where ECR layers live).
    this.vpc.addGatewayEndpoint("S3Endpoint", {
      service: ec2.GatewayVpcEndpointAwsService.S3,
    });

    // Interface endpoints (hourly + per-GB, cheaper than NAT at volume).
    const interfaceEndpoints: Record<string, ec2.InterfaceVpcEndpointAwsService> = {
      EcrApi: ec2.InterfaceVpcEndpointAwsService.ECR,
      EcrDocker: ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER,
      Logs: ec2.InterfaceVpcEndpointAwsService.CLOUDWATCH_LOGS,
      Sts: ec2.InterfaceVpcEndpointAwsService.STS,
    };

    for (const [id, service] of Object.entries(interfaceEndpoints)) {
      this.vpc.addInterfaceEndpoint(`${id}Endpoint`, {
        service,
        subnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
      });
    }
  }
}

And the app entrypoint that instantiates it:

// bin/app.ts
import * as cdk from "aws-cdk-lib";
import { NetworkStack } from "../lib/network-stack";
import { config } from "../lib/config";

const app = new cdk.App();
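// Pin both account and region. With the account left unresolved the stack is
// environment-agnostic, and CDK falls back to two AZs regardless of maxAzs.
const env = { account: process.env.CDK_DEFAULT_ACCOUNT, region: config.region };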

new NetworkStack(app, "HybridLlmNetwork", { env });

Walking Through the Decisions

PRIVATE_WITH_EGRESS for worker subnets. EKS worker nodes and pods go in private subnets — they should never be directly reachable from the internet. PRIVATE_WITH_EGRESS gives them a route to the NAT gateway so they can reach out (to pull images, call the frontier model API, hit AWS services) while remaining unreachable inbound. The public subnets exist only for the load balancers that front the platform and for the NAT gateways themselves.

mapPublicIpOnLaunch: false on public subnets. We never launch nodes into public subnets, so nothing there should auto-assign a public IP. Setting this false is a small hardening step that prevents an accidental public-facing instance.

The EKS discovery tags are not optional. This is a classic first-cluster failure. EKS and the AWS Load Balancer Controller find subnets by tag, not by name. Without kubernetes.io/role/elb on public subnets, an internet-facing Service type=LoadBalancer or ALB Ingress has nowhere to provision and silently fails to get an address. We apply these tags now so the controller we install in Part 4 just works. The kubernetes.io/cluster/<name> tag with value shared is added on the VPC construct, and CDK propagates it to the subnets and other taggable resources inside it. It marks them as usable by the cluster without claiming exclusive ownership.
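
If you want to check the tags yourself rather than trust the stack, a small helper along these lines can run over the JSON that aws ec2 describe-subnets returns. The Key/Value tag shape matches that output; the missingDiscoveryTags function, the SubnetInfo type, and the subnet ID are hypothetical, and the check assumes the exact tag values this stack applies.

```typescript
// Hypothetical helper: report which EKS discovery tags a subnet is missing.
interface AwsTag {
  Key: string;
  Value: string;
}

interface SubnetInfo {
  SubnetId: string;
  Tags?: AwsTag[];
}

function missingDiscoveryTags(
  subnet: SubnetInfo,
  tier: "public" | "private",
  clusterName: string,
): string[] {
  // Tags this tutorial's stack is expected to apply to a subnet of the given tier.
  const expected: AwsTag[] = [
    { Key: `kubernetes.io/cluster/${clusterName}`, Value: "shared" },
    tier === "public"
      ? { Key: "kubernetes.io/role/elb", Value: "1" }
      : { Key: "kubernetes.io/role/internal-elb", Value: "1" },
  ];
  const actual = new Map(
    (subnet.Tags ?? []).map((t): [string, string] => [t.Key, t.Value]),
  );
  return expected
    .filter((e) => actual.get(e.Key) !== e.Value)
    .map((e) => `${e.Key}=${e.Value}`);
}

// Example input with a made-up subnet ID and no tags at all.
const untagged: SubnetInfo = { SubnetId: "subnet-0example", Tags: [] };
console.log(missingDiscoveryTags(untagged, "public", "hybrid-llm"));
```

An empty array means the subnet carries everything the controller looks for; anything else is the list of tags to add.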

NAT gateways are the cost lever. A NAT gateway is roughly $32/month plus per-GB data processing. One NAT gateway is a single point of egress failure: if its AZ has an issue, pods in the other AZs lose outbound connectivity. For production you set natGateways: config.maxAzs so each AZ has its own. For a tutorial you are tearing down nightly, one is fine — which is exactly why it is a config value, not a hardcoded constant.

VPC endpoints pay for themselves on a GPU cluster. GPU and model-server container images are large — multi-gigabyte CUDA base layers are normal. Every byte pulled through a NAT gateway is billed and bottlenecked. The S3 gateway endpoint (free) and the ECR interface endpoints route image pulls privately and faster. The STS and CloudWatch Logs endpoints keep IRSA token exchange and log shipping off NAT too. On a cluster that constantly pulls big images and ships logs, these endpoints typically cost less than the NAT data charges they eliminate — and they reduce your exposure to public-internet routing.
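
To see where the interface endpoints start paying for themselves, here is a back-of-the-envelope break-even calculation. The prices are assumptions for illustration only (check the current AWS pricing pages for your region), and the Pricing type and function names are made up for this sketch.

```typescript
// Break-even sketch: interface endpoints vs. NAT data processing.
// All prices below are assumptions for illustration; verify against current AWS pricing.
const HOURS_PER_MONTH = 730;

interface Pricing {
  natPerGb: number;             // NAT gateway data processing, $/GB
  endpointPerHourPerAz: number; // interface endpoint, $/hour in each AZ
  endpointPerGb: number;        // interface endpoint data processing, $/GB
}

const assumedPricing: Pricing = {
  natPerGb: 0.045,
  endpointPerHourPerAz: 0.01,
  endpointPerGb: 0.01,
};

/** Fixed monthly cost of `endpointCount` interface endpoints deployed in `azs` AZs. */
function endpointFixedMonthly(p: Pricing, endpointCount: number, azs: number): number {
  return endpointCount * azs * p.endpointPerHourPerAz * HOURS_PER_MONTH;
}

/** GB/month of AWS-service traffic at which the endpoints become cheaper than NAT. */
function breakEvenGb(p: Pricing, endpointCount: number, azs: number): number {
  return endpointFixedMonthly(p, endpointCount, azs) / (p.natPerGb - p.endpointPerGb);
}

// The stack above creates 4 interface endpoints across 3 AZs.
const fixedCost = endpointFixedMonthly(assumedPricing, 4, 3);
console.log(`fixed endpoint cost: $${fixedCost.toFixed(2)}/month`);
console.log(`break-even: ${Math.round(breakEvenGb(assumedPricing, 4, 3))} GB/month`);
```

This break-even applies only to traffic that crosses the interface endpoints. The S3 gateway endpoint has no hourly or per-GB charge, so the image-layer bytes it carries are savings against NAT from the first gigabyte. On a small tutorial cluster you may decide the gateway endpoint alone is enough and add the interface endpoints once traffic justifies them.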

Deploy It

# See exactly what will be created before creating it.
cdk diff HybridLlmNetwork

# Provision the network.
cdk deploy HybridLlmNetwork

The deploy takes a few minutes — NAT gateways and interface endpoints are the slow part. When it finishes, CDK prints the VpcId output. Confirm the shape of what you built:

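# List the subnets with their CIDRs and tiers, sorted by AZ.
# MapPublicIpOnLaunch is false on every subnet in this stack, so the tier is
# read from the aws-cdk:subnet-type tag that CDK applies to each subnet.
aws ec2 describe-subnets \
  --filters "Name=tag:kubernetes.io/cluster/hybrid-llm,Values=shared" \
  --query "sort_by(Subnets,&AvailabilityZone)[].{AZ:AvailabilityZone,CIDR:CidrBlock,Type:Tags[?Key=='aws-cdk:subnet-type']|[0].Value}" \
  --output table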

You should see public /24 and private /20 subnets across your three AZs. That is the foundation Part 2 builds the cluster into.

Tearing Down

Because this is a tutorial you will want to destroy and rebuild freely. The network stack on its own tears down cleanly:

cdk destroy HybridLlmNetwork

A word of warning that matters once the cluster exists: in later parts the EKS cluster will create AWS resources (load balancers, ENIs) inside this VPC that CDK did not create and therefore does not know to delete. If you try to destroy the network while those exist, the VPC deletion hangs on dependencies. The rule for the rest of the series: tear down in reverse order of creation — workloads, then add-ons, then cluster, then network. We will repeat this reminder where it bites.

What's Next

You now have a properly sized, properly tagged, cost-aware network defined entirely in CDK — the unglamorous layer that determines whether everything above it works. In Part 2 we drop the EKS control plane into this VPC: the cluster itself, the OIDC provider, IAM roles, and IRSA, ending with a working kubectl connection to an empty-but-real cluster.

If you want the broader context for why this platform is shaped the way it is before going deeper, start with The Hybrid AI Playbook and Self-Hosting LLMs on Kubernetes. This series is where those ideas become a running cluster.