> Blog Post

Building a Hybrid LLM Platform on EKS, Part 3: Node Groups, GPU AMIs, and the NVIDIA Device Plugin

In Part 2 we provisioned the EKS control plane: a private API server, KMS-encrypted Secrets, an OIDC provider for IRSA, and a node role ready to attach to node groups. We verified kubectl cluster-info works and kubectl get nodes returns an empty list — healthy cluster, no capacity. Nothing can schedule yet.

Part 3 fixes that. We add two node groups to the cluster as a third CDK stack. The first is a CPU system pool — general-purpose instances for cluster add-ons, the hybrid router, and anything that does not need a GPU. The second is a GPU pool — instances with NVIDIA GPUs for the vLLM model servers we deploy in Part 5. Then we install the NVIDIA device plugin, the DaemonSet that makes GPUs visible to Kubernetes as a schedulable resource. Finally, we apply the taints and labels that keep workloads on the right nodes and ensure a GPU node is never wasted on a CPU workload.

Two Pools, One Stack

The decision to create two distinct node groups instead of one is a design constraint that surfaces repeatedly through the rest of the series, so it is worth being explicit about it upfront.

GPU instances are expensive. A g5.xlarge (NVIDIA A10G, 24 GB GPU memory) costs roughly five times as much per hour as an m7i.xlarge at on-demand rates. If a CPU workload (say, CoreDNS, the load balancer controller, or the hybrid router) accidentally schedules onto a GPU node, you are wasting that premium instance on work that a general-purpose node handles identically. On a cluster that scales GPU nodes down to zero overnight, a stray CPU pod that lands on an untainted GPU node keeps that instance alive for nothing.

The solution is a taint on the GPU pool. Any pod without an explicit nvidia.com/gpu toleration cannot schedule there. GPU-intensive workloads add the toleration and, optionally, a node affinity rule to land on the GPU pool specifically. CPU workloads never tolerate the taint and therefore never touch GPU nodes.

The system pool runs without a matching taint — it accepts any pod that does not have a GPU requirement, which covers add-ons and the router.
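The rule described above can be modeled in a few lines. This is a minimal sketch of how hard taints are matched against tolerations, not the scheduler's actual implementation; the Taint and Toleration types and the tolerates and canSchedule functions are illustrative names, not part of any Kubernetes client library, and details such as tolerationSeconds are left out.

```typescript
// Illustrative model of taint/toleration matching.
// Type and function names are hypothetical, not from a Kubernetes client library.
interface Taint {
  key: string;
  value?: string;
  effect: "NoSchedule" | "PreferNoSchedule" | "NoExecute";
}

interface Toleration {
  key?: string;
  operator?: "Exists" | "Equal"; // Kubernetes defaults to "Equal"
  value?: string;
  effect?: string; // an empty effect matches every effect
}

// A toleration matches a taint when the effect and key agree and, for "Equal", the value too.
function tolerates(toleration: Toleration, taint: Taint): boolean {
  const effectMatches = !toleration.effect || toleration.effect === taint.effect;
  if (!effectMatches) return false;
  const operator = toleration.operator ?? "Equal";
  if (operator === "Exists") {
    // "Exists" with an empty key tolerates every taint.
    return !toleration.key || toleration.key === taint.key;
  }
  return toleration.key === taint.key && (toleration.value ?? "") === (taint.value ?? "");
}

// A pod can schedule onto a node only if every hard taint is tolerated.
// PreferNoSchedule is a soft preference, so it is ignored here.
function canSchedule(tolerations: Toleration[], taints: Taint[]): boolean {
  return taints
    .filter((taint) => taint.effect !== "PreferNoSchedule")
    .every((taint) => tolerations.some((tol) => tolerates(tol, taint)));
}

const gpuTaint: Taint = { key: "nvidia.com/gpu", value: "present", effect: "NoSchedule" };
const corednsTolerations: Toleration[] = [];
const vllmTolerations: Toleration[] = [
  { key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule" },
];

console.log(canSchedule(corednsTolerations, [gpuTaint])); // false
console.log(canSchedule(vllmTolerations, [gpuTaint])); // true
console.log(canSchedule(corednsTolerations, [])); // true: the system pool has no taint
```

Note that the toleration only permits scheduling onto the GPU pool; it does not pull a pod there. Steering is the job of the affinity rule and the GPU resource request covered later in this post.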

The NodeGroup Stack

We add a third CDK stack that consumes the cluster and nodeRole outputs from ClusterStack. Node group definitions live here rather than in ClusterStack for the same reason the cluster lives in a different stack than the network: node groups change (you tune instance types, adjust scaling bounds, roll AMIs) without touching the cluster itself. Keeping them separate ensures cdk diff on an instance-type change does not propose modifying the control plane.

// lib/node-group-stack.ts
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as eks from "aws-cdk-lib/aws-eks";
import * as iam from "aws-cdk-lib/aws-iam";
import { config } from "./config";

interface NodeGroupStackProps extends cdk.StackProps {
  cluster: eks.Cluster;
  nodeRole: iam.Role;
}

export class NodeGroupStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: NodeGroupStackProps) {
    super(scope, id, props);

    this.createSystemPool(props.cluster, props.nodeRole);
    this.createGpuPool(props.cluster, props.nodeRole);
    this.installNvidiaDevicePlugin(props.cluster);
  }

  private createSystemPool(cluster: eks.Cluster, nodeRole: iam.Role): eks.Nodegroup {
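    // Instantiate the construct with `this` as scope so the node group belongs to
    // NodeGroupStack; cluster.addNodegroupCapacity() would create it inside ClusterStack.
    return new eks.Nodegroup(this, "SystemPool", {
      cluster,
      nodegroupName: `${config.clusterName}-system`,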
      instanceTypes: [new ec2.InstanceType("m7i.xlarge")],
      minSize: 2,
      maxSize: 8,
      desiredSize: 2,
      nodeRole,
      amiType: eks.NodegroupAmiType.AL2023_X86_64_STANDARD,
      labels: {
        "node.kubernetes.io/purpose": "system",
      },
      tags: {
        Name: `${config.clusterName}-system-pool`,
      },
    });
  }

  private createGpuPool(cluster: eks.Cluster, nodeRole: iam.Role): eks.Nodegroup {
    // Same scoping as the system pool: `this` keeps the resource in NodeGroupStack.
    return new eks.Nodegroup(this, "GpuPool", {
      cluster,
      nodegroupName: `${config.clusterName}-gpu`,
      instanceTypes: [
        new ec2.InstanceType("g5.xlarge"),
        new ec2.InstanceType("g5.2xlarge"),
      ],
      minSize: 0,
      maxSize: 4,
      desiredSize: 1,
      nodeRole,
      amiType: eks.NodegroupAmiType.AL2_X86_64_GPU,
      capacityType: eks.CapacityType.SPOT,
      labels: {
        "node.kubernetes.io/purpose": "gpu-inference",
        // The device plugin chart's default node affinity matches this exact key.
        "nvidia.com/gpu.present": "true",
      },
      taints: [
        {
          key: "nvidia.com/gpu",
          value: "present",
          effect: eks.TaintEffect.NO_SCHEDULE,
        },
      ],
      tags: {
        Name: `${config.clusterName}-gpu-pool`,
        // Cluster Autoscaler reads node-template tags to learn which labels a node
        // from this group will carry when it scales the group up from zero.
        "k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/purpose": "gpu-inference",
      },
    });
  }

  private installNvidiaDevicePlugin(cluster: eks.Cluster): void {
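    // Scope the chart to `this` so the release is managed by NodeGroupStack;
    // cluster.addHelmChart() would place it inside ClusterStack.
    new eks.HelmChart(this, "NvidiaDevicePlugin", {
      cluster,
      chart: "nvidia-device-plugin",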
      repository: "https://nvidia.github.io/k8s-device-plugin",
      namespace: "kube-system",
      release: "nvidia-device-plugin",
      version: "0.17.0",
      values: {
        tolerations: [
          {
            key: "nvidia.com/gpu",
            operator: "Exists",
            effect: "NoSchedule",
          },
        ],
        resources: {
          requests: { cpu: "100m", memory: "128Mi" },
          limits: { cpu: "250m", memory: "256Mi" },
        },
      },
    });
  }
}

Update bin/app.ts to wire in the third stack:

// bin/app.ts
import * as cdk from "aws-cdk-lib";
import { NetworkStack } from "../lib/network-stack";
import { ClusterStack } from "../lib/cluster-stack";
import { NodeGroupStack } from "../lib/node-group-stack";
import { config } from "../lib/config";

const app = new cdk.App();
const env = { region: config.region };

const network = new NetworkStack(app, "HybridLlmNetwork", { env });

const cluster = new ClusterStack(app, "HybridLlmCluster", {
  env,
  vpc: network.vpc,
});

new NodeGroupStack(app, "HybridLlmNodeGroups", {
  env,
  cluster: cluster.cluster,
  nodeRole: cluster.nodeRole,
});

Walking Through the Decisions

The system pool instance type

m7i.xlarge (4 vCPU, 16 GB memory) is deliberately sized for headroom, not bare minimums. Each node will host DaemonSets for the CNI, the CloudWatch agent, and Fluent Bit (Part 7), along with any Karpenter or Cluster Autoscaler components (Part 4) and potentially several replicas of the hybrid router (Part 6). On an xlarge those DaemonSets together use under 1 vCPU and 2 GB of memory, leaving the rest for application workloads and burst.

minSize: 2 keeps at least two nodes running at all times, and the underlying Auto Scaling group spreads them across the Availability Zones of the subnets it is given. Single-node system pools are a reliability trap: if the one node gets recycled for an AMI update, you lose all add-ons briefly, including the CNI and scheduler helpers that new pods depend on. With two nodes in separate AZs, a rolling update never drops below one running copy.

We use AL2023_X86_64_STANDARD — the Amazon Linux 2023 AMI — rather than the older AL2 image. AL2023 is the current standard for EKS 1.31+ and receives security patches on an ongoing basis. AL2 is in maintenance mode. New clusters should use AL2023 unless a specific package or configuration requires AL2.

The GPU pool instance type and AMI

g5.xlarge is the intended instance type, with g5.2xlarge as an alternative; both carry a single NVIDIA A10G GPU with 24 GB of GPU memory. The A10G is a good fit for the Llama and Qwen model families in the 7–13B parameter range: quantized to 8-bit, a 13B model needs roughly 13 GB for its weights, which leaves about 11 GB of the 24 GB for the KV cache and concurrent requests. Models at 70B+ parameters require upgrading the GPU pool to p4d or p5 instances and are out of scope for this series.
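The sizing argument is simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below works that out for a few cases against the A10G's 24 GB. The weightMemoryGB function is a hypothetical helper, and the estimate deliberately ignores the KV cache, activations, and runtime overhead, all of which need additional room.

```typescript
// Rough weight-memory estimate: parameters x bytes per parameter.
// Ignores KV cache, activations, and runtime overhead.
function weightMemoryGB(paramsBillions: number, bitsPerParam: number): number {
  const bytesPerParam = bitsPerParam / 8;
  // 1e9 parameters x bytes per parameter / 1e9 bytes per GB
  return paramsBillions * bytesPerParam;
}

const gpuMemoryGB = 24; // NVIDIA A10G

const candidates: Array<[string, number, number]> = [
  ["7B at 16-bit", 7, 16],
  ["13B at 8-bit", 13, 8],
  ["13B at 16-bit", 13, 16],
  ["70B at 8-bit", 70, 8],
];

for (const [name, params, bits] of candidates) {
  const weights = weightMemoryGB(params, bits);
  const headroom = gpuMemoryGB - weights;
  const verdict = headroom > 0 ? `${headroom} GB left for KV cache` : "does not fit";
  console.log(`${name}: ${weights} GB of weights, ${verdict}`);
}
```

A positive headroom figure is necessary but not sufficient: how many concurrent requests fit depends on context length and the serving engine's cache settings, which are measured rather than estimated in practice.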

The reason for listing two instance types in instanceTypes is Spot fleet diversity. EKS managed node groups in Spot mode pick from the provided list based on current availability and price. A single-type Spot pool is fragile — when g5.xlarge capacity is tight in a region, your pool cannot scale up. Two types from the same GPU family (same driver support, same VRAM) give the Spot allocator room to maneuver without any change to how pods schedule.

AL2_X86_64_GPU is the EKS-optimized GPU AMI, which ships with NVIDIA drivers, the CUDA toolkit, and the NVIDIA Container Toolkit (the runtime that lets containers use the GPU device). AWS maintains these AMIs: security patches and driver updates are handled by rolling the node group to a new AMI version, not by running yum install on live nodes. The GPU AMI for EKS 1.32 ships with NVIDIA driver 550.x and CUDA 12.4. (AWS is rolling out AL2023-based GPU AMIs as AL2023_X86_64_NVIDIA; once your CDK version includes that enum value, prefer it.)

Spot for GPU, on-demand for system

The GPU pool uses capacityType: eks.CapacityType.SPOT. Spot pricing on g5.xlarge is typically 60–70% off on-demand, which is the difference between roughly $1.00/hr and $0.35/hr per GPU. For stateless inference replicas that can restart on another node and tolerate brief interruptions, Spot is an appropriate choice.

The system pool uses on-demand (CDK's default). Add-ons like CoreDNS, the load balancer controller, and the cluster autoscaler should not be subject to Spot interruption: a two-minute termination notice on the node running CoreDNS creates a brief DNS outage that impacts everything on the cluster. Two on-demand m7i.xlarge system nodes cost roughly $0.40/hr combined. That is the right tradeoff.
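To see what the Spot discount means over a month, here is a back-of-the-envelope calculation. The hourly rate and discount are the approximate figures quoted above, not live prices, and the helper names and the 730-hour month are assumptions made for illustration.

```typescript
// Back-of-the-envelope monthly cost comparison.
// Rates are the approximate figures quoted in the text, not live AWS prices.
const HOURS_PER_MONTH = 730;

// Cost of running `nodes` instances for `hoursPerDay` hours each day of the month.
function monthlyCost(hourlyRate: number, nodes: number, hoursPerDay = 24): number {
  return hourlyRate * nodes * HOURS_PER_MONTH * (hoursPerDay / 24);
}

// Spot rate after applying a fractional discount (0.65 means 65% off).
function spotRate(onDemandRate: number, discount: number): number {
  return onDemandRate * (1 - discount);
}

const gpuOnDemand = 1.0; // approx. $/hr for g5.xlarge
const gpuSpot = spotRate(gpuOnDemand, 0.65); // midpoint of the 60-70% range

console.log(`Spot rate: $${gpuSpot.toFixed(2)}/hr`);
console.log(`On-demand, 24h/day: $${monthlyCost(gpuOnDemand, 1).toFixed(0)}/mo`);
console.log(`Spot, 24h/day:      $${monthlyCost(gpuSpot, 1).toFixed(0)}/mo`);
console.log(`Spot, 10h/day:      $${monthlyCost(gpuSpot, 1, 10).toFixed(0)}/mo`);
```

The last line is the case that matters for this series: Spot pricing combined with scaling the GPU pool to zero outside working hours.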

Taints and labels

The GPU pool taint nvidia.com/gpu=present:NoSchedule does one thing: prevents any pod that does not explicitly tolerate it from scheduling onto a GPU node. This is a convention, not a Kubernetes enforcement mechanism for GPU usage — a pod can tolerate the taint and schedule onto a GPU node without actually requesting a GPU resource. Real resource isolation comes from the device plugin making nvidia.com/gpu a schedulable resource and pods declaring a resources.limits.nvidia.com/gpu: 1 request. The taint is the first gate; the resource request is the second.

The label node.kubernetes.io/purpose: gpu-inference is what workloads use in nodeAffinity rules to steer toward the GPU pool. When we deploy vLLM in Part 5, the Pod spec will include:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/purpose
              operator: In
              values: ["gpu-inference"]
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

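Affinity without the toleration does nothing: the pod still cannot schedule on a tainted node. The toleration without the affinity permits the pod on GPU nodes but does not steer it there. vLLM pods carry both so they land on the GPU pool; every other pod stays off it simply because it lacks the toleration.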
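Because every GPU workload needs the same three pieces (affinity, toleration, and a GPU limit), it can help to generate them from one place. The following is a sketch of that idea; gpuScheduling and gpuPodManifest are hypothetical helpers, not part of CDK or any Kubernetes library, and the label and taint keys mirror the node group definition in this post.

```typescript
// Hypothetical helper returning the scheduling fields every GPU pod needs.
// Keys and values mirror the GPU node group defined earlier in this post.
function gpuScheduling(gpuCount = 1) {
  return {
    affinity: {
      nodeAffinity: {
        requiredDuringSchedulingIgnoredDuringExecution: {
          nodeSelectorTerms: [
            {
              matchExpressions: [
                { key: "node.kubernetes.io/purpose", operator: "In", values: ["gpu-inference"] },
              ],
            },
          ],
        },
      },
    },
    tolerations: [{ key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule" }],
    // Extended resources are requested through limits; the value is a quantity string.
    resources: { limits: { "nvidia.com/gpu": String(gpuCount) } },
  };
}

// Builds a plain Pod manifest object that carries all three pieces.
function gpuPodManifest(name: string, image: string, gpuCount = 1) {
  const { affinity, tolerations, resources } = gpuScheduling(gpuCount);
  return {
    apiVersion: "v1",
    kind: "Pod",
    metadata: { name },
    spec: {
      restartPolicy: "Never",
      affinity,
      tolerations,
      containers: [{ name, image, resources }],
    },
  };
}

const manifest = gpuPodManifest("gpu-demo", "nvidia/cuda:12.4.1-base-ubuntu22.04");
console.log(JSON.stringify(manifest, null, 2));
```

The result is an ordinary object, so it can be serialized for kubectl apply or, as one option, handed to a CDK manifest construct. Keeping the three fields together avoids the failure mode where a pod gets the toleration but not the GPU limit.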

minSize: 0 on the GPU pool

The GPU pool starts at desiredSize: 1 for testing but can scale to zero. In Part 4 we configure Karpenter to scale GPU nodes down to zero when no vLLM replicas need them, typically overnight or during light-traffic periods. This makes the Spot GPU cost truly variable: you pay for GPU hours when models are being used, not as a fixed infrastructure cost. minSize: 0 is what lets the autoscaler complete the scale-down; with minSize: 1 it would stop at one node even when no pod needs a GPU.
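The effect of minSize is easy to see in a simplified model: whatever capacity an autoscaler wants, the node group only accepts values between minSize and maxSize. The clampDesired function below is an illustrative stand-in for that bound, not code from any autoscaler.

```typescript
// Simplified model: desired capacity is always bounded by [minSize, maxSize].
function clampDesired(wanted: number, minSize: number, maxSize: number): number {
  return Math.min(Math.max(wanted, minSize), maxSize);
}

const nodesNeededOvernight = 0; // no vLLM replicas are scheduled

console.log(clampDesired(nodesNeededOvernight, 0, 4)); // 0: the pool empties
console.log(clampDesired(nodesNeededOvernight, 1, 4)); // 1: one idle GPU node keeps billing
console.log(clampDesired(6, 0, 4)); // 4: maxSize caps a burst
```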

The NVIDIA device plugin

The NVIDIA device plugin is a DaemonSet that runs on every GPU node, discovers the NVIDIA GPUs on the host, and advertises them to the Kubernetes API as an extended resource: nvidia.com/gpu. Without it, Kubernetes has no awareness that a GPU exists on the node — pods cannot request GPU resources, and they cannot be scheduled against GPU limits.

We install it via Helm from the official NVIDIA chart repository. The critical configuration is the toleration matching the GPU pool taint — the DaemonSet itself must tolerate nvidia.com/gpu=present:NoSchedule or it will not run on GPU nodes, defeating its purpose entirely. We pin version: "0.17.0" so CDK's diff shows exactly what would change on an upgrade rather than silently pulling latest.

The plugin needs host-level access to do its job: it talks to the NVIDIA driver to enumerate devices and registers with the kubelet through the device-plugin socket directory on the host (/var/lib/kubelet/device-plugins). This is normal for device plugins and expected by the driver model.

Deploy the Node Groups

# Assumes Parts 1 and 2 are already deployed.
cdk deploy HybridLlmNodeGroups

Managed node group creation takes 3–5 minutes per group — EKS registers the Auto Scaling group, launches the first instances, and bootstraps them to join the cluster. The Helm chart for the device plugin deploys as part of the same cdk deploy.

Verify the Nodes Are Ready

Once the deploy completes, check that both pools joined the cluster:

kubectl get nodes -L node.kubernetes.io/purpose

You should see output like:

NAME                          STATUS   ROLES    AGE   VERSION    PURPOSE
ip-10-0-1-45.ec2.internal     Ready    <none>   4m    v1.32.x    system
ip-10-0-2-12.ec2.internal     Ready    <none>   4m    v1.32.x    system
ip-10-0-3-88.ec2.internal     Ready    <none>   3m    v1.32.x    gpu-inference

Verify the GPU node has the taint and the device plugin has registered the GPU resource:

# Confirm the taint is present.
kubectl describe node <gpu-node-name> | grep -A5 Taints

# Confirm the GPU resource is visible.
kubectl get node <gpu-node-name> -o jsonpath='{.status.capacity.nvidia\.com/gpu}'

The second command should output 1 (or 4 if you are using a multi-GPU instance type). If it outputs nothing, the NVIDIA device plugin DaemonSet has not started successfully on that node — check its logs:

kubectl -n kube-system logs -l app.kubernetes.io/name=nvidia-device-plugin

Common failure modes: the DaemonSet does not tolerate the taint (check tolerations in the Helm values), or the GPU AMI bootstrap did not complete (check the EC2 instance system log via the AWS console or SSM).
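The two manual checks above can also be scripted. The sketch below assumes you have the items array from kubectl get nodes -o json available as parsed JSON; the NodeLike type and the checkGpuNode and checkGpuPool functions are hypothetical names, and only the fields used here are modeled.

```typescript
// Hypothetical checker for the `items` array of `kubectl get nodes -o json`.
// Only the fields this check needs are modeled.
interface NodeLike {
  metadata: { name: string; labels?: Record<string, string> };
  spec?: { taints?: { key: string; effect: string }[] };
  status?: { capacity?: Record<string, string> };
}

// Returns a list of problems for one GPU node; an empty list means it looks healthy.
function checkGpuNode(node: NodeLike): string[] {
  const problems: string[] = [];
  const tainted = (node.spec?.taints ?? []).some(
    (t) => t.key === "nvidia.com/gpu" && t.effect === "NoSchedule",
  );
  if (!tainted) problems.push("missing nvidia.com/gpu NoSchedule taint");
  const gpus = Number(node.status?.capacity?.["nvidia.com/gpu"] ?? "0");
  if (gpus < 1) problems.push("no nvidia.com/gpu capacity (device plugin not registered?)");
  return problems;
}

// Checks every node labeled as part of the GPU pool, keyed by node name.
function checkGpuPool(nodes: NodeLike[]): Record<string, string[]> {
  const report: Record<string, string[]> = {};
  for (const node of nodes) {
    if (node.metadata.labels?.["node.kubernetes.io/purpose"] === "gpu-inference") {
      report[node.metadata.name] = checkGpuNode(node);
    }
  }
  return report;
}

// Sample data standing in for real kubectl output.
const sampleNodes: NodeLike[] = [
  {
    metadata: { name: "system-a", labels: { "node.kubernetes.io/purpose": "system" } },
    status: { capacity: { cpu: "4" } },
  },
  {
    metadata: { name: "gpu-a", labels: { "node.kubernetes.io/purpose": "gpu-inference" } },
    spec: { taints: [{ key: "nvidia.com/gpu", effect: "NoSchedule" }] },
    status: { capacity: { cpu: "4", "nvidia.com/gpu": "1" } },
  },
  {
    metadata: { name: "gpu-b", labels: { "node.kubernetes.io/purpose": "gpu-inference" } },
    spec: { taints: [{ key: "nvidia.com/gpu", effect: "NoSchedule" }] },
    status: { capacity: { cpu: "4" } },
  },
];

console.log(JSON.stringify(checkGpuPool(sampleNodes), null, 2));
```

In the sample data, gpu-b stands in for a node where the device plugin never registered, which is the case the log command above helps diagnose.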

Run a Quick GPU Sanity Check

Before calling Part 3 done, run a pod that exercises the GPU to confirm the full chain — AMI, device plugin, container runtime — is working:

# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/purpose
                operator: In
                values: ["gpu-inference"]
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
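          nvidia.com/gpu: "1"

Apply the manifest, wait for the pod to finish, and read its output:

kubectl apply -f gpu-test.yaml
# Succeeded is a pod phase, not a condition, so wait on the phase via jsonpath.
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s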
kubectl logs gpu-test
kubectl delete pod gpu-test

nvidia-smi output showing the A10G (or whichever GPU your instance type carries) confirms the entire GPU stack is functional. If the pod stays Pending, run kubectl describe pod gpu-test and read the scheduling events: check that the pod tolerates the taint, that a GPU node exists (the pool may have scaled to zero), and that the node advertises nvidia.com/gpu capacity.

Tearing Down

Destroy in reverse order:

cdk destroy HybridLlmNodeGroups
cdk destroy HybridLlmCluster
cdk destroy HybridLlmNetwork

Destroying the node group stack first drains and terminates the EC2 instances before the cluster stack removes the control plane. EKS refuses to delete a cluster that still has managed node groups attached, so a cluster-first teardown can fail partway and leave instances running and billing until you clean them up. Always reverse the creation order.

What's Next

You now have a functional two-pool cluster: system nodes running add-ons, a GPU node running the NVIDIA device plugin that Kubernetes can schedule GPU workloads onto, and a taint/label convention that keeps CPU workloads off the expensive hardware.

In Part 4 we install the platform add-ons that make the cluster production-ready: the AWS Load Balancer Controller (so Kubernetes Ingress objects provision real ALBs), and Karpenter for node autoscaling — including the configuration that scales the GPU pool to zero and back based on whether vLLM replicas are scheduled.

The full source for Parts 1–3 is in the companion repository (link to follow).