GPU Cost Optimization on Kubernetes: A Practical Guide
GPU compute is the single largest line item for most AI/ML organizations, and the majority of them are wasting between 40% and 70% of that spend. The root cause is straightforward: GPUs are expensive, indivisible by default, and most workloads do not fully saturate them. On Kubernetes, the problem compounds because the default scheduler treats a GPU as an atomic unit — one pod gets one GPU, regardless of whether it uses 5% or 95% of the device's capacity.
This guide walks through the practical techniques we use at Entuit to help teams reclaim that wasted spend without sacrificing performance or reliability.
The GPU Waste Problem
A single NVIDIA A100 80GB instance on a major cloud provider runs roughly $3.00-$3.50 per hour on-demand. If your inference workload only consumes 20% of the GPU's compute cycles and 8GB of its memory, you are effectively paying five times what the workload requires. Multiply that across a fleet and the waste adds up fast: 20 GPUs at $3.25 per hour cost roughly $570,000 per year on-demand, so 80% waste means over $450,000 spent on idle capacity, and at 50 GPUs the waste alone crosses seven figures.
The first step is always measurement. Before optimizing anything, instrument your cluster to understand actual GPU utilization at the pod level.
GPU Time-Slicing with the NVIDIA Device Plugin
NVIDIA's device plugin for Kubernetes supports time-slicing, which allows multiple pods to share a single physical GPU. This is ideal for inference workloads and lightweight training jobs that do not need a full device.
Configure the device plugin with a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
This configuration exposes each physical GPU as four virtual devices. Four pods can now be scheduled on a single GPU, each receiving a time-slice of the compute capacity. Apply the ConfigMap and restart the device plugin DaemonSet:
kubectl apply -f nvidia-device-plugin-config.yaml
kubectl rollout restart daemonset/nvidia-device-plugin -n kube-system
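If the device plugin was installed via Helm, it also needs to be pointed at this ConfigMap through the chart's config options; the exact wiring depends on how you deployed it. After the restart, confirm that each physical GPU is now advertised as four allocatable devices (gpu-node-01 here stands in for one of your GPU nodes):

kubectl describe node gpu-node-01 | grep nvidia.com/gpu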
Pods request these virtual GPUs exactly as they would a full GPU:
resources:
  limits:
    nvidia.com/gpu: 1
The key consideration is memory. Time-slicing does not enforce memory isolation, so pods can interfere with each other if their combined memory usage exceeds the physical device's capacity. Size your replicas based on actual memory profiles, not just compute.
Scheduling Strategies
Proper node affinity and taints ensure GPU workloads land on the right hardware and non-GPU workloads stay off expensive nodes.
Taint your GPU nodes to prevent general workloads from being scheduled there:
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    gpu-type: a100
    gpu-memory: "80Gi"
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
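Editing Node objects by hand is rarely how this is done in practice; the equivalent taint and labels can be applied imperatively, or baked into your node group or autoscaler configuration:

kubectl taint nodes gpu-node-01 nvidia.com/gpu=true:NoSchedule
kubectl label nodes gpu-node-01 gpu-type=a100 gpu-memory=80Gi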
Then use tolerations and node affinity on your GPU workloads:
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-type
                operator: In
                values: ["a100"]
This two-layer approach prevents CPU-only pods from consuming GPU node resources and ensures GPU pods are placed on appropriate hardware tiers.
Right-Sizing with MIG and Utilization Monitoring
NVIDIA Multi-Instance GPU (MIG) on A100 and H100 devices allows a single GPU to be partitioned into up to seven isolated instances, each with dedicated compute, memory, and bandwidth. Unlike time-slicing, MIG provides hardware-level isolation.
For workloads that need guaranteed resources but not a full GPU, MIG is the better choice. A common pattern is to partition an A100 into a mix of sizes: one 3g.40gb instance for a larger model and two 1g.10gb instances for lightweight inference services.
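How the partitions show up in Kubernetes depends on the MIG strategy configured for the device plugin or GPU Operator. With the mixed strategy, each MIG profile is exposed as its own extended resource, so a pod asks for a specific slice instead of a whole GPU. A minimal sketch, assuming the A100 80GB partitioning described above and a placeholder image:

apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  tolerations:
    # tolerate the GPU node taint from the scheduling section
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: model-server
      image: my-registry/model-server:latest  # placeholder image
      resources:
        limits:
          nvidia.com/mig-3g.40gb: 1  # one 3g.40gb MIG instance, not a full GPU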
Monitor utilization with the DCGM (Data Center GPU Manager) exporter and Prometheus:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      # Tolerate the GPU node taint from the scheduling section so the
      # exporter actually lands on GPU nodes.
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.4.0-ubuntu22.04
          ports:
            - containerPort: 9400
Key metrics to track include DCGM_FI_DEV_GPU_UTIL (compute utilization), DCGM_FI_DEV_FB_USED (framebuffer memory used), and DCGM_FI_DEV_GPU_TEMP (temperature, which indicates sustained load). Alert on sustained utilization below 30% — that is a strong signal of over-provisioning.
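To turn that signal into something actionable, a Prometheus alert can catch chronically idle GPUs. The sketch below assumes the Prometheus Operator's PrometheusRule resource is installed; the 6-hour window and 30% threshold are starting points to tune, not gospel:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-underutilization
  namespace: monitoring
spec:
  groups:
    - name: gpu-cost
      rules:
        - alert: GPUUnderutilized
          # average compute utilization below 30% over the last 6 hours
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[6h]) < 30
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} is under 30% utilization"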
Cost Monitoring and Allocation
Kubecost or the OpenCost project can attribute GPU costs to individual teams, namespaces, or workloads. This visibility is essential for chargeback and for identifying optimization targets.
Configure Kubecost with custom GPU pricing that reflects your actual negotiated rates or reserved instance costs, not just on-demand list prices. The default pricing models in most tools overstate costs for organizations with committed-use discounts.
Build dashboards that show cost-per-inference and cost-per-training-run, not just raw infrastructure spend. These unit economics are what connect infrastructure decisions to business outcomes.
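One way to get there, assuming OpenCost's node_gpu_hourly_cost metric and a hypothetical inference_requests_total counter from your serving layer, is a Prometheus recording rule that divides hourly GPU cost by request volume (shown as a plain rules file; wrap it in a PrometheusRule if you run the operator):

groups:
  - name: gpu-unit-economics
    rules:
      # approximate GPU cost per 1,000 inferences over the last hour
      - record: cluster:gpu_cost_per_1k_inferences:1h
        expr: |
          sum(node_gpu_hourly_cost)
          /
          (sum(rate(inference_requests_total[1h])) * 3600 / 1000)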
Practical Recommendations
Start with the highest-impact changes: enable time-slicing on inference clusters where utilization is below 50%, taint GPU nodes to prevent scheduling waste, and deploy the DCGM exporter for visibility. These three steps alone typically yield a 30-40% cost reduction within the first month.
For organizations running large-scale training, the optimization strategy shifts toward spot instances, checkpointing, and preemption-tolerant training frameworks — a topic we will cover in a future post.
The goal is not to minimize GPU spend in absolute terms. It is to maximize the value delivered per dollar of GPU compute. Sometimes that means spending more on GPUs because the business return justifies it. The key is having the data to make that decision deliberately rather than by default.