Self-Hosting LLMs on Kubernetes: A Practical Guide

Most teams start with an API call to OpenAI or Anthropic. At a few hundred requests per day, it is the right move — low friction, zero infrastructure, instant access to frontier models. But then usage grows. A support chatbot goes from internal pilot to customer-facing. A document processing pipeline scales from dozens of files to thousands. An agentic workflow chains ten LLM calls per task. Suddenly the monthly API bill is the largest line item in the cloud budget, and the CFO is asking questions.

Cost is the most visible trigger, but it is rarely the only one. Teams in healthcare, finance, and government cannot send patient records or financial data to a third-party API without compliance review that takes longer than the project itself. Latency-sensitive applications — real-time copilots, interactive search, agentic loops with sequential reasoning steps — pay a tax on every round-trip to an external endpoint. And teams that need fine-tuned models on proprietary data often find that the model they need is not available as a hosted API at all.

This guide covers the full path from choosing a serving framework to running autoscaled LLM inference on Kubernetes in production. Every manifest and configuration shown here is something we have deployed for teams at Entuit. We assume you have a working Kubernetes cluster and basic familiarity with Deployments, Services, and Helm.

When Self-Hosting Makes Sense (and When It Does Not)

The decision to self-host is ultimately a cost-per-token calculation with operational overhead factored in. Here is how the math typically breaks down for a standard workload — an enterprise chatbot averaging 1,000 input tokens and 500 output tokens per request.

Daily Requests | OpenAI GPT-4o-mini | Self-Hosted Llama 3.1 8B (L4) | Self-Hosted Llama 3.1 70B (2xA100)
10,000 | ~$4.50/day | ~$17/day (1 GPU) | ~$156/day (2 GPUs)
50,000 | ~$22.50/day | ~$17/day (1 GPU) | ~$156/day (2 GPUs)
100,000 | ~$45/day | ~$34/day (2 GPUs) | ~$312/day (4 GPUs)
500,000 | ~$225/day | ~$119/day (7 GPUs) | ~$780/day (10 GPUs)
1,000,000 | ~$450/day | ~$204/day (12 GPUs) | ~$1,560/day (20 GPUs)

Pricing as of April 2026. OpenAI GPT-4o-mini: $0.15/1M input tokens, $0.60/1M output tokens. L4 GPU: ~$0.70/hr on-demand. A100 80GB: ~$3.25/hr on-demand. Self-hosted estimates assume 80% GPU utilization.

For 8B-class models, self-hosting breaks even around 35,000-40,000 requests per day, the point where the flat ~$17/day cost of a single L4 drops below the per-token API bill. For 70B-class models the crossover comes far later, because GPU costs scale steeply with model size. And these numbers exclude operational overhead: budget roughly 0.25-0.5 FTE ($15K-25K/month) of engineering time to manage a GPU fleet, amortized across your total GPU footprint.
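
The arithmetic behind the table follows directly from the per-token prices in the footnote. A minimal sketch, using that pricing plus an assumed capacity of roughly 50,000 requests/day per L4 (real capacity depends on context length and batching, and the table assumes somewhat better packing at scale):

```python
# Simplified cost model for the table above. Assumptions: 1,000 input and
# 500 output tokens per request, GPT-4o-mini at $0.15/$0.60 per 1M tokens,
# an L4 at ~$0.70/hr on-demand, and ~50,000 requests/day per L4.
API_IN = 0.15 / 1_000_000   # $ per input token
API_OUT = 0.60 / 1_000_000  # $ per output token
L4_DAILY = 0.70 * 24        # $ per GPU-day

def api_cost_per_day(requests: int, tok_in: int = 1_000, tok_out: int = 500) -> float:
    return requests * (tok_in * API_IN + tok_out * API_OUT)

def self_hosted_cost_per_day(requests: int, reqs_per_gpu: int = 50_000) -> float:
    gpus = max(1, -(-requests // reqs_per_gpu))  # ceiling division
    return gpus * L4_DAILY

for n in (10_000, 50_000, 100_000, 500_000, 1_000_000):
    print(f"{n:>9,} req/day  API ${api_cost_per_day(n):8.2f}  "
          f"self-hosted ${self_hosted_cost_per_day(n):8.2f}")
```

The breakeven falls where the flat GPU cost crosses the per-token API cost: $16.80/day divided by $0.00045/request is roughly 37,000 requests per day for a single L4.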

Beyond cost, self-hosting makes clear sense when:

  • Data cannot leave your network. HIPAA, SOC 2, ITAR, or internal policies that prohibit sending data to third-party APIs. Self-hosted models run entirely within your VPC.
  • Latency matters. Eliminating the round-trip to an external API shaves 50-200ms per call. For agentic workflows chaining 10+ calls, that compounds fast.
  • You need a fine-tuned model. Custom weights trained on proprietary data are not available through any hosted API. You need to serve them yourself.
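
To put numbers on the latency point, here is the compounding arithmetic under assumed, purely illustrative overheads:

```python
# Illustrative compounding of network overhead in a sequential agentic loop.
# The overhead figures are assumptions for this sketch, not measurements:
# ~120 ms round-trip to an external API vs ~10 ms to an in-cluster endpoint.
calls_per_task = 12        # sequential LLM calls in one agentic task
external_overhead_ms = 120
internal_overhead_ms = 10

saved_ms = calls_per_task * (external_overhead_ms - internal_overhead_ms)
print(f"~{saved_ms / 1000:.2f} s of pure network overhead removed per task")
```

Across a thousand tasks a day, that is over twenty minutes of aggregate waiting removed before any throughput gains from batching.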

And self-hosting does not make sense when:

  • Volume is low. Under 10,000 requests per day, the operational overhead dwarfs the cost savings.
  • You need frontier capabilities. GPT-4o, Claude Opus, and Gemini Ultra have no open-source equivalents. If your use case requires that level of reasoning, APIs are the only option.
  • Your team has no Kubernetes or GPU experience. The learning curve is real. Budget 2-4 weeks for a team's first production deployment.

Choosing a Model Serving Framework

Three frameworks dominate the space: vLLM, Text Generation Inference (TGI), and Ollama. They solve different problems.

vLLM is the production standard. It uses PagedAttention for efficient KV-cache memory management, continuous batching to maximize GPU throughput, and natively exposes an OpenAI-compatible API. Model support is broad — Llama 3, Mistral, Qwen, Phi, Gemma, and most Hugging Face transformer architectures. Tensor parallelism across multiple GPUs works out of the box for models that exceed a single GPU's VRAM. For production Kubernetes deployments, we use vLLM almost exclusively.

TGI (Text Generation Inference by Hugging Face) is a solid alternative, especially if your workflow is deeply integrated with the Hugging Face ecosystem. It supports quantization well and has strong Hugging Face Hub integration. However, vLLM consistently outperforms TGI on throughput benchmarks for most model architectures, and TGI's API is less compatible with existing OpenAI SDK-based codebases.

Ollama is excellent for local development and prototyping. The developer experience is unmatched — ollama run llama3 and you are chatting with a model in seconds. But it is a poor fit for production Kubernetes. It lacks continuous batching, handles concurrent requests poorly, and its opinionated model management does not map well to Kubernetes' declarative resource model.

The rest of this guide assumes vLLM.

GPU Infrastructure on Kubernetes

GPU Node Pools

Keep GPU nodes in dedicated node pools, separate from your CPU workloads. Taint the GPU nodes so that only workloads that explicitly tolerate the taint get scheduled there. This prevents a CPU-only pod from accidentally landing on a $3/hr GPU node.

GPU type selection depends on model size and throughput requirements:

  • NVIDIA T4 (16GB VRAM): The budget option. Handles 7B parameter models with 4-bit quantization. Around $0.35/hr on spot instances. Good for development and low-traffic production workloads.
  • NVIDIA L4 (24GB VRAM): The sweet spot for most inference workloads. Runs 7B-13B models at full precision or 30B+ models with quantization. Roughly $0.70/hr on-demand. This is where most teams start.
  • NVIDIA A100 (40GB or 80GB VRAM): Required for 70B+ parameter models or high-throughput serving that needs tensor parallelism across multiple GPUs. Around $3.00-3.50/hr on-demand.
  • NVIDIA H100: Overkill for most inference. Best reserved for training workloads or extremely high throughput requirements where the additional FP8 performance justifies the cost.
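
A quick way to sanity-check these pairings is the weight-memory rule of thumb: at 16-bit precision a model needs about two bytes per parameter for weights alone, before KV cache and CUDA overhead. A rough sketch:

```python
# Rule-of-thumb VRAM estimate: ~2 bytes/parameter at fp16/bf16 (0.5 at 4-bit),
# covering weights only; KV cache and CUDA overhead come on top.
def weight_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    return params_billions * bytes_per_param

print(f"8B  fp16 : ~{weight_gb(8):.0f} GB  (fits a 24 GB L4 with KV-cache headroom)")
print(f"70B fp16 : ~{weight_gb(70):.0f} GB (needs 2x A100 80GB, tensor parallel)")
print(f"70B 4-bit: ~{weight_gb(70, 0.5):.0f} GB (tight on a single 40 GB A100)")
```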

Here is a typical GPU node configuration with taints and labels:

# GKE example: create a GPU node pool with L4 GPUs
gcloud container node-pools create gpu-l4-pool \
  --cluster=production \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=2 \
  --node-taints=nvidia.com/gpu=true:NoSchedule \
  --node-labels=gpu-type=l4,node-role=inference
# EKS example: node group with GPU AMI
eksctl create nodegroup \
  --cluster=production \
  --name=gpu-l4-pool \
  --node-type=g6.2xlarge \
  --nodes=2 \
  --node-labels="gpu-type=l4,node-role=inference" \
  --node-taints="nvidia.com/gpu=true:NoSchedule"

NVIDIA Device Plugin and GPU Operator

On managed Kubernetes (GKE, EKS with GPU AMIs), the NVIDIA device plugin is usually pre-installed. For self-managed clusters, install the GPU Operator, which bundles the driver, device plugin, and DCGM exporter:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true

Verify that your nodes expose GPU resources:

kubectl get nodes -l gpu-type=l4 -o json | \
  jq '.items[].status.allocatable | {"nvidia.com/gpu", memory, cpu}'

You should see "nvidia.com/gpu": "1" (or however many GPUs per node) in the allocatable resources. If this field is missing, the device plugin is not running correctly. We covered GPU monitoring with DCGM and cost attribution with Kubecost in detail in our GPU cost optimization guide.

Deploying vLLM on Kubernetes

This is the core of the setup. We will deploy Llama 3.1 8B Instruct on a single L4 GPU.

The Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
  namespace: inference
  labels:
    app: vllm
    model: llama3-8b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
      model: llama3-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama3-8b
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu-type
                    operator: In
                    values: ["l4"]
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.3
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--dtype"
            - "auto"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
            failureThreshold: 3
      terminationGracePeriodSeconds: 120

A few things worth calling out:

Startup probe with generous timeouts. Model loading is slow — 60-180 seconds depending on model size and disk throughput. The startupProbe with failureThreshold: 30 and periodSeconds: 10, on top of the 60-second initialDelaySeconds, gives the container up to 360 seconds to load the model before Kubernetes kills it. This is not optional: without it, the liveness probe takes effect during model loading and Kubernetes restarts the pod in a loop.

GPU memory utilization at 90%. The --gpu-memory-utilization 0.90 flag tells vLLM to use up to 90% of GPU VRAM for the KV cache. The remaining 10% is headroom for model weights and CUDA overhead. You can push this to 0.95 on dedicated inference nodes, but 0.90 is a safer default.
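
It helps to make the headroom math concrete. Per token, the KV cache stores a key and a value vector for every layer; plugging in Llama 3.1 8B's published architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128) at 16-bit precision:

```python
# Per-token KV cache footprint for Llama 3.1 8B at fp16.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # x2 for K and V
print(f"{kv_per_token} bytes/token ({kv_per_token // 1024} KiB)")

# On a 24 GB L4 at --gpu-memory-utilization 0.90, with ~16 GB of weights
# (treating GB loosely as 1e9 bytes and ignoring CUDA overhead):
kv_budget_bytes = 0.90 * 24e9 - 16e9
print(f"~{kv_budget_bytes / kv_per_token:,.0f} tokens of KV cache, "
      f"shared across all in-flight requests")
```

That shared budget, roughly 40,000 tokens, is what continuous batching carves up among concurrent requests; pushing utilization to 0.95 buys about 9,000 more.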

No memory limit on CPU. We set a memory request but no limit. LLM serving can have bursty CPU memory usage during tokenization and detokenization. A hard limit risks OOM kills during traffic spikes.

The Service

apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
  namespace: inference
spec:
  selector:
    app: vllm
    model: llama3-8b
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
  type: ClusterIP

The Hugging Face Token

Llama 3 is a gated model — you need to accept Meta's license on Hugging Face and provide an access token. Create the secret before deploying:

kubectl create namespace inference

kubectl create secret generic hf-token \
  --namespace=inference \
  --from-literal=token=hf_your_token_here

Apply the manifests and watch the rollout:

kubectl apply -f deployment.yaml -f service.yaml
kubectl -n inference rollout status deployment/vllm-llama3-8b --timeout=600s

The first pod will take 2-5 minutes to become ready as it downloads and loads the model. Subsequent pods on the same node will be faster if the model is cached on disk.
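
One caveat on that caching: as written, the Deployment keeps the Hugging Face cache inside the container filesystem, so a replaced pod downloads the model again. A hedged sketch of one fix, mounting a hostPath at the cache directory so pods on the same node reuse the weights (the node path is illustrative, and it assumes the container's default cache location of /root/.cache/huggingface; a shared PVC works too):

```yaml
# Fragment to merge into the pod spec above: persist the model cache on the node.
containers:
  - name: vllm
    volumeMounts:
      - name: model-cache
        mountPath: /root/.cache/huggingface
volumes:
  - name: model-cache
    hostPath:
      path: /var/lib/vllm-model-cache   # illustrative node path
      type: DirectoryOrCreate
```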

OpenAI-Compatible API

One of vLLM's strongest features is its native OpenAI-compatible API. It serves /v1/chat/completions, /v1/completions, and /v1/models endpoints with the same request and response format as OpenAI's API. This means existing application code using the OpenAI SDK can switch to your self-hosted model by changing two lines — the base URL and the model name.

from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-llama3-8b.inference.svc.cluster.local/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Kubernetes pod scheduling in two sentences."},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)

The same works with the TypeScript SDK:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://vllm-llama3-8b.inference.svc.cluster.local/v1",
  apiKey: "not-needed",
});

const completion = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What are the benefits of container orchestration?" },
  ],
  temperature: 0.7,
  max_tokens: 256,
});

console.log(completion.choices[0].message.content);

To verify the deployment is working from within the cluster:

kubectl run curl-test --rm -it --restart=Never \
  --image=curlimages/curl:8.11.1 -- \
  curl -s http://vllm-llama3-8b.inference.svc.cluster.local/v1/models | jq .

You should see a JSON response listing meta-llama/Llama-3.1-8B-Instruct as an available model. If you are exposing the service externally, add authentication — vLLM does not enforce API keys by default.
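
The lightest option is vLLM's own flag: the OpenAI-compatible server accepts an --api-key argument and rejects requests without a matching bearer token. A sketch of wiring it from a Secret (the secret name is illustrative; Kubernetes expands $(VLLM_API_KEY) in args from the env var defined on the container):

```yaml
# Fragment for the vLLM container spec: enforce an API key at the server.
args:
  - "--api-key"
  - "$(VLLM_API_KEY)"
env:
  - name: VLLM_API_KEY
    valueFrom:
      secretKeyRef:
        name: vllm-api-key   # illustrative
        key: key
```

Clients then pass the real key as api_key instead of "not-needed".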

Autoscaling Inference Workloads

Why HPA Alone Is Not Enough

Standard Horizontal Pod Autoscaler metrics — CPU and memory utilization — are meaningless for GPU inference. The bottleneck is GPU compute and memory, not CPU. A vLLM pod can be at 95% GPU utilization serving requests at full throughput while its CPU sits at 10%. Scaling on CPU would never trigger.

The signals that actually matter for LLM inference are:

  • Request queue depth: How many requests are waiting to be processed. If this number is consistently above zero, you need more replicas.
  • Request throughput: Requests per second being served. Useful for proactive scaling before the queue builds up.
  • Time-to-first-token latency: If P99 latency exceeds your SLA, you are under-provisioned.

vLLM exposes Prometheus metrics at /metrics including vllm:num_requests_running, vllm:num_requests_waiting, and vllm:avg_generation_throughput_toks_per_s. These are the signals we scale on.

KEDA-Based Autoscaling

First, set up a PodMonitor to scrape vLLM's Prometheus metrics:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-metrics
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm
  podMetricsEndpoints:
    - port: http
      path: /metrics
      interval: 15s

Then create a KEDA ScaledObject that queries Prometheus for the queue depth and request rate:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-llama3-8b-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-llama3-8b
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          sum(vllm:num_requests_waiting{namespace="inference", model_name="meta-llama/Llama-3.1-8B-Instruct"})
        threshold: "5"
        activationThreshold: "2"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          sum(rate(vllm:request_success_total{namespace="inference"}[2m]))
        threshold: "50"
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            - type: Pods
              value: 2
              periodSeconds: 120
        scaleDown:
          stabilizationWindowSeconds: 600
          policies:
            - type: Pods
              value: 1
              periodSeconds: 300

The scale-down policy is deliberately conservative — 600 seconds of stabilization and removing only 1 pod every 5 minutes. GPU pods take 2-5 minutes to become ready after scheduling (model loading), so aggressive scale-down followed by scale-up wastes money on pods that are loading models instead of serving requests. We covered the principles behind KEDA-based scaling for Kubernetes workloads in more detail in our AI-driven monitoring and scaling guide.

Multi-Model Serving Patterns

Separate Deployments per Model

The simplest and most reliable pattern is one Deployment per model. Each model gets its own Service, ScaledObject, and resource profile. The advantages are clear: independent scaling, independent rollouts, no model-loading contention, and straightforward cost attribution per model.

The downside is manifest sprawl. If you are serving five models, you have five Deployments, five Services, and five ScaledObjects. Kustomize handles this cleanly:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - base/

patches:
  - target:
      kind: Deployment
      name: vllm-model
    patch: |
      - op: replace
        path: /metadata/name
        value: vllm-mistral-7b
      - op: replace
        path: /spec/template/spec/containers/0/args/1
        value: mistralai/Mistral-7B-Instruct-v0.3
      - op: replace
        path: /spec/selector/matchLabels/model
        value: mistral-7b
      - op: replace
        path: /spec/template/metadata/labels/model
        value: mistral-7b
  - target:
      kind: Service
      name: vllm-model
    patch: |
      - op: replace
        path: /metadata/name
        value: vllm-mistral-7b
      - op: replace
        path: /spec/selector/model
        value: mistral-7b

Create an overlay per model, and kustomize build gives you the complete manifest set.

Model Routing with a Gateway

For teams serving five or more models, a routing layer simplifies client integration. Instead of each application knowing which Kubernetes Service to call for each model, clients hit a single gateway that routes based on the model field in the OpenAI-compatible request body:

from contextlib import asynccontextmanager
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
import httpx

MODEL_BACKENDS = {
    "meta-llama/Llama-3.1-8B-Instruct": "http://vllm-llama3-8b.inference",
    "mistralai/Mistral-7B-Instruct-v0.3": "http://vllm-mistral-7b.inference",
    "meta-llama/Llama-3.1-70B-Instruct": "http://vllm-llama3-70b.inference",
}

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.http_client = httpx.AsyncClient(timeout=120.0)
    yield
    await app.state.http_client.aclose()

app = FastAPI(lifespan=lifespan)

@app.post("/v1/chat/completions")
async def route_chat(request: Request):
    body = await request.json()
    model = body.get("model", "")
    backend = MODEL_BACKENDS.get(model)
    if not backend:
        raise HTTPException(status_code=404, detail=f"Unknown model: {model}")

    # Note: this buffers the full backend response, so streaming requests
    # ("stream": true) are not supported by this minimal router.
    resp = await request.app.state.http_client.post(
        f"{backend}/v1/chat/completions",
        json=body,
    )
    return JSONResponse(content=resp.json(), status_code=resp.status_code)

@app.get("/v1/models")
async def list_models():
    return {
        "object": "list",
        "data": [
            {"id": model, "object": "model"} for model in MODEL_BACKENDS
        ],
    }

This gateway is stateless and lightweight — deploy it on CPU nodes. Clients use a single base URL and the standard OpenAI SDK, and the gateway handles routing transparently.

Production Guardrails

Health Checks and Startup Probes

We already covered the probe configuration in the Deployment manifest, but it bears repeating: LLM model loading is the critical startup bottleneck. A Llama 3 8B model takes 60-90 seconds to load on an L4 GPU. A 70B model on A100s can take 3-5 minutes. If your startup probe's total budget (initialDelaySeconds + failureThreshold * periodSeconds) is shorter than the model loading time, Kubernetes will kill the pod in an infinite restart loop.

We set failureThreshold: 30 with periodSeconds: 10 on top of a 60-second initialDelaySeconds, for a total budget of 360 seconds. Adjust upward for larger models.

Graceful Shutdown

When a pod is terminated — during a scale-down, rolling update, or node drain — vLLM needs time to finish in-flight requests. Streaming responses in particular can take 10-30 seconds for long completions.

Set terminationGracePeriodSeconds to at least 120 seconds. Adding a preStop hook with a short sleep gives the Service time to remove the pod from its endpoint list before vLLM starts rejecting new connections:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]
terminationGracePeriodSeconds: 120

The 15-second sleep in the preStop hook gives kube-proxy time to update iptables rules so new requests stop arriving. Then vLLM has the remaining ~105 seconds to drain in-flight requests.

Resource Quotas

Prevent runaway GPU allocation by setting a ResourceQuota on the inference namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-gpu-quota
  namespace: inference
spec:
  hard:
    requests.nvidia.com/gpu: "16"
    requests.memory: "256Gi"

This caps the namespace at 16 GPUs total. (For extended resources like nvidia.com/gpu, ResourceQuota only accepts the requests. prefix; since Kubernetes forbids overcommitting extended resources, a request quota is sufficient.) Without it, a misconfigured ScaledObject or manual scaling mistake can allocate every GPU in your cluster.

Monitoring and Alerting

Set up Prometheus alerts for the failure modes that matter most:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
  namespace: inference
spec:
  groups:
    - name: vllm.rules
      rules:
        - alert: VLLMHighQueueDepth
          expr: sum(vllm:num_requests_waiting{namespace="inference"}) > 20
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "vLLM request queue depth is high"
            description: "{{ $value }} requests waiting in queue for over 2 minutes. Consider scaling up or check if KEDA is functioning."
        - alert: VLLMHighLatency
          expr: histogram_quantile(0.99, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket{namespace="inference"}[5m]))) > 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "vLLM P99 latency exceeds 10 seconds"
            description: "P99 request latency is {{ $value }}s. This typically indicates GPU saturation or insufficient replicas."
        - alert: VLLMNoPods
          expr: kube_deployment_status_replicas_available{deployment=~"vllm-.*", namespace="inference"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "No vLLM pods available for serving"
            description: "Deployment {{ $labels.deployment }} has zero available replicas. Inference is down."

The VLLMHighQueueDepth alert is your early warning. If the queue is consistently non-empty and KEDA is not scaling up, something is wrong — either KEDA is misconfigured, there are no schedulable GPU nodes, or you have hit the maxReplicaCount ceiling.

Getting Started

If you are evaluating self-hosted LLMs for the first time, we recommend a phased approach:

  1. Start with a single model on a single GPU. Deploy Llama 3.1 8B or Mistral 7B on an L4 node. Get the Deployment, Service, and health checks working. Validate that the OpenAI-compatible API works with your existing application code by changing only the base URL.

  2. Add observability. Deploy the PodMonitor and verify that vLLM metrics appear in Prometheus. Set up a Grafana dashboard tracking request throughput, queue depth, GPU utilization, and latency percentiles.

  3. Enable autoscaling. Install KEDA and configure the ScaledObject. Load test with a realistic traffic pattern and verify that pods scale up under load and back down during quiet periods.

  4. Expand to additional models. Use the Kustomize overlay pattern to deploy additional models. If you are serving more than three models, consider adding the routing gateway.

The operational complexity is real but manageable. Teams with existing Kubernetes experience typically have a production inference stack running in one to two weeks. The hardest part is not the Kubernetes manifests. It is deciding which models to serve and at what quality bar. The infrastructure should follow that decision, not constrain it.