llm

12 posts tagged “llm”

May 24, 2026

The Cost-Efficient AI Stack: Ship AI Features Without the Runaway Bill

Most teams overpay for AI by routing every request to a frontier model. This is the architecture we build instead — hybrid cloud+local routing, self-hosted inference, agent orchestration, and cost-per-request observability — and the single principle that ties it together: send each unit of work to the cheapest model that can do it well.

ai llm cost-optimization hybrid infrastructure finops

June 19, 2026

The Local AI Inflection Point: What the Next Three Years Actually Look Like

Local AI is crossing a threshold where on-device and self-hosted models stop being cost-cutting compromises and start being the default choice. Here's what's driving that shift and what it means for how you build software.

ai local-models llm inference edge-computing

June 7, 2026

Building a Hybrid LLM Platform on EKS, Part 5: Serving Local Models with vLLM and KEDA

Part 5 of our hands-on EKS series. We deploy vLLM model servers on the GPU pool from Part 4, load Qwen2.5-7B model weights from Amazon S3 via an init container, and wire KEDA autoscaling that scales replicas with live queue depth and drives GPU nodes to zero overnight.

eks kubernetes aws-cdk vllm keda gpu autoscaling llm ai-infrastructure typescript

June 7, 2026

Building a Hybrid LLM Platform on EKS, Part 6: The Hybrid Router

Part 6 of our hands-on EKS series. We build a TypeScript/Hono router that sits in front of both vLLM and the Anthropic API, routes each request to the right backend based on model name and complexity heuristics, and falls back to cloud when the local model is cold-starting.

eks kubernetes aws-cdk hono typescript llm routing hybrid-ai ai-infrastructure

June 7, 2026

Building a Hybrid LLM Platform on EKS, Part 8: Testing, Load, and Examples

The final part of our EKS series. We write integration tests with Vitest, load-test the ALB with k6, build three real-world TypeScript workloads that prove the hybrid routing works, and use the Grafana and Langfuse dashboards from Part 7 to verify the platform under traffic.

eks kubernetes aws-cdk vitest k6 testing typescript llm ai-infrastructure

May 29, 2026

Securing Self-Hosted LLMs and AI Agents on Kubernetes

Harden self-hosted vLLM and AI agents on Kubernetes: an auth/rate-limit gateway, gVisor tool sandboxing, prompt-injection guardrails, scoped secrets, and signed model weights — mapped to the OWASP LLM Top 10.

security ai agents kubernetes llm prompt-injection supply-chain

May 24, 2026

Building a Hybrid LLM Platform on EKS, Part 1: Architecture and the Network Foundation

Part 1 of a hands-on series building the EKS-based hybrid LLM platform referenced throughout this blog. We map out the full architecture, then provision the VPC, subnets, NAT, and VPC endpoints with AWS CDK — the network foundation every later part builds on.

eks kubernetes aws-cdk llm ai-infrastructure hybrid-ai vpc typescript

May 23, 2026

Build a Personal AI Dev Environment: Hybrid Models, Local Inference, and a Workflow That Costs Almost Nothing

The production patterns we deploy for teams — hybrid cloud/local routing, self-hosted models, agent orchestration — scaled down to a single developer's workstation. A practical guide to building a personal AI dev environment with Ollama, Claude Code, and a local router that keeps your token bill near zero.

ai llm local-models ollama claude-code developer-tools

May 22, 2026

The Agent Control Plane: Frontier Models Plan, Your Kubernetes Fleet Executes

How to orchestrate a fleet of AI agents using a shared task queue — frontier models like Claude handle planning and decomposition, while a local Kubernetes worker pool runs the high-volume execution tasks. Covers the task ledger, dynamic task creation, lane-based routing, and KEDA autoscaling.

ai agents orchestration kubernetes llm hybrid

May 21, 2026

Observability for LLM Applications on Kubernetes: Tokens, Traces, and Cost per Request

How to instrument self-hosted and hybrid LLM workloads with OpenTelemetry, Prometheus, and Langfuse — tracking time-to-first-token, tokens per second, GPU utilization, and unit economics down to the individual request.

kubernetes llm observability opentelemetry finops ai-infrastructure

May 14, 2026

The Hybrid AI Playbook: Cloud Models for Thinking, Local Models for Doing

How to cut your AI costs by 60–80% using a hybrid approach: Claude or GPT for planning and complex reasoning, local models like Llama and Qwen for execution tasks like code generation, summarization, and data extraction.

ai llm cost-optimization local-models ollama

April 3, 2026

Self-Hosting LLMs on Kubernetes: A Practical Guide

How to deploy, serve, and autoscale open-source large language models on Kubernetes with vLLM — from GPU node pools and deployment manifests to KEDA-based autoscaling and production guardrails.

kubernetes llm gpu ai-infrastructure self-hosting