Building a Production Feature Flag Service with Claude Code

Shipping features to production without a kill switch is a gamble most teams take until something breaks at 2 AM. Feature flags solve this by decoupling deployment from release, but most self-hosted solutions are either too simplistic or too expensive. We set out to build FlagSignals — a multi-tenant feature flag service with targeting rules, A/B testing, and subscription billing — and we built the entire thing using Claude Code as our development partner.

This is the story of the technology choices, the architecture, and what AI-assisted development actually looks like on a non-trivial project.

The Tech Stack

FlagSignals runs on a modern full-stack architecture designed for low-latency flag evaluation at the edge and a rich management dashboard.

Frontend and API layer:

  • Next.js 16 with the App Router and React Server Components
  • React 19 with TypeScript in strict mode
  • Tailwind CSS 4 with shadcn/ui components for a consistent dark-themed UI
  • React Hook Form and Zod for form validation

Database and auth:

  • Supabase (managed PostgreSQL) with Row Level Security for multi-tenant data isolation
  • Supabase Auth for email/password authentication and session management
  • Seven incremental SQL migrations covering the full schema evolution

Edge runtime:

  • Supabase Edge Functions (Deno-based) for flag evaluation and conversion tracking
  • Deterministic hashing for sticky experiment assignment
  • Sub-millisecond evaluation with global distribution

Billing and email:

  • Stripe for subscription management, checkout, and customer portal
  • Webhook-driven subscription lifecycle (created, updated, deleted, payment failed)
  • Resend for transactional team invitation emails

Deployment:

  • Vercel for the Next.js application
  • Supabase Cloud for database, auth, and edge functions

This stack was chosen for a specific reason: minimize operational overhead while maximizing capability. Supabase gives us Postgres, auth, realtime, and serverless functions in one platform. Next.js on Vercel handles both the dashboard UI and the API routes with zero infrastructure management. Stripe handles the entire billing lifecycle so we never touch payment card data.

Architecture Decisions That Mattered

Multi-Tenant Isolation with Row Level Security

The most critical architectural decision was using PostgreSQL Row Level Security (RLS) instead of application-level tenant filtering. Every table in the database has RLS policies that enforce organization-level isolation:

-- Helper function used across all policies
CREATE FUNCTION user_org_ids() RETURNS SETOF uuid AS $$
  SELECT org_id FROM org_members
  WHERE user_id = auth.uid()
$$ LANGUAGE sql STABLE SECURITY DEFINER SET search_path = public;

-- Example policy on the projects table
CREATE POLICY "Users can view their org projects"
  ON projects FOR SELECT
  USING (org_id IN (SELECT user_org_ids()));

This means even if application code has a bug, the database itself prevents cross-tenant data access. The tradeoff is complexity in writing policies and the need for a service-role client that bypasses RLS for administrative operations like webhook processing.

Edge Functions for Flag Evaluation

The flag evaluation endpoint is the hottest path in the system — every feature check in a client application hits it. We deployed this as a Supabase Edge Function running on Deno, which gives us global distribution and fast cold starts.

The evaluation logic follows a clear priority chain:

  1. If the flag is disabled, return the default value
  2. If there is a running experiment and the user falls within its traffic allocation, assign a variant deterministically
  3. If targeting rules match the provided context, return the rule value
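  4. Otherwise, return the default enabled value

// Deterministic experiment assignment using hashing
const hash = await crypto.subtle.digest(
  'SHA-256',
  new TextEncoder().encode(`${experimentId}:${userId}`)
);
// Use the first 32 bits of the digest. A 16-bit value taken modulo 10000
// would noticeably favor the low buckets (65536 is not a multiple of 10000).
const bucket = new DataView(hash).getUint32(0) % 10000;
// trafficAllocation is a percentage (0-100), scaled here to the 0-9999 bucket range
const isInExperiment = bucket < trafficAllocation * 100;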

This ensures the same user always sees the same variant without storing assignment state on the client. Assignments are recorded asynchronously to the database for analytics.
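As a rough illustration of the priority chain described above, here is a minimal sketch. It is not the production evaluator: every type and field name is hypothetical, only an equality operator is modeled for targeting rules, and the even split of variants by bucket is an assumption made to keep the example short.

```typescript
// All names below are hypothetical and exist only for this sketch.
type FlagValue = boolean | string | number;

interface TargetingRule {
  attribute: string; // context key to inspect, e.g. "plan"
  equals: string;    // only an equality operator is modeled here
  value: FlagValue;  // value returned when the rule matches
}

interface Experiment {
  running: boolean;
  trafficAllocation: number; // percent of users enrolled, 0-100
  variants: FlagValue[];     // assumed to be split evenly
}

interface Flag {
  enabled: boolean;
  defaultValue: FlagValue; // returned when the flag is disabled
  enabledValue: FlagValue; // returned when enabled and nothing else matches
  rules: TargetingRule[];
  experiment?: Experiment;
}

function evaluateFlag(
  flag: Flag,
  context: Record<string, string>,
  bucket: number | null, // 0-9999 from the hash, or null when no user ID was sent
): FlagValue {
  // 1. A disabled flag short-circuits everything else.
  if (!flag.enabled) return flag.defaultValue;

  // 2. A running experiment takes priority over targeting rules,
  //    but only for users inside the traffic allocation.
  const exp = flag.experiment;
  if (exp && exp.running && bucket !== null && exp.variants.length > 0) {
    if (bucket < exp.trafficAllocation * 100) {
      return exp.variants[bucket % exp.variants.length]!;
    }
  }

  // 3. The first matching targeting rule wins.
  for (const rule of flag.rules) {
    if (context[rule.attribute] === rule.equals) return rule.value;
  }

  // 4. Fallback for an enabled flag with no experiment or rule match.
  return flag.enabledValue;
}
```

Passing the bucket in as a plain number keeps the function synchronous and easy to unit test; the asynchronous hashing step stays at the edge of the handler.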

Stripe Webhook Architecture

Rather than polling Stripe for subscription status, we use webhooks to keep our database in sync. The webhook handler processes four event types:

// Webhook events we handle
switch (event.type) {
  case 'checkout.session.completed':
    // Create or update subscription record
    break;
  case 'customer.subscription.updated':
    // Sync tier, status, and period dates
    break;
  case 'customer.subscription.deleted':
    // Mark subscription as canceled
    break;
  case 'invoice.payment_failed':
    // Update status to past_due
    break;
  default:
    // Ignore event types we do not handle
    break;
}

Subscription limits are then enforced in the evaluate edge function — if a trial has expired or the monthly API request quota is exhausted, flag evaluations return an error rather than silently failing.
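To make that enforcement step concrete, here is a small sketch of what such a check could look like. The record shape, the error codes, and the HTTP status codes are all assumptions for illustration; the post does not show the real subscription schema or response format.

```typescript
// Hypothetical shapes; the real subscription record is not shown in this post.
type SubscriptionStatus = "trialing" | "active" | "past_due" | "canceled";

interface Subscription {
  status: SubscriptionStatus;
  trialEndsAt: Date | null;    // null when the org never had a trial
  monthlyRequestLimit: number; // assumed per-tier quota
}

type LimitCheck =
  | { allowed: true }
  | { allowed: false; httpStatus: number; error: string };

function checkSubscriptionLimits(
  sub: Subscription,
  requestsThisMonth: number,
  now: Date,
): LimitCheck {
  // An expired trial blocks evaluation with an explicit error.
  if (
    sub.status === "trialing" &&
    sub.trialEndsAt !== null &&
    now.getTime() > sub.trialEndsAt.getTime()
  ) {
    return { allowed: false, httpStatus: 402, error: "trial_expired" };
  }

  // An exhausted monthly quota is reported separately so clients can tell
  // the two cases apart.
  if (requestsThisMonth >= sub.monthlyRequestLimit) {
    return { allowed: false, httpStatus: 429, error: "quota_exceeded" };
  }

  return { allowed: true };
}
```

Returning a tagged result instead of throwing lets the handler map each failure to a response in one place.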

Building with Claude Code

We used Claude Code throughout the entire development process, from initial scaffolding to the final deployment guide. Here is what that workflow actually looked like in practice.

Schema design and migrations. We described the data model — organizations, projects, environments, flags, targeting rules, experiments — and Claude Code generated the SQL migrations including RLS policies, helper functions, and proper foreign key relationships. The multi-tenant RLS pattern with user_org_ids() and user_has_org_role() helper functions came from iterating on the security model through conversation.

API routes and edge functions. The evaluate edge function is roughly 300 lines of TypeScript handling API key validation, subscription checks, targeting rule evaluation, experiment assignment, and usage metering. Claude Code wrote the initial implementation and then we refined the evaluation priority chain, error handling, and the deterministic hashing logic for experiments through successive iterations.

Dashboard UI. The management dashboard includes project CRUD, flag management with per-environment overrides, targeting rule configuration, experiment creation and analytics, team member invitations, and billing management. Claude Code generated the components using shadcn/ui primitives, wired up the API calls, and handled the form validation with Zod schemas.

Stripe integration. Billing was the most complex integration. Claude Code implemented the checkout flow, webhook handler, customer portal redirect, subscription tier enforcement, and usage metering. The webhook signature verification and idempotent event processing required careful attention to Stripe's specific patterns.
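Idempotency matters here because Stripe can deliver the same event more than once. The sketch below shows the general idea of claiming an event ID before handling it. The store and function names are hypothetical, and the in-memory Set stands in for what would realistically be a database table with a unique constraint on the event ID.

```typescript
// Minimal event shape; real Stripe events carry many more fields.
interface WebhookEvent {
  id: string;   // unique per event, e.g. "evt_..."
  type: string;
}

// Hypothetical store. An in-memory Set is used only to keep the sketch
// self-contained; it would not survive restarts or multiple instances.
class ProcessedEventStore {
  private seen = new Set<string>();

  // Returns true only the first time an ID is claimed.
  claim(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }

  // Gives the claim back so a retry can process the event again.
  release(eventId: string): void {
    this.seen.delete(eventId);
  }
}

function handleOnce(
  store: ProcessedEventStore,
  event: WebhookEvent,
  handler: (event: WebhookEvent) => void,
): "processed" | "duplicate" {
  // A repeated delivery is acknowledged without re-running side effects.
  if (!store.claim(event.id)) return "duplicate";
  try {
    handler(event);
  } catch (err) {
    // If handling fails, release the claim so a later retry is not skipped.
    store.release(event.id);
    throw err;
  }
  return "processed";
}
```

A duplicate should still be answered with a success response, otherwise the sender keeps retrying an event that has already been applied.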

A/B testing analytics. The experiment analytics endpoint calculates conversion rates, relative improvement (lift), and statistical significance using a z-test for proportions. Claude Code implemented the statistical calculations and the daily breakdown aggregation query:

SELECT
  DATE(ea.created_at) AS day,
  ea.variant,
  -- DISTINCT keeps the count correct when the join yields several rows per user
  COUNT(DISTINCT ea.user_identifier) AS assignments,
  COUNT(DISTINCT ec.user_identifier) AS conversions
FROM experiment_assignments ea
LEFT JOIN experiment_conversions ec USING (experiment_id, user_identifier)
GROUP BY day, ea.variant
ORDER BY day;
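For readers unfamiliar with the statistics, the following sketch shows one common form of the z-test for two proportions, using a pooled standard error and a two-sided p-value. It is an illustration rather than the code in the analytics endpoint: the function and field names are hypothetical, and the normal CDF uses the Abramowitz-Stegun approximation of the error function to avoid a dependency.

```typescript
// Abramowitz-Stegun approximation of the error function (formula 7.1.26),
// accurate to roughly 1.5e-7.
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Hypothetical input shape: totals per variant over the analysis window.
interface VariantStats {
  assignments: number;
  conversions: number;
}

interface ZTestResult {
  controlRate: number;
  variantRate: number;
  lift: number | null; // relative improvement; null when the control rate is 0
  zScore: number;
  pValue: number;      // two-sided
}

function twoProportionZTest(control: VariantStats, variant: VariantStats): ZTestResult {
  const n1 = control.assignments;
  const n2 = variant.assignments;
  const controlRate = n1 > 0 ? control.conversions / n1 : 0;
  const variantRate = n2 > 0 ? variant.conversions / n2 : 0;
  const lift = controlRate > 0 ? (variantRate - controlRate) / controlRate : null;

  // Pooled proportion under the null hypothesis that both rates are equal.
  const pooled = n1 + n2 > 0 ? (control.conversions + variant.conversions) / (n1 + n2) : 0;
  const standardError =
    n1 > 0 && n2 > 0 ? Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2)) : 0;

  // With no data or no variance the test is undefined; report "no evidence".
  if (standardError === 0) {
    return { controlRate, variantRate, lift, zScore: 0, pValue: 1 };
  }

  const zScore = (variantRate - controlRate) / standardError;
  const pValue = 2 * (1 - normalCdf(Math.abs(zScore)));
  return { controlRate, variantRate, lift, zScore, pValue };
}
```

A dashboard would typically call a result significant when pValue falls below a chosen threshold such as 0.05, though the threshold is a product decision rather than part of the test itself.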

What Worked Well

Iteration speed on boilerplate. Multi-tenant SaaS has enormous amounts of repetitive pattern code — CRUD endpoints, RLS policies, form components, API route handlers. Claude Code excels at generating these consistently once you establish the pattern for the first one.

Complex logic with clear specs. The targeting rule evaluator, experiment assignment algorithm, and analytics calculations all have well-defined inputs and outputs. Describing the expected behavior and edge cases to Claude Code produced correct implementations faster than writing them from scratch.

Cross-stack consistency. Claude Code maintained consistency between the database schema types, the TypeScript types in database.types.ts, the API route handlers, and the frontend components. When we added the experiments feature, it generated matching changes across all layers.

Where Human Judgment Was Essential

Security model review. Every RLS policy and authorization check required careful human review. AI-generated security code can look correct while having subtle gaps — for example, ensuring that the service-role client is never exposed to the frontend, or that webhook signature verification cannot be bypassed.

Architectural tradeoffs. Decisions like using edge functions versus API routes for evaluation, choosing RLS over application-level filtering, and structuring the subscription enforcement — these required understanding the operational implications that only come from production experience.

Third-party API integration. Stripe's webhook patterns, Supabase's auth flow quirks, and Resend's API all have undocumented behaviors and edge cases. Claude Code provided the structure, but getting these integrations production-ready required testing against real APIs and reading changelogs.

The Result

FlagSignals ships as a complete feature flag platform with four subscription tiers (Free, Starter at $19/month, Pro at $79/month, and Enterprise), supporting boolean, string, number, and JSON flag types. It includes percentage-based rollouts, attribute-based targeting with 12 operators, A/B testing with statistical significance analysis, team collaboration with role-based access control, and a full audit log.

The entire codebase — dashboard, API, edge functions, migrations, and deployment configuration — was built with Claude Code as a development accelerator. The key insight is that AI-assisted development works best when you treat the AI as a fast, knowledgeable pair programmer rather than an autonomous agent. You still need to understand the architecture, review the security model, and test against real services. But the iteration speed on a project of this scope is dramatically higher than writing every line by hand.

For teams evaluating whether AI-assisted development is practical for production applications: it is, provided you maintain the same review standards you would apply to any code contribution. The technology does not replace engineering judgment — it amplifies it.