How to Cut Your AWS Bill in Half Without Changing Your Architecture

Most growing teams are overpaying on AWS by 30-50%. Not because they made bad decisions — because nobody has looked at the bill in a while. Infrastructure gets set up when things are moving fast, resources get provisioned for peak load or a one-time migration, and then nobody goes back to right-size them.

The result is predictable: EBS volumes attached to nothing, NAT Gateways burning money around the clock, RDS instances running at 8% CPU utilization on a db.r6g.xlarge, and load balancers sitting in front of services that were decommissioned months ago.

This is the exact checklist we run through in every infrastructure audit. The findings are surprisingly consistent across companies of all sizes. A team spending $8,000/month on AWS typically has $2,500-$4,000 in waste hiding in plain sight.

1. Unattached EBS Volumes

This is the single most common source of waste and the easiest to fix. When you terminate an EC2 instance, its EBS volumes are not automatically deleted unless you explicitly configured that behavior. The volumes persist, accruing storage charges, attached to nothing.

Find them:

aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query "Volumes[*].{ID:VolumeId,Size:Size,Type:VolumeType,Created:CreateTime}" \
  --output table

Any volume with a status of available is unattached. Review the list — some may be intentional backups, but most are leftovers from terminated instances.

The cost adds up faster than people expect. A single unattached 500GB gp3 volume costs about $40/month. Find ten of those across your account and you are paying $400/month for storage nobody is using.

Before deleting, snapshot anything you might need:

# Snapshot first, then delete
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "Backup before cleanup - $(date +%Y-%m-%d)"

# After confirming snapshot is complete
aws ec2 delete-volume --volume-id vol-0123456789abcdef0

Snapshots cost a fraction of live volumes ($0.05/GB-month vs. $0.08/GB-month for gp3) and can be restored if you ever need the data.

Prevention: Set DeleteOnTermination: true for root volumes in your launch templates and Terraform/CDK configurations. For non-root volumes, tag them with the instance they belong to so you can trace orphans; a tagging sketch follows the root-volume example below.

# Terraform — prevent orphaned root volumes
resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = var.instance_type

  root_block_device {
    volume_size           = 30
    volume_type           = "gp3"
    delete_on_termination = true
  }
}
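
For non-root volumes, a minimal Terraform sketch of the tagging approach; the aws_ebs_volume and aws_volume_attachment resources here are illustrative, not part of the config above:

# Terraform: tag data volumes with the instance they serve so orphans are traceable
resource "aws_ebs_volume" "app_data" {
  availability_zone = aws_instance.app.availability_zone
  size              = 100
  type              = "gp3"

  tags = {
    Name       = "app-data"
    AttachedTo = aws_instance.app.id
  }
}

resource "aws_volume_attachment" "app_data" {
  device_name = "/dev/xvdf"
  volume_id   = aws_ebs_volume.app_data.id
  instance_id = aws_instance.app.id
}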

2. NAT Gateway Costs

This is the line item that shocks people. NAT Gateways charge $0.045 per hour ($32.40/month) just to exist, plus $0.045 per GB of data processed. A single NAT Gateway handling moderate traffic — container image pulls, API calls to external services, package downloads — can easily cost $150-$300/month. If you have one per availability zone (the recommended HA setup), multiply that by two or three.

Check your NAT Gateway spend:

# Get NAT Gateway data processing over the last 7 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Sum \
  --output table

Common fixes:

Use VPC endpoints for AWS services. If your private subnets are talking to S3, ECR, DynamoDB, SQS, or other AWS services through the NAT Gateway, you are paying data processing charges for traffic that could go over a free or near-free VPC endpoint instead.

# Gateway endpoint for S3 — free, no per-GB charge
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.region}.s3"

  route_table_ids = [
    aws_route_table.private_a.id,
    aws_route_table.private_b.id,
  ]
}

# Interface endpoint for ECR — $0.01/hr but eliminates NAT data charges
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true

  subnet_ids = [
    aws_subnet.private_a.id,
    aws_subnet.private_b.id,
  ]

  security_group_ids = [aws_security_group.vpc_endpoints.id]
}

S3 gateway endpoints are free. ECR, CloudWatch, and other interface endpoints cost $0.01/hour per AZ but eliminate the $0.045/GB NAT data processing charge. Note that ECR image pulls need both the ecr.api and ecr.dkr interface endpoints plus the S3 gateway endpoint, since image layers are served from S3. If you are pulling container images from ECR multiple times a day, the endpoints pay for themselves immediately.

Evaluate whether you need a NAT Gateway per AZ. The high-availability recommendation is one NAT Gateway per availability zone. But if your workload can tolerate a brief interruption during an AZ failure, a single NAT Gateway saves $32.40/month per eliminated gateway (the data processing rate stays the same, though traffic from other AZs now picks up a small cross-AZ transfer charge). For non-production environments, this is almost always the right call.
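
If you do consolidate, the change is mostly routing: every private route table points at one shared gateway. A minimal Terraform sketch, reusing the private route table names from the endpoint example above and assuming a public subnet named public_a:

# One shared NAT Gateway instead of one per AZ
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "shared" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public_a.id
}

# All private route tables default-route through the same gateway
resource "aws_route" "private_a_default" {
  route_table_id         = aws_route_table.private_a.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.shared.id
}

resource "aws_route" "private_b_default" {
  route_table_id         = aws_route_table.private_b.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.shared.id
}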

Check what traffic is actually going through NAT. Use VPC Flow Logs to identify the top talkers:

# Enable flow logs if not already active (requires an IAM role that lets VPC Flow Logs write to CloudWatch Logs)
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/vpc-flow-logs-role

# Query with CloudWatch Logs Insights
# Top destinations by bytes through NAT
filter interfaceId = "eni-nat-gateway-id"
| stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc
| limit 20

You will often find that the majority of NAT traffic is going to AWS service endpoints — traffic that VPC endpoints would handle for free or near-free.

3. Oversized RDS Instances

RDS is one of the largest line items for most teams, and the instances are almost always oversized. The default behavior is understandable: when setting up a production database, engineers pick a generous instance size to avoid performance issues. Then traffic never grows to justify it, but nobody goes back to resize.

Check your RDS utilization:

# CPU utilization over the last 2 weeks
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=production-db \
  --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average Maximum \
  --output table

If your average CPU utilization is below 20% and your peak is below 50%, you are almost certainly running an instance class larger than you need.

Common right-sizing moves:

| Current Instance | Typical Right-Size | Monthly Savings |
|---|---|---|
| db.r6g.xlarge (4 vCPU, 32 GB) | db.r6g.large (2 vCPU, 16 GB) | ~$350/mo |
| db.r6g.2xlarge (8 vCPU, 64 GB) | db.r6g.xlarge (4 vCPU, 32 GB) | ~$700/mo |
| db.m6g.xlarge (4 vCPU, 16 GB) | db.m6g.large (2 vCPU, 8 GB) | ~$200/mo |

Check memory too, not just CPU. RDS Performance Insights (free tier available) shows whether your workload is CPU-bound or memory-bound. If freeable memory consistently stays above half of the instance's total RAM, you can likely drop to a smaller instance class.
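
FreeableMemory is also a standard CloudWatch metric, so you can pull it the same way as CPU. A sketch reusing the production-db identifier from above:

# Freeable memory (in bytes) over the last 2 weeks; compare the minimum against total instance RAM
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeableMemory \
  --dimensions Name=DBInstanceIdentifier,Value=production-db \
  --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Minimum Average \
  --output table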

Consider Aurora Serverless v2 for variable workloads. If your database usage spikes during business hours and drops to near-zero overnight, Aurora Serverless v2 scales down to a minimum of 0.5 ACU (about $0.06/hr at $0.12 per ACU-hour) during off-hours. For a database that is busy 8 hours a day and idle the other 16, this can cut costs by 60% compared to a provisioned instance sized for peak load.
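
For reference, a minimal Terraform sketch of the Serverless v2 shape; the cluster name, credentials, and capacity bounds are placeholders, and migrating an existing database onto it is a separate project:

# Aurora Serverless v2: capacity floats between 0.5 and 4 ACUs with load
resource "aws_rds_cluster" "app" {
  cluster_identifier = "app-aurora"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned"  # Serverless v2 runs under the provisioned engine mode
  master_username    = var.db_username
  master_password    = var.db_password

  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 4.0
  }
}

resource "aws_rds_cluster_instance" "app" {
  cluster_identifier = aws_rds_cluster.app.id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.app.engine
}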

Do not skip Multi-AZ evaluation. Multi-AZ deployments double your RDS cost. For production databases, this is worth it. For staging and development databases, it rarely is. Check which environments have Multi-AZ enabled:

aws rds describe-db-instances \
  --query "DBInstances[*].{ID:DBInstanceIdentifier,Class:DBInstanceClass,MultiAZ:MultiAZ,Engine:Engine}" \
  --output table

Turning off Multi-AZ on a staging database that nobody uses outside business hours saves the full cost of the standby instance.
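
The change itself is a single modify call, sketched here with a hypothetical staging-db identifier; schedule it like any other change to a live instance:

# Convert a non-production database from Multi-AZ to Single-AZ
aws rds modify-db-instance \
  --db-instance-identifier staging-db \
  --no-multi-az \
  --apply-immediately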

4. Idle and Orphaned Load Balancers

Every Application Load Balancer costs roughly $16/month in hourly charges ($0.0225/hr) before it processes a single request. Add LCU charges for actual traffic and each ALB typically runs $25-$50/month.

Find load balancers with no healthy targets or zero traffic:

# List all ALBs
aws elbv2 describe-load-balancers \
  --query "LoadBalancers[*].{Name:LoadBalancerName,ARN:LoadBalancerArn,State:State.Code}" \
  --output table

# Check request count for each ALB over the last 7 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCount \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 604800 \
  --statistics Sum \
  --output text

If a load balancer has zero requests over seven days, it is almost certainly orphaned. Common causes: a service was decommissioned but the ALB was not cleaned up, or a Kubernetes Ingress was deleted but the AWS Load Balancer Controller did not properly clean up the ALB.
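
The request count covers the traffic half of the check. For the "no healthy targets" half, list each ALB's target groups and their health, sketched here with placeholder ARNs:

# Target groups attached to the ALB
aws elbv2 describe-target-groups \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/1234567890abcdef \
  --query "TargetGroups[*].TargetGroupArn" \
  --output text

# Health of the registered targets; empty output means nothing is registered at all
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890abcdef \
  --query "TargetHealthDescriptions[*].TargetHealth.State" \
  --output text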

Also check Classic Load Balancers. If your account has been around for a few years, you may still have CLBs running. They cost about the same as ALBs but lack features. Migrating to ALB or deleting unused CLBs is free savings.
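
Classic Load Balancers live under the older elb API, so they will not appear in the elbv2 listing above:

# List any remaining Classic Load Balancers and the instances behind them
aws elb describe-load-balancers \
  --query "LoadBalancerDescriptions[*].{Name:LoadBalancerName,Instances:Instances[*].InstanceId}" \
  --output table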

5. CloudWatch Log Retention

CloudWatch Logs default retention is "Never expire." Every log line your application produces stays in CloudWatch forever, accruing storage charges at $0.03 per GB per month. For an application producing 10 GB of logs per month, that works out to $3.60/month in storage by the end of year one, $7.20/month by the end of year two, and so on, growing permanently.

Check your log groups:

aws logs describe-log-groups \
  --query "logGroups[?retentionInDays==null].{Name:logGroupName,StoredBytes:storedBytes}" \
  --output table

Any log group without a retentionInDays value is retaining logs forever.

Set retention policies based on actual need:

# Application logs — 30 days is usually sufficient
aws logs put-retention-policy \
  --log-group-name /ecs/my-application \
  --retention-in-days 30

# Infrastructure logs — 14 days
aws logs put-retention-policy \
  --log-group-name /aws/eks/my-cluster/cluster \
  --retention-in-days 14

# VPC flow logs — 7 days (archive to S3 if you need long-term)
aws logs put-retention-policy \
  --log-group-name /vpc/flow-logs \
  --retention-in-days 7

If you need logs longer than 30 days for compliance, export them to S3 where storage costs $0.023/GB-month (standard) or $0.004/GB-month (Glacier) — a fraction of CloudWatch pricing.
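
Exports are one-off tasks per log group and time range. A sketch, assuming a hypothetical my-log-archive-bucket that already allows CloudWatch Logs to write to it:

# Export the last 90 days of a log group to S3 (timestamps are in milliseconds)
aws logs create-export-task \
  --log-group-name /ecs/my-application \
  --from $(( ($(date +%s) - 90*24*3600) * 1000 )) \
  --to $(( $(date +%s) * 1000 )) \
  --destination my-log-archive-bucket \
  --destination-prefix ecs/my-application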

Automate this across all log groups:

# Set 30-day retention on all log groups that have no retention policy
for log_group in $(aws logs describe-log-groups \
  --query "logGroups[?retentionInDays==null].logGroupName" \
  --output text); do
  echo "Setting 30-day retention on: $log_group"
  aws logs put-retention-policy \
    --log-group-name "$log_group" \
    --retention-in-days 30
done

6. Elastic IPs Not Attached to Running Instances

Unattached Elastic IPs cost $0.005 per hour ($3.60/month) apiece. Since February 2024, AWS charges the same $0.005/hr for every public IPv4 address, including those attached to running instances, but the unattached EIPs are the avoidable portion.

aws ec2 describe-addresses \
  --query "Addresses[?AssociationId==null].{IP:PublicIp,AllocationId:AllocationId}" \
  --output table

Release any EIP that is not attached to a running resource. If you are holding EIPs "just in case," release them — you can allocate a new one in seconds when you actually need it.
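
Releasing one is a single call against the allocation ID from the listing above:

# Release an unattached Elastic IP
aws ec2 release-address --allocation-id eipalloc-0123456789abcdef0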

7. Old EBS Snapshots

Snapshots are cheap ($0.05/GB-month) and incremental, but they accumulate over time. Automated backup tools and AMI builders create snapshots that are never cleaned up, and the changed blocks captured by each new one keep adding to the stored total. A year of daily snapshots of a busy 200GB volume can easily run past $100/month in snapshot storage.

Find snapshots older than 90 days:

aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -u -d '90 days ago' +%Y-%m-%dT%H:%M:%S)'].{ID:SnapshotId,Size:VolumeSize,Created:StartTime,Description:Description}" \
  --output table

Review before deleting — some snapshots back AMIs that are still in use. Check AMI associations:

# Find which AMIs reference a snapshot
aws ec2 describe-images \
  --owners self \
  --query "Images[*].{AMI:ImageId,Snapshots:BlockDeviceMappings[*].Ebs.SnapshotId}" \
  --output table

Set up a lifecycle policy with AWS Data Lifecycle Manager to automate snapshot retention:

resource "aws_dlm_lifecycle_policy" "ebs_snapshots" {
  description        = "Automated EBS snapshot lifecycle"
  execution_role_arn = aws_iam_role.dlm.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "Daily snapshots with 14-day retention"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"]
      }

      retain_rule {
        count = 14
      }

      tags_to_add = {
        SnapshotType = "automated"
      }
    }

    target_tags = {
      Backup = "true"
    }
  }
}

8. Savings Plans and Reserved Instances

If you have stable baseline compute — instances or Fargate tasks that run 24/7 — you are leaving significant money on the table by paying on-demand rates.

Compute Savings Plans offer the best flexibility:

  • 1-year no-upfront commitment: ~30% savings
  • 1-year all-upfront: ~35% savings
  • 3-year all-upfront: ~50% savings

Savings Plans apply automatically to EC2, Fargate, and Lambda usage. You commit to a dollar-per-hour amount, not specific instance types, so you retain flexibility to change instance sizes and families.

How to size your commitment:

# Check your consistent baseline usage over the last 30 days
# AWS Cost Explorer → Savings Plans → Recommendations
# Or use the CLI:
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS

The recommendation engine analyzes your usage history and suggests an hourly commitment that covers your steady-state usage without over-committing. Start conservative — cover 60-70% of your baseline — and add more after a month if your usage is stable.

Do not buy Savings Plans for variable or shrinking workloads. The commitment is binding. If your usage drops, you pay for capacity you are not using. Only commit to the floor of your usage, not the average.

9. Dev and Staging Environments Running 24/7

Development and staging environments that run around the clock when the team works 8-10 hours a day are paying for 14-16 hours of idle time daily. That is 60-70% waste on those environments.

For EC2-based environments, use AWS Instance Scheduler, or wire up a pair of EventBridge Scheduler rules yourself:

# Simple approach: stop/start on a schedule via EventBridge
resource "aws_scheduler_schedule" "stop_dev" {
  name       = "stop-dev-instances"
  group_name = "default"

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression = "cron(0 19 ? * MON-FRI *)"  # 7 PM weekdays

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:ec2:stopInstances"
    role_arn = aws_iam_role.scheduler.arn

    input = jsonencode({
      InstanceIds = var.dev_instance_ids
    })
  }
}

resource "aws_scheduler_schedule" "start_dev" {
  name       = "start-dev-instances"
  group_name = "default"

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression = "cron(0 7 ? * MON-FRI *)"  # 7 AM weekdays

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:ec2:startInstances"
    role_arn = aws_iam_role.scheduler.arn

    input = jsonencode({
      InstanceIds = var.dev_instance_ids
    })
  }
}

For EKS-based environments, scale the node group to zero overnight and use Karpenter to scale back up when pods are scheduled in the morning.
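
For managed node groups, scaling to zero is a one-line update to the scaling config, sketched here with hypothetical cluster and node group names (maxSize has to stay at 1 or above):

# Scale a dev node group to zero outside business hours
aws eks update-nodegroup-config \
  --cluster-name dev-cluster \
  --nodegroup-name dev-nodes \
  --scaling-config minSize=0,maxSize=1,desiredSize=0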

For RDS, stop development databases when not in use. Stopped RDS instances automatically restart after 7 days (an AWS limitation), so pair the stop action with a Lambda that re-stops it if it is outside business hours.
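
Stopping and starting is one call each, shown with a hypothetical dev-db identifier:

# Stop a development database (AWS restarts it automatically after 7 days)
aws rds stop-db-instance --db-instance-identifier dev-db

# Start it again when needed
aws rds start-db-instance --db-instance-identifier dev-db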

10. Data Transfer Costs

Data transfer is the most overlooked cost category because it does not map to a single service in the AWS bill. It appears as line items across EC2, ELB, NAT Gateway, CloudFront, and others.

Common sources of unexpected data transfer charges:

  • Cross-AZ traffic: EC2 instances and pods talking to RDS, ElastiCache, or other services in a different AZ. Each direction costs $0.01/GB. For a chatty microservices architecture, this adds up quickly. Place databases and their primary consumers in the same AZ when possible.

  • Internet egress without CloudFront: Serving static assets directly from your application instances instead of through CloudFront. CloudFront's data transfer pricing ($0.085/GB) is cheaper than direct EC2 egress ($0.09/GB), and caching reduces the total bytes transferred.

  • S3 in the wrong region: An application in us-east-1 reading from an S3 bucket in us-west-2 pays cross-region data transfer charges. Check that your buckets are co-located with the services that access them.

Use Cost Explorer with the "Data Transfer" service filter and group by usage type to identify the biggest contributors.
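
The same breakdown is available from the CLI. A sketch over the last 30 days; data transfer shows up as usage types containing "DataTransfer" or "Bytes", so expect to scan the grouped output:

# Previous 30 days of cost, grouped by usage type; look for the DataTransfer entries
aws ce get-cost-and-usage \
  --time-period Start=$(date -u -d '30 days ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE \
  --output table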

Running the Audit

Here is the order we recommend. Start with the items that require no commitment and no downtime — the pure waste elimination — then move to the optimization decisions that require evaluation.

No-risk cleanup (do these now):

  1. Delete unattached EBS volumes (after snapshotting)
  2. Release unattached Elastic IPs
  3. Set CloudWatch log retention policies
  4. Delete orphaned load balancers
  5. Clean up old EBS snapshots

Low-risk right-sizing (schedule a maintenance window):

  6. Right-size RDS instances based on 2 weeks of metrics
  7. Turn off Multi-AZ on non-production databases
  8. Add VPC endpoints for S3, ECR, and other high-traffic AWS services

Scheduled savings (implement and monitor):

  9. Schedule dev/staging environments to stop outside business hours
  10. Purchase Compute Savings Plans for stable baseline workloads

Ongoing monitoring:

  • Set up AWS Budgets with alerts at 80% and 100% of expected monthly spend
  • Review Cost Explorer weekly for the first month, then monthly
  • Tag all resources with team, environment, and service for cost allocation

# Set a budget alert
aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "Monthly AWS Spend",
    "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {"SubscriptionType": "EMAIL", "Address": "you@example.com"}
      ]
    }
  ]'

The Bottom Line

AWS cost optimization is not about switching to cheaper services or rearchitecting your application. It is about eliminating waste that accumulates naturally as your infrastructure grows. Unattached volumes, oversized databases, load balancers attached to nothing, and logs retained forever are the norm, not the exception.

The ten items in this checklist typically recover 30-50% of monthly spend for teams that have not done a recent audit. The cleanup takes a day. The right-sizing takes a week. The savings compound every month.

If your AWS bill has been climbing and nobody has looked at it in six months, the waste is there. The only question is how much.