Cloud & DevOps · Cloud Costs

Cloud Cost Optimization: 15 Leaks You Can Fix This Month

Real savings from right-sizing, storage, and idle resources.

Reading time: ~8–12 min
Level: All levels

Cloud cost optimization is rarely about one “big lever.” It’s about stopping dozens of small leaks: overprovisioned compute, storage that never expires, idle environments, surprise data transfer, and services no one remembers creating. The good news: you can get real savings this month without risky refactors.


Quickstart

If you only have an hour, start here. These steps prioritize “low-risk, high-confidence” savings: things that are obviously unused, obviously oversized, or missing guardrails.

Day 1: Stop the bleeding (60–90 minutes)

  • Turn on cost alerts: a basic budget + anomaly detection (per account/project and for the biggest services)
  • Find idle resources: stopped VMs with attached disks, unattached volumes, unused IPs, orphaned load balancers
  • Set log retention: avoid “forever logs” (pick a default retention that matches your compliance needs)
  • Add a “Who owns this?” tag: team/service/env on new resources (even before you fix the past)

Day 2–7: Quick wins that usually pay back

  • Right-size the top 5 compute spenders using 14–30 days of metrics (CPU, memory, and disk I/O)
  • Storage lifecycle: move older objects/snapshots to cheaper tiers or delete by policy
  • Kill zombie images in registries and stale artifacts in CI storage
  • Schedule dev/test to shut down nights/weekends (or migrate to on-demand ephemeral envs)
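The scheduling win is easy to sanity-check with arithmetic: compare always-on hours to a business-hours schedule. A minimal sketch, with illustrative numbers:

```python
# Rough payback estimate for dev/test scheduling (numbers are illustrative).
WEEKS_PER_MONTH = 4.35  # 52 weeks / 12 months

def monthly_hours(hours_per_day: float, days_per_week: float) -> float:
    """Approximate monthly runtime hours for a given schedule."""
    return hours_per_day * days_per_week * WEEKS_PER_MONTH

always_on = monthly_hours(24, 7)       # ~730 hours/month
business_hours = monthly_hours(12, 5)  # ~260 hours/month

savings_pct = 1 - business_hours / always_on
print(f"Scheduling saves ~{savings_pct:.0%} of dev/test compute hours")
```

For per-hour billed resources, that fraction maps almost directly to the bill, which is why nights-and-weekends schedules often pay back in the first month.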

Triage order: maximize savings with minimal risk
  1. Unused (safe to delete after verifying) → IPs, unattached disks, empty load balancers
  2. Over-retained (safe to reduce) → logs, snapshots, artifacts
  3. Overprovisioned (safe to tune) → instances, databases, Kubernetes requests
  4. Architectural (bigger projects) → data transfer patterns, NAT gateways, chatty services

Before you delete anything
  • Verify owner + last access (tags, audit logs, metrics, or service catalog)
  • Take a snapshot/backup if the blast radius is unclear
  • Prefer “disable / detach / quarantine” over “hard delete” for the first pass
  • Document what you changed and how to roll back

Overview

“Cloud spend” feels mysterious when it arrives as a single monthly invoice. Cloud cost optimization becomes straightforward once you split costs into drivers (compute, storage, network, managed services) and ownership (teams/services/environments). This post gives you a month-long plan and a list of 15 common leaks you can fix with minimal drama.

What you’ll walk away with

  • A practical sequence: visibility → cleanup → right-sizing → commitments → guardrails
  • 15 cost leaks (compute, storage, network, Kubernetes, process) with fixes and gotchas
  • Templates you can copy into your workflow (audit scripts, tagging discipline, Kubernetes resource patterns)
  • A scan-friendly cheatsheet for recurring monthly cost reviews

What “real savings” looks like

  • Fewer idle resources (paying for nothing)
  • Right-sized baselines (paying for what you actually use)
  • Shorter retention (logs/snapshots/artifacts aren’t forever by default)
  • Less surprise egress (network and NAT costs are visible and controlled)
  • Accountability (teams can see and own their spend)

Cost optimization is a product, not a cleanup sprint

The first month gets you wins. The long-term payoff comes from guardrails: defaults (tagging, retention), automation (schedules, policies), and a recurring review cadence that keeps leaks from returning.

Core concepts

FinOps: the “operating system” for cloud spend

FinOps is simply the collaboration layer between engineering, finance, and product that turns cloud bills into actionable signals. You don’t need a big program to start—you need clear ownership, good data (tags), and a habit of making tradeoffs visible.

The four cost drivers

  • Compute — driven by instance size, runtime hours, autoscaling, and baseline capacity. Most common leak: overprovisioned instances and always-on dev. Fastest fix: right-size + schedules.
  • Storage — driven by GB-month, snapshots, backup retention, and hot vs cold tiers. Most common leak: “retain forever” defaults. Fastest fix: lifecycle + retention policy.
  • Network — driven by egress, cross-AZ/region traffic, and NAT gateway processing. Most common leak: chatty services through expensive paths. Fastest fix: make egress visible + route efficiently.
  • Managed services — driven by provisioned capacity, IOPS, replicas, and throughput units. Most common leak: oversized DBs and unused replicas. Fastest fix: downsize + autoscale/auto-pause where appropriate.

Unit economics: the metric that stops arguments

“We spent $X” is rarely useful on its own. Better: cost per unit (per request, per user, per job, per GB processed). Unit economics helps you decide whether to optimize, refactor, or accept spend as the price of growth.
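A concrete way to see why unit economics ends arguments: spend can rise while unit cost falls. The numbers below are hypothetical.

```python
# Unit economics: turn "we spent $X" into cost per unit (hypothetical numbers).
def cost_per_unit(total_cost: float, units: float) -> float:
    """Cost per request/user/job; guard against divide-by-zero."""
    return total_cost / units if units else float("inf")

last_month = cost_per_unit(12_000, 30_000_000)  # $12k for 30M requests
this_month = cost_per_unit(15_000, 45_000_000)  # spend up 25%, traffic up 50%

print(f"last month: ${last_month * 1000:.3f} per 1k requests")
print(f"this month: ${this_month * 1000:.3f} per 1k requests")
# Spend rose, but unit cost fell -- likely growth, not waste.
```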

Tags and allocation: you can’t fix what you can’t attribute

If you can’t map spend to a team/service/environment, optimization becomes guesswork. The simplest tagging strategy that works at scale is:

  • team: who owns the resource
  • service: what system it belongs to
  • env: prod / staging / dev
  • cost-center (optional): how finance groups it
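Enforcing that standard can start as a simple audit script. A minimal sketch that checks hypothetical resource records (not a real cloud API) against the required tag set:

```python
# Minimal tag-standard check. Resource dicts are hypothetical; adapt the
# shape to whatever your inventory export or cloud API returns.
REQUIRED_TAGS = {"team", "service", "env"}

def missing_tags(resource: dict) -> set:
    """Return which required tags are absent or empty on a resource."""
    tags = resource.get("tags") or {}
    return {t for t in REQUIRED_TAGS if not tags.get(t)}

resources = [
    {"id": "vm-1", "tags": {"team": "payments", "service": "api", "env": "prod"}},
    {"id": "vm-2", "tags": {"team": "payments"}},
]

for r in resources:
    gaps = missing_tags(r)
    if gaps:
        print(f"{r['id']}: missing {sorted(gaps)}")
```

Run it weekly against new resources first; retro-tagging the past can follow once the default path is clean.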

The “visibility first” rule

The fastest savings usually come from deleting unused things. The fastest repeatable savings come from making costs visible to the people who can change them.

Risk-managed optimization

The most expensive cloud cost project is the one that causes downtime. Favor changes that are: reversible (easy rollback), measurable (you can see impact), and scoped (limited blast radius).

Step-by-step

This is a practical month plan for cloud cost optimization. You can run it with a small team and basic access to your billing data, cloud inventory, and metrics.

Week 1 — Visibility and guardrails

Do this

  • Create a top spend dashboard (top services, top projects/accounts, top teams)
  • Enable budgets + alerts for the biggest categories and for production accounts
  • Define tag standards (team/service/env) and enforce them for new resources
  • Set default log retention and require explicit exceptions

Watch out for

  • “One giant shared account/project” hiding ownership
  • Tagging that is optional (optional usually means “never”)
  • Dashboards with too much detail (start with top 10)
  • Alerts that spam (route to owners, tune thresholds)

Week 2 — Delete and downgrade (low-risk savings)

The goal is to find things that are provably unused or over-retained. This week is where most teams get quick wins without changing application code.

Inventory idle resources with a simple CLI sweep

This bash script is intentionally conservative: it lists candidates for review (it does not delete). Run it from a workstation or a CI job with read-only credentials. Adapt the commands to your cloud provider (the pattern is the same everywhere).

#!/usr/bin/env bash
set -euo pipefail

# Quick AWS inventory for common cost leaks (read-only).
# Requirements: aws CLI configured. Optional: jq for nicer output.
# Tip: start in non-prod accounts/projects.

echo "== Stopped EC2 instances (still pay for attached storage/EIPs) =="
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --query "Reservations[].Instances[].[InstanceId,InstanceType,State.Name,Tags]" \
  --output table || true

echo "== Unattached EBS volumes (often pure waste) =="
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query "Volumes[].[VolumeId,Size,VolumeType,CreateTime,Tags]" \
  --output table || true

echo "== Elastic IPs not associated (hourly cost) =="
aws ec2 describe-addresses \
  --query "Addresses[?AssociationId==null].[PublicIp,AllocationId,Tags]" \
  --output table || true

echo "== Load balancers with no targets (common zombie) =="
# You may have ALB/NLB/CLB; start with target groups for ALB/NLB:
aws elbv2 describe-target-groups \
  --query "TargetGroups[].TargetGroupArn" --output text 2>/dev/null \
  | tr '\t' '\n' | while read -r arn; do
    [[ -z "$arn" || "$arn" == "None" ]] && continue
    count="$(aws elbv2 describe-target-health --target-group-arn "$arn" \
      --query "length(TargetHealthDescriptions)" --output text 2>/dev/null || echo 0)"
    if [[ "$count" == "0" ]]; then
      echo "Empty target group: $arn"
    fi
  done

echo "Done. Review candidates with owners before cleanup."

Make cleanup safe
  • Tag the resource with cleanup-candidate=true and a date before deletion
  • Notify owners and wait a short window (e.g., 3–7 days)
  • Prefer “disable/detach” first when unsure
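The quarantine workflow above is easy to encode. A minimal sketch (tag names like cleanup-candidate are illustrative, matching the convention suggested above):

```python
# Quarantine-before-delete: tag a candidate, then only allow deletion after
# the review window has elapsed. Tag names are illustrative conventions.
from datetime import date, timedelta

REVIEW_WINDOW = timedelta(days=7)

def mark_candidate(tags: dict, today: date) -> dict:
    """Return tags with the quarantine marker and date added."""
    return {**tags, "cleanup-candidate": "true", "cleanup-marked": today.isoformat()}

def safe_to_delete(tags: dict, today: date) -> bool:
    """True only if the resource was quarantined and the window has passed."""
    if tags.get("cleanup-candidate") != "true":
        return False
    marked = date.fromisoformat(tags["cleanup-marked"])
    return today - marked >= REVIEW_WINDOW

tags = mark_candidate({"team": "data"}, date(2024, 6, 1))
print(safe_to_delete(tags, date(2024, 6, 3)))  # False: still in review
print(safe_to_delete(tags, date(2024, 6, 9)))  # True: window elapsed
```

The same check works whether deletion is manual or automated: the automation simply refuses anything that was never quarantined.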

Week 3 — Right-sizing (where big savings often live)

Right-sizing means aligning capacity with actual usage. The trap is right-sizing based on a single “quiet day.” Use 14–30 days of metrics, and always keep a safety margin for spikes.

Right-sizing checklist

  • Pick the top 5–10 workloads by cost (compute + DB + Kubernetes nodes)
  • Look at p95 usage (CPU, memory), not average
  • Reduce one notch at a time; validate with load/latency and error rates
  • Document the change and how to revert
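The "p95, not average" rule is simple to apply even without a metrics platform: export samples and size to the smallest tier that covers p95 plus headroom. A sketch with an illustrative size table (swap in your provider's real sizes):

```python
import math

def p95(samples):
    """Nearest-rank p95: small and dependency-free."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

# Illustrative size table (name -> vCPUs); not any provider's real catalog.
SIZES = [("small", 2), ("medium", 4), ("large", 8), ("xlarge", 16)]

def recommend(cpu_samples, headroom=1.3):
    """Smallest size whose capacity covers p95 usage plus headroom."""
    need = p95(cpu_samples) * headroom
    for name, vcpus in SIZES:
        if vcpus >= need:
            return name
    return "keep current size"

# 30 days of hourly CPU-core usage with a few spikes (synthetic data):
usage = [1.2] * 700 + [3.5] * 20
print(recommend(usage))
```

Note how the spikes barely move p95: with synthetic data like this, a workload idling at 1.2 cores still fits the smallest tier, which an average-plus-fear sizing would never suggest.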

Where right-sizing hides

  • Databases with provisioned IOPS or oversized replicas
  • Kubernetes requests that are 2–10× higher than real usage
  • Caches (Redis/Memcached) that were sized for “one incident” and never revisited
  • Always-on dev/test “just in case”

Spot missing tags and top cost drivers from a billing export

You don’t need a perfect FinOps platform to start. Export your billing data (CSV), then group spend by service and tag coverage. This Python script assumes a CSV with at least cost, service, and an optional team (or similar) column. Rename columns to match your export (e.g., CUR, BigQuery billing export, or a provider console download).

#!/usr/bin/env python3
import csv
import sys
from collections import defaultdict

# Usage:
#   python3 cost_report.py billing.csv
#
# Expected columns (rename to match your export):
#   - service: e.g., "Compute Engine", "EC2", "S3"
#   - cost: numeric (monthly or daily)
#   - team: optional tag/label (empty means "unallocated")

path = sys.argv[1] if len(sys.argv) > 1 else None
if not path:
  print("Usage: python3 cost_report.py <billing.csv>", file=sys.stderr)
  sys.exit(2)

by_service = defaultdict(float)
by_team = defaultdict(float)
unallocated = 0.0
total = 0.0

with open(path, newline="") as f:
  reader = csv.DictReader(f)
  for row in reader:
    try:
      cost = float(row.get("cost", "0") or 0)
    except ValueError:
      continue

    service = (row.get("service") or "unknown").strip() or "unknown"
    team = (row.get("team") or "").strip()

    total += cost
    by_service[service] += cost
    if team:
      by_team[team] += cost
    else:
      unallocated += cost

def top(d, n=10):
  return sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:n]

print("== Top services ==")
for svc, c in top(by_service, 10):
  print(f"{svc:30s} {c:12.2f}")

print("\n== Top teams (allocated) ==")
for team, c in top(by_team, 10):
  print(f"{team:30s} {c:12.2f}")

print("\n== Allocation ==")
pct_unalloc = (unallocated / total * 100.0) if total > 0 else 0.0
print(f"Total cost:        {total:.2f}")
print(f"Unallocated cost:  {unallocated:.2f} ({pct_unalloc:.1f}%)")

if pct_unalloc > 10.0:
  print("\nTip: Add/standardize tags (team/service/env). Unallocated spend hides the biggest leaks.")

Week 4 — Commitments, scheduling, and “keep it fixed” automation

Once your baseline is reasonable, you can safely apply longer-term optimizations: reserved capacity / savings plans / committed use, autoscaling and schedules, and policy enforcement so old problems don’t creep back.

Commitments (use only after right-sizing)

  • Identify steady-state workloads (always-on prod capacity)
  • Start with a conservative coverage target (e.g., a portion of baseline)
  • Prefer flexibility when unsure (broader commitments over narrow instance types)
  • Revisit monthly as usage changes
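Why "start conservative" matters is clearest in a break-even sketch: committed capacity is paid for whether you use it or not, so over-committing can cost more than partial coverage. The 30% discount below is illustrative, not a real price.

```python
# Commitment break-even sketch (illustrative 30% discount; real rates vary
# by provider, term, and payment option).
def blended_cost(baseline_units: float, committed_units: float,
                 on_demand_rate: float, discount: float = 0.30) -> float:
    """Monthly cost when `committed_units` are covered by a commitment.
    Committed capacity is paid for even if usage drops below it."""
    covered = min(committed_units, baseline_units)
    cost = committed_units * on_demand_rate * (1 - discount)  # paid regardless
    cost += max(baseline_units - covered, 0) * on_demand_rate
    return cost

baseline, rate = 100.0, 1.0
print(f"0% coverage:   {blended_cost(baseline, 0, rate):.0f}")
print(f"60% coverage:  {blended_cost(baseline, 60, rate):.0f}")
print(f"120% coverage: {blended_cost(baseline, 120, rate):.0f}  # over-commit costs more")
```

Covering 60% of a steady baseline saves money; committing to 120% of it costs more than the 60% position, which is exactly the trap of buying commitments before right-sizing.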

Automation that prevents regression

  • Schedules for dev/test and ephemeral environments
  • Storage lifecycle policies and retention defaults
  • Policy-as-code: require tags and block obvious waste patterns
  • Monthly review: top changes, biggest anomalies, biggest unallocated spend

The 15 leaks (and what to do about them)

Use this as your backlog. Pick the top 3 leaks by impact in your environment and fix them end-to-end (visibility → owner → change → verification).

  1. Overprovisioned instances/VMs (compute) — right-size using p95 metrics; validate performance
  2. Stopped instances with attached storage (compute + disks) — delete/hibernate properly; snapshot if needed
  3. Unattached volumes / orphaned disks (storage) — identify “available/unattached”; tag + delete after review
  4. Unused public IPs (network) — release unassociated IPs; document reservation needs
  5. Zombie load balancers / empty target groups (network) — remove LBs with no traffic/targets; update DNS
  6. Snapshots and backups retained forever (storage) — retention policy per env; prune old snapshots safely
  7. Logs retained forever (observability) — default retention; archive cold logs if required
  8. Object storage stuck in hot tier (storage) — lifecycle rules: infrequent access / archive / delete
  9. Container registry bloat (artifacts) — keep N tags; delete unreferenced layers/images
  10. CI artifacts and caches never cleaned (CI/CD) — retention and size limits; purge old builds
  11. Chatty services through expensive paths (network/NAT) — reduce cross-AZ/region traffic; route locally
  12. Unexpected egress (downloads/exports) (network) — alert on egress spikes; use CDN/cache; keep data local
  13. Oversized databases / replicas (managed DB) — right-size, remove unused replicas, scale storage properly
  14. Kubernetes request inflation (Kubernetes) — set sane requests/limits; autoscale; avoid idle nodes
  15. Missing tags / unallocated spend (everywhere) — enforce tags; fix top offenders; add ownership

Why leaks come back

Most leaks return because the default path is wasteful (no tags, infinite retention, always-on environments). If you fix the defaults, you stop paying the same “cloud tax” every quarter.
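Leak #9 (registry bloat) is a good example of fixing the default: a "keep the newest N tags" policy, never touching protected tags. A minimal sketch over hypothetical (name, pushed_at) pairs; real registries expose similar metadata:

```python
# "Keep the newest N tags" plan for registry cleanup (leak #9).
# Tags are (name, pushed_at) pairs; pushed_at is any sortable timestamp.
def prune_plan(tags, keep=5, protected=("latest", "stable")):
    """Return (keep_list, delete_list) by push time, never touching protected tags."""
    safe = [t for t in tags if t[0] in protected]
    rest = sorted((t for t in tags if t[0] not in protected),
                  key=lambda t: t[1], reverse=True)
    return safe + rest[:keep], rest[keep:]

tags = [("latest", 900), ("v1.4", 800), ("v1.3", 700), ("v1.2", 600),
        ("v1.1", 500), ("v1.0", 400)]
keep_list, delete_list = prune_plan(tags, keep=3)
print("delete:", [name for name, _ in delete_list])
```

Running the plan as a dry-run report first (as above) and deleting only after review keeps this in the low-risk tier.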

Kubernetes cost control: requests, limits, and autoscaling

In Kubernetes, overestimated requests lead to wasted nodes. Underestimated requests lead to throttling and instability. The pattern below sets a reasonable baseline and lets an autoscaler handle bursts. Tune values with real usage metrics.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

How to right-size Kubernetes safely
  • Start by setting realistic requests (based on p95 usage) before touching limits
  • Keep some headroom; let the HPA scale pods rather than over-requesting resources
  • Watch node utilization and pod evictions after changes
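Turning "p95 plus headroom" into an actual request value is mechanical: round up to a tidy step so manifests stay readable. A sketch with illustrative headroom and step values:

```python
# Derive a CPU request from observed p95 usage (headroom/step are illustrative).
import math

def cpu_request_millicores(p95_usage_m: float, headroom: float = 1.2,
                           step_m: int = 50) -> int:
    """p95 usage plus headroom, rounded up to a tidy step (e.g. 50m)."""
    raw = p95_usage_m * headroom
    return int(math.ceil(raw / step_m) * step_m)

# Observed p95 of 130m -> 130 * 1.2 = 156, rounded up to the next 50m step:
print(f"{cpu_request_millicores(130)}m")
```

The same shape works for memory (use Mi and a larger step). Keeping the headroom modest and letting the HPA absorb bursts is what prevents request inflation from creeping back.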

Common mistakes

Most cost optimization efforts fail for predictable reasons: they optimize the wrong thing, break something, or don’t stick. Here are the pitfalls to avoid (and what to do instead).

Mistake: Optimizing before you have ownership

If no one owns spend, no one maintains the fix.

  • Fix: tag standards (team/service/env) + an “unallocated spend” target.
  • Fix: route alerts to owners, not a shared inbox.

Mistake: Deleting resources without verifying impact

A cheap win that causes downtime becomes an expensive incident.

  • Fix: identify last access, check dependencies, and snapshot first when unsure.
  • Fix: use “quarantine tags” and short review windows.

Mistake: Right-sizing based on averages

Averages hide spikes; spikes are what break production.

  • Fix: use p95/p99, add headroom, and downsize in steps.
  • Fix: validate with latency/error signals (not just “CPU looks fine”).

Mistake: Buying commitments too early

Commitments lock in waste if the baseline is oversized.

  • Fix: right-size first, then commit to steady-state capacity.
  • Fix: start conservative and increase coverage gradually.

Mistake: Ignoring network and data transfer

Egress and NAT costs can quietly rival compute.

  • Fix: make egress a first-class metric and alert on spikes.
  • Fix: reduce cross-zone/region chatter; use caching/CDNs where appropriate.
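A back-of-envelope model makes the "quietly rivals compute" claim concrete. The per-GB rates below are placeholders, not real prices — substitute your provider's published rates:

```python
# Back-of-envelope data transfer cost model (rates are placeholders, not
# real provider prices; cross-AZ traffic is often billed in both directions).
RATES_PER_GB = {
    "same-az": 0.00,
    "cross-az": 0.02,
    "internet-egress": 0.09,
}

def transfer_cost(gb: float, path: str) -> float:
    """Monthly transfer cost for a given volume and path."""
    return gb * RATES_PER_GB[path]

# A chatty service pair pushing 50 TB/month across AZs:
monthly_gb = 50_000
print(f"cross-az: ${transfer_cost(monthly_gb, 'cross-az'):.0f}/mo")
print(f"same-az:  ${transfer_cost(monthly_gb, 'same-az'):.0f}/mo")
```

Even at placeholder rates, moving a chatty path from cross-AZ to same-AZ zeroes out a four-figure monthly line item, which is why routing changes rank as "small design change, big payoff."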

Mistake: “One-time cleanup” with no guardrails

Leaks return because defaults stay wasteful.

  • Fix: enforce tags, retention, and schedules by policy.
  • Fix: add a monthly cost review cadence and track regression.

A simple success metric

If you can answer “what changed” when spend moves (up or down), your cost program is working. If every spike is a mystery, fix allocation and visibility first.

FAQ

What is cloud cost optimization, in practical terms?

Cloud cost optimization is the discipline of reducing waste while keeping reliability and performance intact. Practically, it means deleting unused resources, right-sizing what remains, using the right storage tiers and retention, managing data transfer, and adding guardrails so leaks don’t reappear.

Where do most teams find the first meaningful savings?

Usually in idle resources (stopped instances, unattached volumes, unused IPs/load balancers), over-retained data (logs/snapshots/artifacts), and overprovisioned compute. These are measurable and typically low-risk when handled with ownership and rollback plans.

How do we right-size safely without causing outages?

Use 14–30 days of metrics, size to p95 (not average), keep headroom, and make changes in small steps. Validate with user-facing signals (latency/error rate) and always document a rollback path.

Should we focus on reserved instances / savings plans / committed use?

Yes, but after you fix the baseline. Commitments amplify whatever baseline you have: if it’s oversized, you lock in waste. Start conservative on steady-state capacity, prefer flexible commitments when uncertain, and revisit monthly.

How important are tags for cost optimization?

Tags (or labels) are foundational. Without allocation, you can’t answer “who owns this spend,” and optimization becomes slow and political. Aim to reduce “unallocated” spend steadily and enforce tagging on new resources.

What are the biggest “hidden” cloud cost drivers?

Data transfer/egress, NAT gateway processing, managed database replicas and provisioned IOPS, log ingestion and retention, and Kubernetes request inflation are common surprises. The fix is usually visibility (dashboards/alerts) plus a small design change (routing, caching, autoscaling, retention).

How often should we review cloud spend?

At least monthly for a comprehensive review, with weekly checks for anomalies in high-spend environments. A lightweight cadence that works: weekly “top anomalies” plus a monthly “top services + top teams + unallocated spend” review.

Cheatsheet

Use this for a monthly cloud cost optimization review. It’s intentionally compact: the goal is to spot the biggest leaks fast and assign owners.

Monthly review (30–45 minutes)

  • Top 10 services by cost (trend up/down)
  • Top 10 projects/accounts by cost
  • Top 10 teams/services by cost (requires tags)
  • Unallocated spend % (missing tags) and top offenders
  • Biggest anomalies (day/week spikes)

Weekly quick checks (10 minutes)

  • New resources without required tags
  • New public IPs / load balancers without traffic
  • Egress and NAT costs spike check
  • Log ingestion growth and retention exceptions
  • Dev/test schedules still enforced

“Before cleanup” safety checklist

  • Instances/VMs — verify owner, last CPU/network activity, and attached volumes. Safe first move: stop in non-prod; snapshot critical disks.
  • Volumes/disks — verify attachment status and last read/write. Safe first move: tag “cleanup-candidate”; snapshot before delete.
  • Load balancers — verify targets, traffic, and DNS references. Safe first move: disable listeners / remove DNS, then delete.
  • Snapshots/backups — verify retention policy and restore tests for critical paths. Safe first move: implement lifecycle; keep recent restore points.
  • Logs/artifacts — verify compliance/incident requirements. Safe first move: reduce retention + archive to cold storage if needed.

The most repeatable cost “win”

Make the default path cost-aware: enforce tags, enforce retention, and enforce schedules for non-prod. Then you only hunt exceptions.

Wrap-up

Cloud cost optimization doesn’t require a massive rewrite. In your first month, you can save real money by fixing the obvious leaks: delete unused resources, reduce retention, right-size the biggest workloads, make egress visible, and enforce ownership with tags. The lasting payoff comes from guardrails that prevent regressions.

Your next actions

  • Today: enable budgets/alerts, set log retention defaults, and run an idle resource inventory
  • This week: clean up 3 low-risk leak categories (unattached disks, unused IPs, zombie LBs)
  • This month: right-size the top 5 spenders and reduce “unallocated spend” via enforced tags
  • Ongoing: monthly review cadence + automation for schedules and lifecycles

If you want to make these changes safer and more repeatable, pair cost work with infrastructure-as-code and runbooks. The related posts below cover Terraform pitfalls, CI/CD patterns, and Kubernetes basics that help keep cost changes controlled.

Quiz

Quick self-check.

1) What’s the most reliable first step in cloud cost optimization?
2) Which category is typically a low-risk “quick win”?
3) Why is “right-size based on average CPU” a common mistake?
4) When do commitments (RIs/Savings Plans/Committed Use) make the most sense?