Cloud cost optimization is rarely about one “big lever.” It’s about stopping dozens of small leaks: overprovisioned compute, storage that never expires, idle environments, surprise data transfer, and services no one remembers creating. The good news: you can get real savings this month without risky refactors.
Quickstart
If you only have an hour, start here. These steps prioritize “low-risk, high-confidence” savings: things that are obviously unused, obviously oversized, or missing guardrails.
Day 1: Stop the bleeding (60–90 minutes)
- Turn on cost alerts: a basic budget + anomaly detection (per account/project and for the biggest services)
- Find idle resources: stopped VMs with attached disks, unattached volumes, unused IPs, orphaned load balancers
- Set log retention: avoid “forever logs” (pick a default retention that matches your compliance needs)
- Add a “Who owns this?” tag: team/service/env on new resources (even before you fix the past)
Day 2–7: Quick wins that usually pay back
- Right-size the top 5 compute spenders using 14–30 days of metrics (CPU, memory, and disk I/O)
- Storage lifecycle: move older objects/snapshots to cheaper tiers or delete by policy
- Kill zombie images in registries and stale artifacts in CI storage
- Schedule dev/test to shut down nights/weekends (or migrate to on-demand ephemeral envs)
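To put numbers on the shutdown-schedule bullet: a dev environment that runs only working hours pays for a fraction of an always-on month. A quick sketch, assuming 12 on-hours on weekdays and full weekend shutdown (all numbers illustrative):

```python
# Rough savings estimate for scheduling dev/test environments off
# nights and weekends. All numbers are illustrative assumptions.

HOURS_PER_WEEK = 24 * 7  # 168

def scheduled_savings(on_hours_per_weekday: float = 12, weekend_on_hours: float = 0) -> float:
    """Fraction of compute cost saved by an on/off schedule."""
    on_hours = on_hours_per_weekday * 5 + weekend_on_hours
    return 1 - on_hours / HOURS_PER_WEEK

# 12h weekdays, off on weekends: roughly 64% of always-on cost saved.
print(f"{scheduled_savings():.0%}")
```

Even a loose schedule (say, 16 on-hours every day) still cuts a third of the bill, which is why this is usually the highest-leverage change for non-prod.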
Triage what you find
- Unused (safe to delete after verifying) → IPs, unattached disks, empty load balancers
- Over-retained (safe to reduce) → logs, snapshots, artifacts
- Overprovisioned (safe to tune) → instances, databases, Kubernetes requests
- Architectural (bigger projects) → data transfer patterns, NAT gateways, chatty services
Before you delete anything
- Verify owner + last access (tags, audit logs, metrics, or service catalog)
- Take a snapshot/backup if the blast radius is unclear
- Prefer “disable / detach / quarantine” over “hard delete” for the first pass
- Document what you changed and how to roll back
Overview
“Cloud spend” feels mysterious when it arrives as a single monthly invoice. Cloud cost optimization becomes straightforward once you split costs into drivers (compute, storage, network, managed services) and ownership (teams/services/environments). This post gives you a month-long plan and a list of 15 common leaks you can fix with minimal drama.
What you’ll walk away with
- A practical sequence: visibility → cleanup → right-sizing → commitments → guardrails
- 15 cost leaks (compute, storage, network, Kubernetes, process) with fixes and gotchas
- Templates you can copy into your workflow (audit scripts, tagging discipline, Kubernetes resource patterns)
- A scan-friendly cheatsheet for recurring monthly cost reviews
What “real savings” looks like
- Fewer idle resources (paying for nothing)
- Right-sized baselines (paying for what you actually use)
- Shorter retention (logs/snapshots/artifacts aren’t forever by default)
- Less surprise egress (network and NAT costs are visible and controlled)
- Accountability (teams can see and own their spend)
Cost optimization is a product, not a cleanup sprint
The first month gets you wins. The long-term payoff comes from guardrails: defaults (tagging, retention), automation (schedules, policies), and a recurring review cadence that keeps leaks from returning.
Core concepts
FinOps: the “operating system” for cloud spend
FinOps is simply the collaboration layer between engineering, finance, and product that turns cloud bills into actionable signals. You don’t need a big program to start—you need clear ownership, good data (tags), and a habit of making tradeoffs visible.
The four cost drivers
| Driver | What causes spend | Most common leak | Fastest fix |
|---|---|---|---|
| Compute | Instance size, runtime hours, autoscaling, baseline capacity | Overprovisioned instances and always-on dev | Right-size + schedules |
| Storage | GB-month, snapshots, backup retention, hot vs cold tiers | “Retain forever” defaults | Lifecycle + retention policy |
| Network | Egress, cross-AZ/region traffic, NAT gateway processing | Chatty services through expensive paths | Make egress visible + route efficiently |
| Managed services | Provisioned capacity, IOPS, replicas, throughput units | Oversized DBs and unused replicas | Downsize + autoscale/auto-pause where appropriate |
Unit economics: the metric that stops arguments
“We spent $X” is rarely useful on its own. Better: cost per unit (per request, per user, per job, per GB processed). Unit economics helps you decide whether to optimize, refactor, or accept spend as the price of growth.
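The calculation itself is trivial; the discipline is measuring spend and demand over the same window. A sketch with hypothetical figures:

```python
# Cost per unit: spend divided by a demand metric over the same window.
# Figures below are hypothetical examples, not real billing data.

def cost_per_unit(total_cost: float, units: float) -> float:
    if units <= 0:
        raise ValueError("need a positive unit count")
    return total_cost / units

# Spend doubled, but requests tripled: unit cost actually fell.
last_month = cost_per_unit(12_000, 30_000_000)   # $ per request
this_month = cost_per_unit(24_000, 90_000_000)
print(f"last: ${last_month:.6f}/req  this: ${this_month:.6f}/req")
```

In this example, "spend doubled" sounds alarming, while "cost per request dropped by a third" is a win. The unit metric is what turns the monthly invoice into a decision.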
Tags and allocation: you can’t fix what you can’t attribute
If you can’t map spend to a team/service/environment, optimization becomes guesswork. The simplest tagging strategy that works at scale is:
- team: who owns the resource
- service: what system it belongs to
- env: prod / staging / dev
- cost-center (optional): how finance groups it
The fastest savings usually come from deleting unused things. The fastest repeatable savings come from making costs visible to the people who can change them.
Risk-managed optimization
The most expensive cloud cost project is the one that causes downtime. Favor changes that are: reversible (easy rollback), measurable (you can see impact), and scoped (limited blast radius).
Step-by-step
This is a practical month plan for cloud cost optimization. You can run it with a small team and basic access to your billing data, cloud inventory, and metrics.
Week 1 — Visibility and guardrails
Do this
- Create a top spend dashboard (top services, top projects/accounts, top teams)
- Enable budgets + alerts for the biggest categories and for production accounts
- Define tag standards (team/service/env) and enforce them for new resources
- Set default log retention and require explicit exceptions
Watch out for
- “One giant shared account/project” hiding ownership
- Tagging that is optional (optional usually means “never”)
- Dashboards with too much detail (start with top 10)
- Alerts that spam (route to owners, tune thresholds)
Week 2 — Delete and downgrade (low-risk savings)
The goal is to find things that are provably unused or over-retained. This week is where most teams get quick wins without changing application code.
Inventory idle resources with a simple CLI sweep
This bash script is intentionally conservative: it lists candidates for review (it does not delete). Run it from a workstation or a CI job with read-only credentials. Adapt the commands to your cloud provider (the pattern is the same everywhere).
#!/usr/bin/env bash
set -euo pipefail
# Quick AWS inventory for common cost leaks (read-only).
# Requirements: aws CLI configured. Optional: jq for nicer output.
# Tip: start in non-prod accounts/projects.

echo "== Stopped EC2 instances (still pay for attached storage/EIPs) =="
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --query "Reservations[].Instances[].[InstanceId,InstanceType,State.Name,Tags]" \
  --output table || true

echo "== Unattached EBS volumes (often pure waste) =="
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query "Volumes[].[VolumeId,Size,VolumeType,CreateTime,Tags]" \
  --output table || true

echo "== Elastic IPs not associated (hourly cost) =="
aws ec2 describe-addresses \
  --query "Addresses[?AssociationId==null].[PublicIp,AllocationId,Tags]" \
  --output table || true

echo "== Load balancers with no targets (common zombie) =="
# You may have ALB/NLB/CLB; start with target groups for ALB/NLB.
# Note: --output text prints ARNs tab-separated on one line, so split first.
aws elbv2 describe-target-groups \
  --query "TargetGroups[].TargetGroupArn" --output text 2>/dev/null \
  | tr '\t' '\n' | while read -r arn; do
    [[ -z "$arn" ]] && continue
    count="$(aws elbv2 describe-target-health --target-group-arn "$arn" \
      --query "length(TargetHealthDescriptions)" --output text 2>/dev/null || echo 0)"
    if [[ "$count" == "0" ]]; then
      echo "Empty target group: $arn"
    fi
  done || true

echo "Done. Review candidates with owners before cleanup."
Before acting on any candidate:
- Tag the resource with cleanup-candidate=true and a date before deletion
- Notify owners and wait a short window (e.g., 3–7 days)
- Prefer “disable/detach” first when unsure
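One way to make the review window concrete is to compute the quarantine tag set and an earliest-delete date up front, and refuse to delete until that date has passed. The helper below is a sketch (the tag keys are assumptions following the convention above); feed the resulting tags into your provider's tagging API:

```python
from datetime import date, timedelta

# Build quarantine tags for cleanup candidates: tag first, delete only
# after a review window has elapsed. Tag keys are illustrative.

def quarantine_tags(today: date, review_days: int = 7) -> dict:
    return {
        "cleanup-candidate": "true",
        "cleanup-tagged-on": today.isoformat(),
        "cleanup-earliest-delete": (today + timedelta(days=review_days)).isoformat(),
    }

def safe_to_delete(tags: dict, today: date) -> bool:
    """Delete only if the resource was quarantined and the window elapsed."""
    if tags.get("cleanup-candidate") != "true":
        return False
    earliest = tags.get("cleanup-earliest-delete")
    # ISO dates compare correctly as strings (YYYY-MM-DD).
    return earliest is not None and today.isoformat() >= earliest

tags = quarantine_tags(date(2024, 3, 1))
print(safe_to_delete(tags, date(2024, 3, 5)))  # False: still in review window
print(safe_to_delete(tags, date(2024, 3, 8)))  # True: window elapsed
```

The point of encoding the date in a tag (rather than a spreadsheet) is that any script or person looking at the resource sees the same deadline.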
Week 3 — Right-sizing (where big savings often live)
Right-sizing means aligning capacity with actual usage. The trap is right-sizing based on a single “quiet day.” Use 14–30 days of metrics, and always keep a safety margin for spikes.
Right-sizing checklist
- Pick the top 5–10 workloads by cost (compute + DB + Kubernetes nodes)
- Look at p95 usage (CPU, memory), not average
- Reduce one notch at a time; validate with load/latency and error rates
- Document the change and how to revert
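The "p95, not average" point is easy to demonstrate. A minimal sketch using a nearest-rank percentile and synthetic per-minute CPU samples:

```python
# Why p95, not average: spiky workloads look idle on average.
# Sample data below is synthetic, for illustration only.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (simple, no interpolation)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Mostly idle, with periodic spikes (CPU %):
cpu = [5.0] * 90 + [80.0] * 10
avg = sum(cpu) / len(cpu)
p95 = percentile(cpu, 95)
print(f"avg={avg:.1f}%  p95={p95:.1f}%")  # average says idle; p95 says not
```

Sizing to the 12.5% average here would throttle every spike; sizing to p95 plus headroom keeps the spikes inside capacity while still shrinking the instance.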
Where right-sizing hides
- Databases with provisioned IOPS or oversized replicas
- Kubernetes requests that are 2–10× higher than real usage
- Caches (Redis/Memcached) that were sized for “one incident” and never revisited
- Always-on dev/test “just in case”
Spot missing tags and top cost drivers from a billing export
You don’t need a perfect FinOps platform to start. Export your billing data (CSV), then group spend by service and tag coverage. This Python script assumes a CSV with at least cost, service, and an optional team (or similar) column. Rename columns to match your export (e.g., CUR, BigQuery billing export, or a provider console download).
#!/usr/bin/env python3
import csv
import sys
from collections import defaultdict

# Usage:
#   python3 cost_report.py billing.csv
#
# Expected columns (rename to match your export):
#   - service: e.g., "Compute Engine", "EC2", "S3"
#   - cost: numeric (monthly or daily)
#   - team: optional tag/label (empty means "unallocated")

path = sys.argv[1] if len(sys.argv) > 1 else None
if not path:
    print("Usage: python3 cost_report.py <billing.csv>", file=sys.stderr)
    sys.exit(2)

by_service = defaultdict(float)
by_team = defaultdict(float)
unallocated = 0.0
total = 0.0

with open(path, newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        try:
            cost = float(row.get("cost", "0") or 0)
        except ValueError:
            continue  # skip non-numeric rows (subtotals, blank lines)
        service = (row.get("service") or "unknown").strip() or "unknown"
        team = (row.get("team") or "").strip()
        total += cost
        by_service[service] += cost
        if team:
            by_team[team] += cost
        else:
            unallocated += cost

def top(d, n=10):
    return sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:n]

print("== Top services ==")
for svc, c in top(by_service, 10):
    print(f"{svc:30s} {c:12.2f}")

print("\n== Top teams (allocated) ==")
for team, c in top(by_team, 10):
    print(f"{team:30s} {c:12.2f}")

print("\n== Allocation ==")
pct_unalloc = (unallocated / total * 100.0) if total > 0 else 0.0
print(f"Total cost: {total:.2f}")
print(f"Unallocated cost: {unallocated:.2f} ({pct_unalloc:.1f}%)")
if pct_unalloc > 10.0:
    print("\nTip: Add/standardize tags (team/service/env). "
          "Unallocated spend hides the biggest leaks.")
Week 4 — Commitments, scheduling, and “keep it fixed” automation
Once your baseline is reasonable, you can safely apply longer-term optimizations: reserved capacity / savings plans / committed use, autoscaling and schedules, and policy enforcement so old problems don’t creep back.
Commitments (use only after right-sizing)
- Identify steady-state workloads (always-on prod capacity)
- Start with a conservative coverage target (e.g., a portion of baseline)
- Prefer flexibility when unsure (broader commitments over narrow instance types)
- Revisit monthly as usage changes
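One conservative way to pick that coverage target: commit only to capacity you use essentially all the time, and to only a fraction of it. A sketch, assuming you can export hourly instance counts (numbers are synthetic):

```python
# Conservative commitment sizing: find the floor of observed usage, then
# commit to a fraction of that floor. Sample data is synthetic.

def commit_target(hourly_usage: list[int], coverage: float = 0.8) -> int:
    """Commit to `coverage` of the minimum observed hourly usage."""
    floor = min(hourly_usage)
    return int(floor * coverage)

# A service that scales between 10 and 40 instances over the day:
usage = [10] * 8 + [25] * 8 + [40] * 8
print(commit_target(usage))  # commit to 8; expand later if the floor holds
```

Everything above the committed floor stays on-demand or spot, so a drop in traffic never leaves you paying for locked-in capacity you no longer use.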
Automation that prevents regression
- Schedules for dev/test and ephemeral environments
- Storage lifecycle policies and retention defaults
- Policy-as-code: require tags and block obvious waste patterns
- Monthly review: top changes, biggest anomalies, biggest unallocated spend
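Policy-as-code can start as a tiny check in CI that rejects resources missing the required tags. A sketch, assuming resources arrive as dicts (for example, parsed from your IaC plan or an inventory export):

```python
# Minimal "require tags" policy check. The resource dicts are a stand-in
# for whatever your IaC plan or inventory export actually produces.

REQUIRED_TAGS = ("team", "service", "env")

def missing_tags(resource: dict) -> list[str]:
    tags = resource.get("tags") or {}
    return [k for k in REQUIRED_TAGS if not tags.get(k)]

resources = [
    {"id": "vol-1", "tags": {"team": "payments", "service": "api", "env": "prod"}},
    {"id": "vol-2", "tags": {"env": "dev"}},
]
violations = {r["id"]: missing_tags(r) for r in resources if missing_tags(r)}
print(violations)  # {'vol-2': ['team', 'service']}
```

Fail the pipeline on any violation for new resources; for existing ones, report the list to owners first so enforcement doesn't block unrelated deploys.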
The 15 leaks (and what to do about them)
Use this as your backlog. Pick the top 3 leaks by impact in your environment and fix them end-to-end (visibility → owner → change → verification).
| # | Leak | Where it shows up | Fix this month |
|---|---|---|---|
| 1 | Overprovisioned instances/VMs | Compute | Right-size using p95 metrics; validate performance |
| 2 | Stopped instances with attached storage | Compute + disks | Delete/hibernate properly; snapshot if needed |
| 3 | Unattached volumes / orphaned disks | Storage | Identify “available/unattached”; tag + delete after review |
| 4 | Unused public IPs | Network | Release unassociated IPs; document reservation needs |
| 5 | Zombie load balancers / empty target groups | Network | Remove LBs with no traffic/targets; update DNS |
| 6 | Snapshots and backups retained forever | Storage | Retention policy per env; prune old snapshots safely |
| 7 | Logs retained forever | Observability | Default retention; archive cold logs if required |
| 8 | Object storage stuck in hot tier | Storage | Lifecycle rules: infrequent access / archive / delete |
| 9 | Container registry bloat | Artifacts | Keep N tags; delete unreferenced layers/images |
| 10 | CI artifacts and caches never cleaned | CI/CD | Retention and size limits; purge old builds |
| 11 | Chatty services through expensive paths | Network/NAT | Reduce cross-AZ/region traffic; route locally |
| 12 | Unexpected egress (downloads/exports) | Network | Alert on egress spikes; use CDN/cache; keep data local |
| 13 | Oversized databases / replicas | Managed DB | Right-size, remove unused replicas, scale storage properly |
| 14 | Kubernetes request inflation | Kubernetes | Set sane requests/limits; autoscale; avoid idle nodes |
| 15 | Missing tags / unallocated spend | All | Enforce tags; fix top offenders; add ownership |
Most leaks return because the default path is wasteful (no tags, infinite retention, always-on environments). If you fix the defaults, you stop paying the same “cloud tax” every quarter.
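Fixing the "retain forever" default (leaks 6–8) usually reduces to a policy like "keep everything newer than N days." A sketch of the selection logic, with hypothetical snapshot IDs and dates; wire it to your provider's snapshot listing:

```python
from datetime import date, timedelta

# Select snapshots older than a retention window as prune candidates.
# IDs and dates are hypothetical, for illustration only.

def prune_candidates(snapshots: dict[str, date], today: date, keep_days: int = 30) -> list[str]:
    """Return snapshot IDs created before the retention cutoff."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(sid for sid, created in snapshots.items() if created < cutoff)

snaps = {
    "snap-a": date(2024, 1, 1),
    "snap-b": date(2024, 2, 20),
    "snap-c": date(2024, 3, 1),
}
print(prune_candidates(snaps, today=date(2024, 3, 5)))  # ['snap-a']
```

Run it as a report first; once owners trust the output, the same selection can drive automated lifecycle deletion.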
Kubernetes cost control: requests, limits, and autoscaling
In Kubernetes, overestimated requests lead to wasted nodes. Underestimated requests lead to throttling and instability. The pattern below sets a reasonable baseline and lets an autoscaler handle bursts. Tune values with real usage metrics.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
- Start by setting realistic requests (based on p95 usage) before touching limits
- Keep some headroom; let the HPA scale pods rather than over-requesting resources
- Watch node utilization and pod evictions after changes
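The HPA's scaling decision follows a formula you can reason about directly: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch:

```python
import math

# The HPA scaling rule for a Utilization target (autoscaling/v2):
# desired = ceil(current * currentMetric / targetMetric).

def desired_replicas(current: int, current_util: float, target_util: float) -> int:
    return math.ceil(current * current_util / target_util)

# 2 replicas averaging 140% CPU against a 70% target: scale out to 4.
print(desired_replicas(2, 140, 70))
# 4 replicas averaging 35% against a 70% target: scale in to 2.
print(desired_replicas(4, 35, 70))
```

Note that "utilization" here is measured against the pod's CPU request, which is why inflated requests make the HPA think the service is idle and keep nodes underpacked.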
Common mistakes
Most cost optimization efforts fail for predictable reasons: they optimize the wrong thing, break something, or don’t stick. Here are the pitfalls to avoid (and what to do instead).
Mistake: Optimizing before you have ownership
If no one owns spend, no one maintains the fix.
- Fix: tag standards (team/service/env) + an “unallocated spend” target.
- Fix: route alerts to owners, not a shared inbox.
Mistake: Deleting resources without verifying impact
A cheap win that causes downtime becomes an expensive incident.
- Fix: identify last access, check dependencies, and snapshot first when unsure.
- Fix: use “quarantine tags” and short review windows.
Mistake: Right-sizing based on averages
Averages hide spikes; spikes are what break production.
- Fix: use p95/p99, add headroom, and downsize in steps.
- Fix: validate with latency/error signals (not just “CPU looks fine”).
Mistake: Buying commitments too early
Commitments lock in waste if the baseline is oversized.
- Fix: right-size first, then commit to steady-state capacity.
- Fix: start conservative and increase coverage gradually.
Mistake: Ignoring network and data transfer
Egress and NAT costs can quietly rival compute.
- Fix: make egress a first-class metric and alert on spikes.
- Fix: reduce cross-zone/region chatter; use caching/CDNs where appropriate.
Mistake: “One-time cleanup” with no guardrails
Leaks return because defaults stay wasteful.
- Fix: enforce tags, retention, and schedules by policy.
- Fix: add a monthly cost review cadence and track regression.
If you can answer “what changed” when spend moves (up or down), your cost program is working. If every spike is a mystery, fix allocation and visibility first.
FAQ
What is cloud cost optimization, in practical terms?
Cloud cost optimization is the discipline of reducing waste while keeping reliability and performance intact. Practically, it means deleting unused resources, right-sizing what remains, using the right storage tiers and retention, managing data transfer, and adding guardrails so leaks don’t reappear.
Where do most teams find the first meaningful savings?
Usually in idle resources (stopped instances, unattached volumes, unused IPs/load balancers), over-retained data (logs/snapshots/artifacts), and overprovisioned compute. These are measurable and typically low-risk when handled with ownership and rollback plans.
How do we right-size safely without causing outages?
Use 14–30 days of metrics, size to p95 (not average), keep headroom, and make changes in small steps. Validate with user-facing signals (latency/error rate) and always document a rollback path.
Should we focus on reserved instances / savings plans / committed use?
Yes, but after you fix the baseline. Commitments amplify whatever baseline you have: if it’s oversized, you lock in waste. Start conservative on steady-state capacity, prefer flexible commitments when uncertain, and revisit monthly.
How important are tags for cost optimization?
Tags (or labels) are foundational. Without allocation, you can’t answer “who owns this spend,” and optimization becomes slow and political. Aim to reduce “unallocated” spend steadily and enforce tagging on new resources.
What are the biggest “hidden” cloud cost drivers?
Data transfer/egress, NAT gateway processing, managed database replicas and provisioned IOPS, log ingestion and retention, and Kubernetes request inflation are common surprises. The fix is usually visibility (dashboards/alerts) plus a small design change (routing, caching, autoscaling, retention).
How often should we review cloud spend?
At least monthly for a comprehensive review, with weekly checks for anomalies in high-spend environments. A lightweight cadence that works: weekly “top anomalies” plus a monthly “top services + top teams + unallocated spend” review.
Cheatsheet
Use this for a monthly cloud cost optimization review. It’s intentionally compact: the goal is to spot the biggest leaks fast and assign owners.
Monthly review (30–45 minutes)
- Top 10 services by cost (trend up/down)
- Top 10 projects/accounts by cost
- Top 10 teams/services by cost (requires tags)
- Unallocated spend % (missing tags) and top offenders
- Biggest anomalies (day/week spikes)
Weekly quick checks (10 minutes)
- New resources without required tags
- New public IPs / load balancers without traffic
- Egress and NAT costs spike check
- Log ingestion growth and retention exceptions
- Dev/test schedules still enforced
“Before cleanup” safety checklist
| Resource type | Verify | Safe first move |
|---|---|---|
| Instances/VMs | Owner + last CPU/network activity + attached volumes | Stop in non-prod; snapshot critical disks |
| Volumes/disks | Attachment status + last read/write | Tag “cleanup-candidate”; snapshot before delete |
| Load balancers | Targets + traffic + DNS references | Disable listeners / remove DNS, then delete |
| Snapshots/backups | Retention policy + restore test for critical paths | Implement lifecycle; keep recent restore points |
| Logs/artifacts | Compliance/incident requirements | Reduce retention + archive cold storage if needed |
Make the default path cost-aware: enforce tags, enforce retention, and enforce schedules for non-prod. Then you only hunt exceptions.
Wrap-up
Cloud cost optimization doesn’t require a massive rewrite. In your first month, you can save real money by fixing the obvious leaks: delete unused resources, reduce retention, right-size the biggest workloads, make egress visible, and enforce ownership with tags. The lasting payoff comes from guardrails that prevent regressions.
Your next actions
- Today: enable budgets/alerts, set log retention defaults, and run an idle resource inventory
- This week: clean up 3 low-risk leak categories (unattached disks, unused IPs, zombie LBs)
- This month: right-size the top 5 spenders and reduce “unallocated spend” via enforced tags
- Ongoing: monthly review cadence + automation for schedules and lifecycles
If you want to make these changes safer and more repeatable, pair cost work with infrastructure-as-code and runbooks. The related posts below cover Terraform pitfalls, CI/CD patterns, and Kubernetes basics that help keep cost changes controlled.