Cloud cost optimization is rarely about one “big lever.” It’s about stopping dozens of small leaks: overprovisioned compute, storage that never expires, idle environments, surprise data transfer, and services no one remembers creating. The good news: you can get real savings this month without risky refactors.
Quickstart
If you only have an hour, start here. These steps prioritize “low-risk, high-confidence” savings: things that are obviously unused, obviously oversized, or missing guardrails.
Day 1: Stop the bleeding (60–90 minutes)
- Turn on cost alerts: a basic budget + anomaly detection (per account/project and for the biggest services)
- Find idle resources: stopped VMs with attached disks, unattached volumes, unused IPs, orphaned load balancers
- Set log retention: avoid “forever logs” (pick a default retention that matches your compliance needs)
- Add a “Who owns this?” tag: team/service/env on new resources (even before you fix the past)
Day 2–7: Quick wins that usually pay back
- Right-size the top 5 compute spenders using 14–30 days of metrics (CPU, memory, and disk I/O)
- Storage lifecycle: move older objects/snapshots to cheaper tiers or delete by policy
- Kill zombie images in registries and stale artifacts in CI storage
- Schedule dev/test to shut down nights/weekends (or migrate to on-demand ephemeral envs)
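To put numbers on the shutdown-schedule bullet: a dev environment that runs only working hours pays for a fraction of an always-on month. A quick sketch, assuming 12 on-hours on weekdays and full weekend shutdown (all numbers illustrative):

```python
# Rough savings estimate for scheduling dev/test environments off
# nights and weekends. All numbers are illustrative assumptions.

HOURS_PER_WEEK = 24 * 7  # 168

def scheduled_savings(on_hours_per_weekday: float = 12, weekend_on_hours: float = 0) -> float:
    """Fraction of compute cost saved by an on/off schedule."""
    on_hours = on_hours_per_weekday * 5 + weekend_on_hours
    return 1 - on_hours / HOURS_PER_WEEK

# 12h weekdays, off on weekends: roughly 64% of always-on cost saved.
print(f"{scheduled_savings():.0%}")
```

Even a loose schedule (say, 16 on-hours every day) still cuts a third of the bill, which is why this is usually the highest-leverage change for non-prod.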
Triage what you find
- Unused (safe to delete after verifying) → IPs, unattached disks, empty load balancers
- Over-retained (safe to reduce) → logs, snapshots, artifacts
- Overprovisioned (safe to tune) → instances, databases, Kubernetes requests
- Architectural (bigger projects) → data transfer patterns, NAT gateways, chatty services
Before you delete anything
- Verify owner + last access (tags, audit logs, metrics, or service catalog)
- Take a snapshot/backup if the blast radius is unclear
- Prefer “disable / detach / quarantine” over “hard delete” for the first pass
- Document what you changed and how to roll back
Overview
“Cloud spend” feels mysterious when it arrives as a single monthly invoice. Cloud cost optimization becomes straightforward once you split costs into drivers (compute, storage, network, managed services) and ownership (teams/services/environments). This post gives you a month-long plan and a list of 15 common leaks you can fix with minimal drama.
What you’ll walk away with
- A practical sequence: visibility → cleanup → right-sizing → commitments → guardrails
- 15 cost leaks (compute, storage, network, Kubernetes, process) with fixes and gotchas
- Templates you can copy into your workflow (audit scripts, tagging discipline, Kubernetes resource patterns)
- A scan-friendly cheatsheet for recurring monthly cost reviews
What “real savings” looks like
- Fewer idle resources (paying for nothing)
- Right-sized baselines (paying for what you actually use)
- Shorter retention (logs/snapshots/artifacts aren’t forever by default)
- Less surprise egress (network and NAT costs are visible and controlled)
- Accountability (teams can see and own their spend)
Cost optimization is a product, not a cleanup sprint
The first month gets you wins. The long-term payoff comes from guardrails: defaults (tagging, retention), automation (schedules, policies), and a recurring review cadence that keeps leaks from returning.
Core concepts
FinOps: the “operating system” for cloud spend
FinOps is simply the collaboration layer between engineering, finance, and product that turns cloud bills into actionable signals. You don’t need a big program to start—you need clear ownership, good data (tags), and a habit of making tradeoffs visible.
The four cost drivers
| Driver | What causes spend | Most common leak | Fastest fix |
|---|---|---|---|
| Compute | Instance size, runtime hours, autoscaling, baseline capacity | Overprovisioned instances and always-on dev | Right-size + schedules |
| Storage | GB-month, snapshots, backup retention, hot vs cold tiers | “Retain forever” defaults | Lifecycle + retention policy |
| Network | Egress, cross-AZ/region traffic, NAT gateway processing | Chatty services through expensive paths | Make egress visible + route efficiently |
| Managed services | Provisioned capacity, IOPS, replicas, throughput units | Oversized DBs and unused replicas | Downsize + autoscale/auto-pause where appropriate |
Unit economics: the metric that stops arguments
“We spent $X” is rarely useful on its own. Better: cost per unit (per request, per user, per job, per GB processed). Unit economics helps you decide whether to optimize, refactor, or accept spend as the price of growth.
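The calculation itself is trivial; the discipline is measuring spend and demand over the same window. A sketch with hypothetical figures:

```python
# Cost per unit: spend divided by a demand metric over the same window.
# Figures below are hypothetical examples, not real billing data.

def cost_per_unit(total_cost: float, units: float) -> float:
    if units <= 0:
        raise ValueError("need a positive unit count")
    return total_cost / units

# Spend doubled, but requests tripled: unit cost actually fell.
last_month = cost_per_unit(12_000, 30_000_000)   # $ per request
this_month = cost_per_unit(24_000, 90_000_000)
print(f"last: ${last_month:.6f}/req  this: ${this_month:.6f}/req")
```

In this example, "spend doubled" sounds alarming, while "cost per request dropped by a third" is a win. The unit metric is what turns the monthly invoice into a decision.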
Tags and allocation: you can’t fix what you can’t attribute
If you can’t map spend to a team/service/environment, optimization becomes guesswork. The simplest tagging strategy that works at scale is:
- team: who owns the resource
- service: what system it belongs to
- env: prod / staging / dev
- cost-center (optional): how finance groups it
The fastest savings usually come from deleting unused things. The fastest repeatable savings come from making costs visible to the people who can change them.
Risk-managed optimization
The most expensive cloud cost project is the one that causes downtime. Favor changes that are: reversible (easy rollback), measurable (you can see impact), and scoped (limited blast radius).
Step-by-step
This is a practical month plan for cloud cost optimization. You can run it with a small team and basic access to your billing data, cloud inventory, and metrics.
Week 1 — Visibility and guardrails
Do this
- Create a top spend dashboard (top services, top projects/accounts, top teams)
- Enable budgets + alerts for the biggest categories and for production accounts
- Define tag standards (team/service/env) and enforce them for new resources
- Set default log retention and require explicit exceptions
Watch out for
- “One giant shared account/project” hiding ownership
- Tagging that is optional (optional usually means “never”)
- Dashboards with too much detail (start with top 10)
- Alerts that spam (route to owners, tune thresholds)
Week 2 — Delete and downgrade (low-risk savings)
The goal is to find things that are provably unused or over-retained. This week is where most teams get quick wins without changing application code.
Inventory idle resources with a simple CLI sweep
This bash script is intentionally conservative: it lists candidates for review (it does not delete). Run it from a workstation or a CI job with read-only credentials. Adapt the commands to your cloud provider (the pattern is the same everywhere).
#!/usr/bin/env bash
set -euo pipefail
# Quick AWS inventory for common cost leaks (read-only).
# Requirements: aws CLI configured. Optional: jq for nicer output.
# Tip: start in non-prod accounts/projects.

echo "== Stopped EC2 instances (still pay for attached storage/EIPs) =="
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --query "Reservations[].Instances[].[InstanceId,InstanceType,State.Name,Tags]" \
  --output table || true

echo "== Unattached EBS volumes (often pure waste) =="
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query "Volumes[].[VolumeId,Size,VolumeType,CreateTime,Tags]" \
  --output table || true

echo "== Elastic IPs not associated (hourly cost) =="
aws ec2 describe-addresses \
  --query "Addresses[?AssociationId==null].[PublicIp,AllocationId,Tags]" \
  --output table || true

echo "== Load balancers with no targets (common zombie) =="
# You may have ALB/NLB/CLB; start with target groups for ALB/NLB.
# Note: --output text prints ARNs tab-separated on one line, so split first.
aws elbv2 describe-target-groups \
  --query "TargetGroups[].TargetGroupArn" --output text 2>/dev/null \
  | tr '\t' '\n' | while read -r arn; do
    [[ -z "$arn" ]] && continue
    count="$(aws elbv2 describe-target-health --target-group-arn "$arn" \
      --query "length(TargetHealthDescriptions)" --output text 2>/dev/null || echo 0)"
    if [[ "$count" == "0" ]]; then
      echo "Empty target group: $arn"
    fi
  done || true

echo "Done. Review candidates with owners before cleanup."
Before acting on any candidate:
- Tag the resource with cleanup-candidate=true and a date before deletion
- Notify owners and wait a short window (e.g., 3–7 days)
- Prefer “disable/detach” first when unsure
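One way to make the review window concrete is to compute the quarantine tag set and an earliest-delete date up front, and refuse to delete until that date has passed. The helper below is a sketch (the tag keys are assumptions following the convention above); feed the resulting tags into your provider's tagging API:

```python
from datetime import date, timedelta

# Build quarantine tags for cleanup candidates: tag first, delete only
# after a review window has elapsed. Tag keys are illustrative.

def quarantine_tags(today: date, review_days: int = 7) -> dict:
    return {
        "cleanup-candidate": "true",
        "cleanup-tagged-on": today.isoformat(),
        "cleanup-earliest-delete": (today + timedelta(days=review_days)).isoformat(),
    }

def safe_to_delete(tags: dict, today: date) -> bool:
    """Delete only if the resource was quarantined and the window elapsed."""
    if tags.get("cleanup-candidate") != "true":
        return False
    earliest = tags.get("cleanup-earliest-delete")
    # ISO dates compare correctly as strings (YYYY-MM-DD).
    return earliest is not None and today.isoformat() >= earliest

tags = quarantine_tags(date(2024, 3, 1))
print(safe_to_delete(tags, date(2024, 3, 5)))  # False: still in review window
print(safe_to_delete(tags, date(2024, 3, 8)))  # True: window elapsed
```

The point of encoding the date in a tag (rather than a spreadsheet) is that any script or person looking at the resource sees the same deadline.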
Week 3 — Right-sizing (where big savings often live)
Right-sizing means aligning capacity with actual usage. The trap is right-sizing based on a single “quiet day.” Use 14–30 days of metrics, and always keep a safety margin for spikes.
Right-sizing checklist
- Pick the top 5–10 workloads by cost (compute + DB + Kubernetes nodes)
- Look at p95 usage (CPU, memory), not average
- Reduce one notch at a time; validate with load/latency and error rates
- Document the change and how to revert
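The "p95, not average" point is easy to demonstrate. A minimal sketch using a nearest-rank percentile and synthetic per-minute CPU samples:

```python
# Why p95, not average: spiky workloads look idle on average.
# Sample data below is synthetic, for illustration only.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (simple, no interpolation)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Mostly idle, with periodic spikes (CPU %):
cpu = [5.0] * 90 + [80.0] * 10
avg = sum(cpu) / len(cpu)
p95 = percentile(cpu, 95)
print(f"avg={avg:.1f}%  p95={p95:.1f}%")  # average says idle; p95 says not
```

Sizing to the 12.5% average here would throttle every spike; sizing to p95 plus headroom keeps the spikes inside capacity while still shrinking the instance.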
Where right-sizing hides
- Databases with provisioned IOPS or oversized replicas
- Kubernetes requests that are 2–10× higher than real usage
- Caches (Redis/Memcached) that were sized for “one incident” and never revisited
- Always-on dev/test “just in case”
Spot missing tags and top cost drivers from a billing export
You don’t need a perfect FinOps platform to start. Export your billing data (CSV), then group spend by service and tag coverage. This Python script assumes a CSV with at least cost, service, and an optional team (or similar) column. Rename columns to match your export (e.g., CUR, BigQuery billing export, or a provider console download).
#!/usr/bin/env python3
import csv
import sys
from collections import defaultdict

# Usage:
#   python3 cost_report.py billing.csv
#
# Expected columns (rename to match your export):
#   - service: e.g., "Compute Engine", "EC2", "S3"
#   - cost: numeric (monthly or daily)
#   - team: optional tag/label (empty means "unallocated")

path = sys.argv[1] if len(sys.argv) > 1 else None
if not path:
    print("Usage: python3 cost_report.py <billing.csv>", file=sys.stderr)
    sys.exit(2)

by_service = defaultdict(float)
by_team = defaultdict(float)
unallocated = 0.0
total = 0.0

with open(path, newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        try:
            cost = float(row.get("cost", "0") or 0)
        except ValueError:
            continue  # skip non-numeric rows (subtotals, blank lines)
        service = (row.get("service") or "unknown").strip() or "unknown"
        team = (row.get("team") or "").strip()
        total += cost
        by_service[service] += cost
        if team:
            by_team[team] += cost
        else:
            unallocated += cost

def top(d, n=10):
    return sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:n]

print("== Top services ==")
for svc, c in top(by_service, 10):
    print(f"{svc:30s} {c:12.2f}")

print("\n== Top teams (allocated) ==")
for team, c in top(by_team, 10):
    print(f"{team:30s} {c:12.2f}")

print("\n== Allocation ==")
pct_unalloc = (unallocated / total * 100.0) if total > 0 else 0.0
print(f"Total cost: {total:.2f}")
print(f"Unallocated cost: {unallocated:.2f} ({pct_unalloc:.1f}%)")
if pct_unalloc > 10.0:
    print("\nTip: Add/standardize tags (team/service/env). "
          "Unallocated spend hides the biggest leaks.")
Week 4 — Commitments, scheduling, and “keep it fixed” automation
Once your baseline is reasonable, you can safely apply longer-term optimizations: reserved capacity / savings plans / committed use, autoscaling and schedules, and policy enforcement so old problems don’t creep back.
Commitments (use only after right-sizing)
- Identify steady-state workloads (always-on prod capacity)
- Start with a conservative coverage target (e.g., a portion of baseline)
- Prefer flexibility when unsure (broader commitments over narrow instance types)
- Revisit monthly as usage changes
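One conservative way to pick that coverage target: commit only to capacity you use essentially all the time, and to only a fraction of it. A sketch, assuming you can export hourly instance counts (numbers are synthetic):

```python
# Conservative commitment sizing: find the floor of observed usage, then
# commit to a fraction of that floor. Sample data is synthetic.

def commit_target(hourly_usage: list[int], coverage: float = 0.8) -> int:
    """Commit to `coverage` of the minimum observed hourly usage."""
    floor = min(hourly_usage)
    return int(floor * coverage)

# A service that scales between 10 and 40 instances over the day:
usage = [10] * 8 + [25] * 8 + [40] * 8
print(commit_target(usage))  # commit to 8; expand later if the floor holds
```

Everything above the committed floor stays on-demand or spot, so a drop in traffic never leaves you paying for locked-in capacity you no longer use.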
Automation that prevents regression
- Schedules for dev/test and ephemeral environments
- Storage lifecycle policies and retention defaults
- Policy-as-code: require tags and block obvious waste patterns
- Monthly review: top changes, biggest anomalies, biggest unallocated spend
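Policy-as-code can start as a tiny check in CI that rejects resources missing the required tags. A sketch, assuming resources arrive as dicts (for example, parsed from your IaC plan or an inventory export):

```python
# Minimal "require tags" policy check. The resource dicts are a stand-in
# for whatever your IaC plan or inventory export actually produces.

REQUIRED_TAGS = ("team", "service", "env")

def missing_tags(resource: dict) -> list[str]:
    tags = resource.get("tags") or {}
    return [k for k in REQUIRED_TAGS if not tags.get(k)]

resources = [
    {"id": "vol-1", "tags": {"team": "payments", "service": "api", "env": "prod"}},
    {"id": "vol-2", "tags": {"env": "dev"}},
]
violations = {r["id"]: missing_tags(r) for r in resources if missing_tags(r)}
print(violations)  # {'vol-2': ['team', 'service']}
```

Fail the pipeline on any violation for new resources; for existing ones, report the list to owners first so enforcement doesn't block unrelated deploys.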
The 15 leaks (and what to do about them)
Use this as your backlog. Pick the top 3 leaks by impact in your environment and fix them end-to-end (visibility → owner → change → verification).
| # | Leak | Where it shows up | Fix this month |
|---|---|---|---|
| 1 | Overprovisioned instances/VMs | Compute | Right-size using p95 metrics; validate performance |
| 2 | Stopped instances with attached storage | Compute + disks | Delete/hibernate properly; snapshot if needed |
| 3 | Unattached volumes / orphaned disks | Storage | Identify “available/unattached”; tag + delete after review |
| 4 | Unused public IPs | Network | Release unassociated IPs; document reservation needs |
| 5 | Zombie load balancers / empty target groups | Network | Remove LBs with no traffic/targets; update DNS |
| 6 | Snapshots and backups retained forever | Storage | Retention policy per env; prune old snapshots safely |
| 7 | Logs retained forever | Observability | Default retention; archive cold logs if required |
| 8 | Object storage stuck in hot tier | Storage | Lifecycle rules: infrequent access / archive / delete |
| 9 | Container registry bloat | Artifacts | Keep N tags; delete unreferenced layers/images |
| 10 | CI artifacts and caches never cleaned | CI/CD | Retention and size limits; purge old builds |
| 11 | Chatty services through expensive paths | Network/NAT | Reduce cross-AZ/region traffic; route locally |
| 12 | Unexpected egress (downloads/exports) | Network | Alert on egress spikes; use CDN/cache; keep data local |
| 13 | Oversized databases / replicas | Managed DB | Right-size, remove unused replicas, scale storage properly |
| 14 | Kubernetes request inflation | Kubernetes | Set sane requests/limits; autoscale; avoid idle nodes |
| 15 | Missing tags / unallocated spend | All | Enforce tags; fix top offenders; add ownership |
Most leaks return because the default path is wasteful (no tags, infinite retention, always-on environments). If you fix the defaults, you stop paying the same “cloud tax” every quarter.
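Fixing the "retain forever" default (leaks 6–8) usually reduces to a policy like "keep everything newer than N days." A sketch of the selection logic, with hypothetical snapshot IDs and dates; wire it to your provider's snapshot listing:

```python
from datetime import date, timedelta

# Select snapshots older than a retention window as prune candidates.
# IDs and dates are hypothetical, for illustration only.

def prune_candidates(snapshots: dict[str, date], today: date, keep_days: int = 30) -> list[str]:
    """Return snapshot IDs created before the retention cutoff."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(sid for sid, created in snapshots.items() if created < cutoff)

snaps = {
    "snap-a": date(2024, 1, 1),
    "snap-b": date(2024, 2, 20),
    "snap-c": date(2024, 3, 1),
}
print(prune_candidates(snaps, today=date(2024, 3, 5)))  # ['snap-a']
```

Run it as a report first; once owners trust the output, the same selection can drive automated lifecycle deletion.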
Kubernetes cost control: requests, limits, and autoscaling
In Kubernetes, overestimated requests lead to wasted nodes. Underestimated requests lead to throttling and instability. The pattern below sets a reasonable baseline and lets an autoscaler handle bursts. Tune values with real usage metrics.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
- Start by setting realistic requests (based on p95 usage) before touching limits
- Keep some headroom; let the HPA scale pods rather than over-requesting resources
- Watch node utilization and pod evictions after changes
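The HPA's scaling decision follows a formula you can reason about directly: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch:

```python
import math

# The HPA scaling rule for a Utilization target (autoscaling/v2):
# desired = ceil(current * currentMetric / targetMetric).

def desired_replicas(current: int, current_util: float, target_util: float) -> int:
    return math.ceil(current * current_util / target_util)

# 2 replicas averaging 140% CPU against a 70% target: scale out to 4.
print(desired_replicas(2, 140, 70))
# 4 replicas averaging 35% against a 70% target: scale in to 2.
print(desired_replicas(4, 35, 70))
```

Note that "utilization" here is measured against the pod's CPU request, which is why inflated requests make the HPA think the service is idle and keep nodes underpacked.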
Common mistakes
Most cost optimization efforts fail for predictable reasons: they optimize the wrong thing, break something, or don’t stick. Here are the pitfalls to avoid (and what to do instead).
Mistake: Optimizing before you have ownership
If no one owns spend, no one maintains the fix.
- Fix: tag standards (team/service/env) + an “unallocated spend” target.
- Fix: route alerts to owners, not a shared inbox.
Mistake: Deleting resources without verifying impact
A cheap win that causes downtime becomes an expensive incident.
- Fix: identify last access, check dependencies, and snapshot first when unsure.
- Fix: use “quarantine tags” and short review windows.
Mistake: Right-sizing based on averages
Averages hide spikes; spikes are what break production.
- Fix: use p95/p99, add headroom, and downsize in steps.
- Fix: validate with latency/error signals (not just “CPU looks fine”).
Mistake: Buying commitments too early
Commitments lock in waste if the baseline is oversized.
- Fix: right-size first, then commit to steady-state capacity.
- Fix: start conservative and increase coverage gradually.
Mistake: Ignoring network and data transfer
Egress and NAT costs can quietly rival compute.
- Fix: make egress a first-class metric and alert on spikes.
- Fix: reduce cross-zone/region chatter; use caching/CDNs where appropriate.
Mistake: “One-time cleanup” with no guardrails
Leaks return because defaults stay wasteful.
- Fix: enforce tags, retention, and schedules by policy.
- Fix: add a monthly cost review cadence and track regression.
If you can answer “what changed” when spend moves (up or down), your cost program is working. If every spike is a mystery, fix allocation and visibility first.
FAQ
What is cloud cost optimization, in practical terms?
Cloud cost optimization is the discipline of reducing waste while keeping reliability and performance intact. Practically, it means deleting unused resources, right-sizing what remains, using the right storage tiers and retention, managing data transfer, and adding guardrails so leaks don’t reappear.
Where do most teams find the first meaningful savings?
Usually in idle resources (stopped instances, unattached volumes, unused IPs/load balancers), over-retained data (logs/snapshots/artifacts), and overprovisioned compute. These are measurable and typically low-risk when handled with ownership and rollback plans.
How do we right-size safely without causing outages?
Use 14–30 days of metrics, size to p95 (not average), keep headroom, and make changes in small steps. Validate with user-facing signals (latency/error rate) and always document a rollback path.
Should we focus on reserved instances / savings plans / committed use?
Yes, but after you fix the baseline. Commitments amplify whatever baseline you have: if it’s oversized, you lock in waste. Start conservative on steady-state capacity, prefer flexible commitments when uncertain, and revisit monthly.
How important are tags for cost optimization?
Tags (or labels) are foundational. Without allocation, you can’t answer “who owns this spend,” and optimization becomes slow and political. Aim to reduce “unallocated” spend steadily and enforce tagging on new resources.
What are the biggest “hidden” cloud cost drivers?
Data transfer/egress, NAT gateway processing, managed database replicas and provisioned IOPS, log ingestion and retention, and Kubernetes request inflation are common surprises. The fix is usually visibility (dashboards/alerts) plus a small design change (routing, caching, autoscaling, retention).
How often should we review cloud spend?
At least monthly for a comprehensive review, with weekly checks for anomalies in high-spend environments. A lightweight cadence that works: weekly “top anomalies” plus a monthly “top services + top teams + unallocated spend” review.
Cheatsheet
Use this for a monthly cloud cost optimization review. It’s intentionally compact: the goal is to spot the biggest leaks fast and assign owners.
Monthly review (30–45 minutes)
- Top 10 services by cost (trend up/down)
- Top 10 projects/accounts by cost
- Top 10 teams/services by cost (requires tags)
- Unallocated spend % (missing tags) and top offenders
- Biggest anomalies (day/week spikes)
Weekly quick checks (10 minutes)
- New resources without required tags
- New public IPs / load balancers without traffic
- Egress and NAT costs spike check
- Log ingestion growth and retention exceptions
- Dev/test schedules still enforced
“Before cleanup” safety checklist
| Resource type | Verify | Safe first move |
|---|---|---|
| Instances/VMs | Owner + last CPU/network activity + attached volumes | Stop in non-prod; snapshot critical disks |
| Volumes/disks | Attachment status + last read/write | Tag “cleanup-candidate”; snapshot before delete |
| Load balancers | Targets + traffic + DNS references | Disable listeners / remove DNS, then delete |
| Snapshots/backups | Retention policy + restore test for critical paths | Implement lifecycle; keep recent restore points |
| Logs/artifacts | Compliance/incident requirements | Reduce retention + archive cold storage if needed |
Make the default path cost-aware: enforce tags, enforce retention, and enforce schedules for non-prod. Then you only hunt exceptions.
Wrap-up
Cloud cost optimization doesn’t require a massive rewrite. In your first month, you can save real money by fixing the obvious leaks: delete unused resources, reduce retention, right-size the biggest workloads, make egress visible, and enforce ownership with tags. The lasting payoff comes from guardrails that prevent regressions.
Your next actions
- Today: enable budgets/alerts, set log retention defaults, and run an idle resource inventory
- This week: clean up 3 low-risk leak categories (unattached disks, unused IPs, zombie LBs)
- This month: right-size the top 5 spenders and reduce “unallocated spend” via enforced tags
- Ongoing: monthly review cadence + automation for schedules and lifecycles
If you want to make these changes safer and more repeatable, pair cost work with infrastructure-as-code and runbooks. The related posts below cover Terraform pitfalls, CI/CD patterns, and Kubernetes basics that help keep cost changes controlled.