Terraform is at its best when infrastructure changes feel boring: review a plan, apply, move on. It’s at its worst when “just one tiny change” triggers a 600-resource diff, a broken state lock, and a weekend of “why is prod drifting?” This post is a practical tour of the most common Terraform mistakes—especially around state, modules, and the “one big plan” trap—plus patterns that keep your IaC maintainable as your cloud grows.
Quickstart
If you only do a few things to avoid painful Terraform mistakes, do these. They reduce blast radius, increase safety, and make “plan” output trustworthy again.
1) Move state to a remote backend (with locking)
Local state is fine for a tutorial. For shared infrastructure it’s a foot-gun: no locking, no history, easy to lose. Remote state + locking prevents concurrent applies and gives you a single source of truth.
- Pick one backend per environment (dev/stage/prod)
- Enable state locking (where supported)
- Restrict access: state files often contain sensitive values
2) Split “one big state” into smaller stacks
Keep blast radius small: networking, shared platform, and each application stack should typically have independent state. Smaller states mean faster plans, clearer diffs, and safer rollouts.
- Separate shared foundations from app stacks
- Define clear ownership (“who applies this?”)
- Use remote state outputs only when needed
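When a downstream stack genuinely needs another stack's outputs, a `terraform_remote_state` data source keeps the dependency explicit and read-only. A minimal sketch, assuming an S3 backend and hypothetical bucket/key names:

```hcl
# Hypothetical: the app stack reads only the outputs the
# networking stack deliberately exposes.
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"              # assumed bucket name
    key    = "prod/networking/terraform.tfstate" # assumed state key
    region = "eu-central-1"
  }
}

# Consume a stable, documented output, never networking internals.
locals {
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```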
3) Treat modules as interfaces (not copy-paste)
A good module is a stable contract: inputs, outputs, and predictable behavior. A bad module is a pile of resources that leaks implementation details and becomes hard to change.
- Keep module inputs small and intentional
- Expose outputs that downstream stacks actually need
- Pin module versions and document upgrade steps
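Pinning works both for registry modules (a version constraint) and for git sources (a tag ref). A sketch with hypothetical module sources and versions:

```hcl
# Registry module: pin an exact version; bump it in a dedicated PR.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.8.1" # assumed version, upgrade intentionally
  # ...inputs...
}

# Git module: pin a tag instead of tracking a moving branch.
module "app_service" {
  source = "git::https://example.com/acme/terraform-modules.git//app_service?ref=v1.4.0"
  # ...inputs...
}
```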
4) Make “plan review” a first-class step
Most Terraform disasters come from applying a plan nobody reviewed, or from comparing plans across different environments. Save plans, review them, and apply exactly what was reviewed.
- Run `terraform fmt` + `terraform validate` in CI
- Generate a plan file for apply (no surprise diffs)
- Require approvals for production applies
Don’t start by rewriting everything. Start by reducing risk: remote state + locking + smaller stacks + repeatable plan/apply. You’ll immediately feel the difference.
Overview
Terraform mistakes usually aren’t “syntax mistakes.” They’re design mistakes that only show up after you scale: more engineers, more environments, more modules, more resources. The failure modes are predictable: state gets messy, module boundaries get fuzzy, and a single command tries to change the entire world.
What this post covers
- State safety: where state lives, how locking works, and why drift becomes expensive
- Module sanity: how to design module interfaces that stay stable over time
- The “one big plan” trap: why monolithic stacks create giant diffs and risky rollouts
- Practical steps: a blueprint to split stacks, reduce blast radius, and build a repeatable workflow
The real goal
You don’t want “more Terraform.” You want infrastructure changes that are:
- Predictable (plan matches apply)
- Reviewable (diffs are understandable)
- Reproducible (same inputs → same result)
- Low blast radius (mistakes don’t take down everything)
A quick mental model
Think of each Terraform state as a “deployment unit.” If you wouldn’t deploy all services in your company with one release button, you probably shouldn’t manage them with one state either.
- One state = one blast radius
- One state = one lock
- One state = one team’s ownership (ideally)
If you’re already running Terraform in production, jump to Step-by-step. If you’re new, skim Core concepts first so the fixes make sense.
Core concepts
Before we talk about fixes, you need three foundational ideas: what state really is, what modules are really for, and why “one big plan” feels convenient right up until it doesn’t.
1) Terraform state: the truth Terraform uses
Terraform doesn’t “discover” your infrastructure from scratch each run. It tracks what it created in a state file, and uses that state to compute diffs. That’s why state is both powerful and dangerous: if state is wrong, the plan can be wrong.
What state contains (and why you should care)
| State contains… | Why it matters | Common risk |
|---|---|---|
| Resource addresses + IDs | Maps Terraform config to real cloud objects | Refactors can “lose” resources without careful moves |
| Last-known attributes | Used to compute diffs and detect drift | Manual changes create surprising plans |
| Outputs | How stacks share data (URLs, ARNs, IDs) | Leaky coupling between stacks |
| Potential secrets | Some providers store sensitive values | State exposure is a security incident |
Even if you mark variables as sensitive, parts of state may still be sensitive depending on provider behavior. Treat state storage like a production secret store: restrict access, log access, and avoid copying it around.
2) Remote backends + locking: preventing “two people applied at once”
A remote backend centralizes state storage. Locking prevents two applies from running concurrently on the same state. Without locking, you can get conflicting updates, partial applies, and the classic “Terraform is haunted” feeling.
3) Modules: reusable building blocks with stable interfaces
Modules are best used to enforce consistency: naming, tagging, network rules, IAM patterns, and “known good” defaults. They’re worst used as “everything in one module” or as copy-paste folders that diverge immediately.
Good module traits
- Clear, small input surface
- Documented defaults
- Outputs designed for consumers
- Versioned changes (upgrade path)
Bad module smells
- Hundreds of variables “just in case”
- Hidden behavior (side effects you can’t control)
- Hard-coded environment assumptions
- Consumers depend on internal resource names
4) The “one big plan” trap: why monolith stacks fail at scale
The trap looks like this: you start with one repo and one root module. It’s fast. It’s simple. Then you add environments, shared resources, multiple teams, and many modules. Suddenly: every plan is huge, every apply takes forever, and you can’t change one app without touching ten others.
Why it happens
- Blast radius: one plan controls everything
- Coupling: stacks share too many implicit dependencies
- Lock contention: one state lock blocks unrelated work
- Diff noise: tiny changes get buried in massive output
- Rollout risk: one mistake affects many services
Step-by-step
This is a practical guide to escape fragile Terraform setups. You can apply it whether you’re starting fresh or refactoring an existing monolith. The goal is repeatability: small plans, safe applies, clear ownership.
Step 1 — Choose your state boundaries (stacks)
Start by splitting your infrastructure into “deployment units” that can change independently. The best boundaries often match ownership and failure domains, not cloud services.
A stack split that works for many teams
| Stack | Contains | Change frequency | Notes |
|---|---|---|---|
| foundation | Org-level IAM, KMS, DNS base, audit/logging | Rare | High blast radius → strict review |
| networking | VPC/VNet, subnets, routing, shared endpoints | Occasional | Stable outputs consumed by many |
| platform | Kubernetes/ECS cluster, shared databases, registries | Occasional | Owned by platform team |
| apps/<service> | Service-specific compute, queues, alarms, config | Frequent | Independent rollouts per service |
If a team needs to apply changes daily, don’t put their resources in a state that also controls rarely-changing foundations. Frequent + rare changes in one state is how “one big plan” becomes permanent.
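One way to lay this out on disk is one directory per stack, each with its own backend key. A hypothetical structure (names are illustrative):

```
live/
  prod/
    foundation/    # own state: org IAM, KMS, DNS base
    networking/    # own state: VPC, subnets, routing
    platform/      # own state: cluster, registries, shared DBs
    apps/
      payments/    # own state: one service, frequent applies
      checkout/
modules/           # shared, versioned building blocks
```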
Step 2 — Set up remote state + locking
Remote state is table stakes for collaboration. Locking prevents concurrent applies. Access control keeps state safe. Here’s an example backend configuration for an AWS-style setup (adapt it to your cloud and backend choice).
```hcl
terraform {
  required_version = ">= 1.6.0"

  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "acme-terraform-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = "eu-central-1"

  default_tags {
    tags = {
      managed_by = "terraform"
      env        = "prod"
    }
  }
}
```
Backend checklist
- State storage is encrypted at rest
- Locking is enabled and reliable
- Access is least-privilege (read vs write)
- Audit logs exist for state access
Common gotchas
- Changing the backend key moves the state location (intentional, but risky)
- State locks can persist if a run crashes (know how to recover safely)
- Multiple CI jobs on the same state cause lock contention (use separate stacks)
Step 3 — Design modules like products (inputs/outputs as a contract)
A module should hide internal resource naming and expose a stable API. The easiest way to enforce this is to keep the module’s variable list small, name inputs after business intent, and export only what downstream stacks need.
```hcl
# modules/app_service/main.tf (sketch)

variable "name" {
  type        = string
  description = "Service name used for naming and tagging."
}

variable "env" {
  type        = string
  description = "Environment (dev/stage/prod)."
}

variable "subnet_ids" {
  type        = list(string)
  description = "Where the service runs."
}

variable "image" {
  type        = string
  description = "Container image (immutable tag or digest preferred)."
}

# ...resources go here...
# - compute (ECS/EKS/VM)
# - security group rules
# - autoscaling
# - alarms

output "service_url" {
  description = "Public URL or internal endpoint for consumers."
  value       = "https://example.invalid/${var.name}"
}
```

```hcl
# root stack usage (apps/payments/main.tf)
module "payments" {
  source     = "../../modules/app_service"
  name       = "payments"
  env        = "prod"
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  image      = "registry.example.com/payments@sha256:deadbeef..."
}
```
Modules should compose cleanly, not depend on each other in circles. If module A needs deep internals of module B, you likely need a higher-level “stack” boundary or a better output contract.
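Composition in practice means the root stack passes one module's outputs into another, so the dependency is an explicit contract rather than a reach into internals. A sketch with hypothetical module paths and names:

```hcl
# Hypothetical root stack wiring two modules together.
module "network" {
  source = "./modules/network"
  cidr   = "10.0.0.0/16"
}

module "service" {
  source     = "./modules/app_service"
  name       = "api"
  # Explicit contract: the service only sees the output the
  # network module chose to expose, not its internal resources.
  subnet_ids = module.network.private_subnet_ids
}
```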
Step 4 — Build a repeatable plan/apply workflow (human + CI)
The safest Terraform workflow is: format, validate, plan, review, apply the exact reviewed plan. This reduces “works on my machine” differences and prevents applying a different plan than the one approved.
```yaml
name: terraform

on:
  pull_request:
  push:
    branches: [ "main" ]

jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5
      - name: Format + validate
        run: |
          terraform fmt -check -recursive
          terraform init -input=false
          terraform validate
      - name: Plan (no apply on PR)
        run: terraform plan -input=false -no-color -out=tfplan
      - name: Upload plan for the apply job
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan

  apply:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    needs: [ plan ]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5
      - name: Download reviewed plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
      - name: Apply (main only)
        run: |
          terraform init -input=false
          # Applying a saved plan is non-interactive: exactly the
          # reviewed changes are applied, nothing else.
          terraform apply -input=false tfplan
```
Workflow checklist
- Same Terraform version in dev + CI
- Plans are generated in CI (not on laptops)
- Applies are gated (approvals for prod)
- Apply uses the saved plan (no surprise diffs)
When you should slow down
- Plans contain replacements for critical resources
- Provider upgrades changed behavior
- State drift is detected (manual changes)
- A refactor changes resource addresses
Step 5 — Refactor safely: move state, don’t recreate resources
Refactors are where teams accidentally destroy production. The key idea: when you change a resource's address (module path, name, `for_each` key), Terraform may think it's a new resource. Use state move operations to preserve identity.
A safe refactor sequence
- Make the refactor in small steps (one logical move at a time)
- Run plan and confirm Terraform is not replacing important resources
- Use state move operations where necessary (treat it like a migration)
- Apply during a low-risk window if blast radius is non-trivial
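Since Terraform 1.1, many state moves can be declared in configuration with a `moved` block, which makes the migration reviewable in the same PR; `terraform state mv` remains the imperative fallback. A sketch with hypothetical addresses:

```hcl
# The instance used to live at aws_instance.web; it now lives
# inside a module. The moved block tells Terraform both addresses
# refer to the same cloud object, so the plan shows a move
# instead of a destroy/create pair.
moved {
  from = aws_instance.web
  to   = module.web.aws_instance.this
}
```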
A healthy Terraform setup produces plans that are small, understandable, and reviewable. If your plans are consistently noisy, it’s a design smell—not a personal failing.
Common mistakes
These are the patterns behind “Terraform is scary.” Each mistake includes a fix you can apply without rewriting your entire codebase.
Mistake 1 — Using local state for shared infrastructure
Local state breaks collaboration and increases the chance of drift and accidental overwrites.
- Fix: remote backend + locking + access control.
- Extra: keep a documented recovery procedure for stuck locks.
Mistake 2 — One state to rule them all
A single giant state makes every change high-risk, slow, and hard to review.
- Fix: split into stacks (foundation/network/platform/apps).
- Extra: give each stack clear ownership and a separate CI job.
Mistake 3 — Treating modules as dumping grounds
Huge modules with dozens of toggles become impossible to change safely.
- Fix: design module contracts (few inputs, meaningful outputs).
- Extra: prefer composition: smaller modules + stacks that wire them together.
Mistake 4 — Unpinned versions (Terraform, providers, modules)
Upgrades are good—surprise upgrades are not. Drift sneaks in through unplanned changes.
- Fix: pin versions and upgrade intentionally with release notes.
- Extra: keep upgrade PRs small and isolated.
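Terraform and provider pins usually live in a `terraform` block at the root of each stack. A sketch:

```hcl
terraform {
  required_version = "~> 1.7.0" # allow patch releases only

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # assumed constraint; bump via explicit PRs
    }
  }
}
```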
Mistake 5 — Relying on implicit ordering
Terraform is declarative. If ordering matters, make dependencies explicit.
- Fix: use references (and `depends_on` only when truly needed).
- Extra: avoid "just add `depends_on` everywhere" as a substitute for design.
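A reference creates the dependency and passes data at the same time, which is almost always better than a bare `depends_on`. A sketch with hypothetical resources:

```hcl
resource "aws_s3_bucket" "logs" {
  bucket = "acme-app-logs" # assumed name
}

# Implicit dependency: the reference both orders creation
# correctly and carries the real value. No depends_on needed.
resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id
  versioning_configuration {
    status = "Enabled"
  }
}
```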
Mistake 6 — Refactors that change addresses without state moves
Renaming a resource or switching to `for_each` can look like "delete + recreate".
- Fix: refactor in steps and move state deliberately.
- Extra: verify plans show moves (not replacements) for critical resources.
Mistake 7 — The “plan looks fine” fallacy (drift + noise)
If engineers stop reading plans because they’re always huge, you’ve built a system where failures are inevitable. Plan noise comes from monolith states, inconsistent naming, broad changes, and uncontrolled inputs.
- Split stacks to reduce diff size
- Keep modules stable and versioned
- Reduce cross-stack coupling (only share essential outputs)
- Track drift and investigate unexpected diffs early
“It’s probably fine, just apply.” If the plan is not understandable, fix the structure first. Terraform is powerful, but it’s not a substitute for change management.
FAQ
Should I use Terraform workspaces for environments?
Workspaces can work, but they’re easy to misuse. They share the same configuration and differ only by state. For many teams, separate folders/stacks per environment (with separate state keys) is clearer and reduces accidental cross-env applies. If you do use workspaces, enforce them in CI and never “guess” which workspace you’re in.
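One hedged way to enforce workspace discipline in-config (Terraform 1.4+) is a precondition that fails the plan when the selected workspace is wrong; names here are hypothetical:

```hcl
# Guard resource: planning this stack in the wrong workspace
# fails fast instead of silently targeting the wrong state.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = terraform.workspace == "prod"
      error_message = "Apply this stack from the 'prod' workspace."
    }
  }
}
```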
What’s the best way to structure Terraform state for a growing cloud?
Favor multiple small states (“stacks”) over one giant state. A common structure is foundation, networking, platform, and per-service app stacks. Each state should have a clear owner, a clear apply process, and a limited blast radius.
How do I share outputs between stacks without creating tight coupling?
Share only stable, foundational outputs (subnet IDs, cluster endpoints, core DNS zones) and keep them versioned and documented. Avoid sharing internals (resource names, full policy docs) unless you truly want consumers to depend on them. If many stacks need the same data, that’s a hint it belongs in a foundation/platform layer.
Why does Terraform want to replace a resource after a refactor?
Terraform tracks identity by resource address (module path + resource name + keys). If that address changes, Terraform may treat it as a new resource. The fix is to refactor in steps and move state so Terraform understands it’s the same underlying cloud object.
Is it okay to run Terraform apply from a developer laptop?
For low-stakes dev stacks: sometimes. For shared staging/prod: it’s risky. CI-based applies give you consistent versions, consistent environment variables, audit trails, and approval gates. If laptops are involved, set strict rules: pinned versions, remote state, and mandatory plan review.
How do I avoid the “one big plan” trap in a mono-repo?
A mono-repo is fine; the trap is one root module/state controlling everything. Keep separate stacks inside the repo (separate backends/state keys), and run CI per stack based on changed paths. You get the code-sharing benefits of a mono-repo without the blast radius of a monolith state.
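In GitHub Actions, "run CI per stack based on changed paths" can be expressed with a `paths` filter; a sketch assuming the hypothetical directory layout above:

```yaml
# Per-stack workflow trigger: this workflow only runs when the
# networking stack (or a module it uses) actually changes.
on:
  pull_request:
    paths:
      - "live/prod/networking/**"
      - "modules/network/**"
```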
Cheatsheet
A scan-fast checklist to avoid the most common Terraform mistakes (state, modules, and big plans).
State & safety
- Remote backend configured (no shared local state)
- Locking enabled and reliable
- State access is least-privilege
- State is encrypted + audited
- Separate state per stack/environment
Stacks (avoid “one big plan”)
- Foundation/network/platform/app stacks separated
- Ownership is clear (who applies what)
- Cross-stack dependencies are minimal and intentional
- Plans are small enough to review in minutes
- Locks don’t block unrelated teams
Modules (design as contracts)
- Small, intentional variable surface
- Outputs match consumer needs (not internals)
- Versioned modules with upgrade notes
- Defaults are safe and documented
- No circular module dependencies
Workflow (plan/apply)
- Same Terraform version in CI and dev
- `fmt` + `validate` are automated
- Plan is generated in CI and reviewed
- Apply uses the saved plan (no surprise diffs)
- Prod applies require approval
First fix state (remote + locking), then split stacks, then improve modules. That order reduces risk fastest and makes every later improvement easier.
Wrap-up
Most Terraform pain isn’t “Terraform being hard.” It’s predictable consequences of three design choices: fragile state handling, unclear module boundaries, and the temptation to run one giant plan for everything.
Your next 60 minutes
- Confirm state is remote and locked
- Identify your biggest state blast radius (what does one apply touch?)
- Pick one stack split to implement next (often “networking” vs “apps”)
- Make plan review repeatable (CI plan + saved plan apply)
Once you build these habits, Terraform becomes what it should be: a reliable tool for controlled change. And your future self won’t fear the plan output.
If you’re improving a real cloud setup, pair this article with cost, networking, and policy guardrails. Those systems become dramatically easier once your Terraform structure is sane.