Test Your Infrastructure: Policy-as-Code and Guardrails

Prevent misconfigs before they hit production.

Reading time: ~8–12 min
Level: All levels

“Infra tests” sound fancy until the day a tiny Terraform change opens a database to the internet or disables encryption because a variable default was wrong. Policy-as-code and guardrails turn those incidents into a failed PR instead of a production outage. This guide shows how to test your infrastructure like you test application code: fast feedback, clear rules, and automated enforcement.


Quickstart

These are the highest-impact guardrails you can add in a day. The goal is not to “boil the ocean” — it’s to block the most expensive misconfigs and make safe changes easy.

1) Pick 6–10 “never again” rules

Start with a short policy set that prevents your most common (and most damaging) mistakes. Keep it opinionated. You can always expand later once teams trust the workflow.

  • No public ingress on sensitive ports (SSH/RDP/DB)
  • Encryption required for data at rest (storage, DB snapshots)
  • No plaintext secrets in IaC (variables, locals, outputs)
  • Mandatory tags/labels (owner, env, cost center)
  • Approved regions only (if applicable)
  • Production changes must come from CI, not laptops

2) Add a CI gate: plan → policy test → merge

Your best “shift-left” moment is the pull request. Generate a plan, convert it to JSON, run policies against it, and fail the build if a rule is violated.

  • Run fmt + validate first
  • Create a plan artifact for review
  • Run policy tests on the plan (deny unsafe diffs)
  • Block merge if policies fail

3) Create a clean exception process

Guardrails only stick if developers can ship. Define how to request an exception, how long it lasts, and who approves. Avoid “comment out the policy” as your escape hatch.

  • Exceptions are time-bound (expiration date)
  • Exceptions are reviewed (security/platform approval)
  • Exceptions are documented (reason + ticket link)
  • Exceptions are auditable (in repo)

4) Enforce at multiple layers (not just CI)

CI is great, but not perfect. Add one additional enforcement point (cloud org policy, Kubernetes admission, or runtime checks) to reduce bypass risk and catch drift.

  • CI gate for PRs (fast feedback)
  • Deployment-time guardrail (admission/policy engine)
  • Runtime monitoring for drift and out-of-band changes

What to do first if you’re overwhelmed

Start with one critical rule (for example: “no public database ingress”) and enforce it in CI. Once the workflow is proven and non-annoying, add rules in small batches.

Overview

Infrastructure-as-Code gives you repeatability. But it also makes it easy to repeat mistakes at scale. A single wrong variable, a permissive security group, or an “it’s just staging” shortcut can turn into a production incident once that pattern is copied everywhere.

Testing your infrastructure is about catching those mistakes early, before they become real resources. The techniques in this post apply whether you use Terraform, Kubernetes manifests, Helm charts, or cloud-native templates: the core idea is the same — treat infrastructure changes like code changes with automated checks.

What you’ll build

  • A layered guardrail strategy (pre-commit → CI → deploy-time → runtime)
  • A policy-as-code “deny list” for unsafe infra diffs
  • A repeatable workflow that fails PRs, not production
  • A practical exceptions model so teams can still ship

Why policy-as-code beats wiki rules

Docs are necessary, but they don’t execute. Policy-as-code is executable documentation: it runs on every change, stays versioned with your repo, and provides consistent feedback.

  • Automated enforcement (no “did you remember?”)
  • Version control (why a rule changed and when)
  • Reviewable diffs (policies evolve like code)
  • Repeatable across teams and environments

What guardrails are (and aren’t)

Guardrails are constraints that keep you inside safe boundaries. They’re not a replacement for architecture reviews or for thinking.

  • Guardrail: “No public DB ingress”
  • Guardrail: “Encryption required on storage”
  • Not a guardrail: “Use service X for everything”
  • Not a guardrail: “Security will review every PR manually”

The “shift-left” promise, made real

The best policy systems do one thing really well: they turn risky infra changes into actionable, developer-friendly errors while the change is still cheap to fix.

Core concepts

To “test infrastructure,” you need a few shared definitions. These help you choose tools and design policies that survive real-world change (new services, new providers, new teams).

Policy-as-code

Policy-as-code is the practice of expressing rules (security, compliance, cost, reliability, naming, tagging) as code that can be executed automatically. Policies live in version control, are reviewed via pull requests, and run in CI/CD or at deploy time.

Guardrails

Guardrails are the enforcement mechanism: they make sure policy is applied consistently. Some guardrails are “hard” (deny the change), others are “soft” (warn, create a ticket, require approval). Mature organizations typically use both, depending on risk.
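
One way to picture the hard/soft split is a small gate that reports every finding but only fails the build on hard ones. This is a minimal sketch in Python; the `Finding` type, the rule names, and the resource addresses are made up for illustration and are not part of any tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str      # hypothetical rule identifier
    address: str   # resource address taken from the plan
    severity: str  # "deny" (hard guardrail) or "warn" (soft guardrail)


def gate(findings: list[Finding]) -> int:
    """Print every finding; return exit code 1 only if a hard guardrail fired."""
    for f in findings:
        label = "DENY" if f.severity == "deny" else "WARN"
        print(f"[{label}] {f.rule}: {f.address}")
    return 1 if any(f.severity == "deny" for f in findings) else 0


findings = [
    Finding("required-tags", "aws_s3_bucket.logs", "warn"),
    Finding("no-public-db-ingress", "aws_security_group_rule.db", "deny"),
]
exit_code = gate(findings)
```

Keeping severity as data on each rule, rather than in separate pipelines, makes it cheap to promote a rule from soft to hard later.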

Shift-left vs. fail-safe

Layer       | Where it runs           | What it’s best for                  | Tradeoff
------------|-------------------------|-------------------------------------|-----------------------------
Pre-commit  | Developer machine       | Fast linting, basic checks          | Can be bypassed
CI gate     | Pull requests           | Blocking risky diffs before merge   | Needs good error messages
Deploy-time | Admission/policy engine | Enforce rules even if CI is bypassed | Can slow deployments
Runtime     | Cloud/K8s monitoring    | Detect drift/out-of-band changes    | Usually alerts after the fact

What to test: intent, not implementation details

The most durable policies test intent (“no public DB access”) rather than fragile implementation details (“this specific resource name must exist”). Intent-based policies survive refactors and module changes.
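
As an illustration of an intent-based check, the sketch below looks only at whether each planned resource ends up with ownership tags, not at how the resource is named or which module produced it. It assumes the Terraform plan JSON layout (`resource_changes[].change.after`); the required tag keys and the sample plan are assumptions chosen for the example.

```python
REQUIRED_TAGS = {"owner", "environment"}  # assumed tag keys; adjust to your standard


def missing_tags(plan: dict) -> list[str]:
    """Return one message per planned resource that lacks a required tag."""
    problems = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after")
        if after is None or "tags" not in after:
            continue  # resource is being deleted, or its type has no tags attribute
        tags = after.get("tags") or {}
        missing = sorted(REQUIRED_TAGS - set(tags))
        if missing:
            problems.append(f"{rc['address']}: missing tags {', '.join(missing)}")
    return problems


# Tiny hand-written plan fragment for demonstration.
plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["create"], "after": {"tags": {"owner": "data-team"}}}},
        {"address": "aws_s3_bucket.old",
         "change": {"actions": ["delete"], "after": None}},
    ]
}
problems = missing_tags(plan)
```

Because the check never mentions a module or a naming scheme, it keeps working when teams refactor how the bucket is created.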

Good policy examples (intent-based)

  • Ingress on port 5432 must not be open to the internet
  • Public buckets must have explicit justification + approval
  • All resources require owner and environment tags
  • Production changes require CI-based apply

Brittle policy examples (implementation-based)

  • Resource name must start with “prod-” (breaks refactors)
  • Must use module X (locks teams into one approach)
  • Every VPC must have exactly N subnets (not always true)
  • All policies must block 100% of violations (no runway for adoption)

A policy that developers can’t understand is a policy that will be bypassed

The fastest way to kill guardrails is to ship error messages that don’t explain what to do next. Treat policy output like compiler errors: precise, actionable, and linked to the rule’s purpose.
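
A compiler-style message can be as simple as "where, what, which rule, how to fix". The helper below is a hypothetical sketch of that shape, not the output format of any particular policy engine; the rule id and resource address are invented.

```python
def format_violation(rule_id: str, address: str, what: str, fix: str, docs: str = "") -> str:
    """Build a short, actionable policy error: location first, remedy second."""
    lines = [
        f"{address}: {what} [{rule_id}]",
        f"  fix: {fix}",
    ]
    if docs:
        lines.append(f"  docs: {docs}")
    return "\n".join(lines)


message = format_violation(
    rule_id="no-public-ssh",
    address="aws_security_group_rule.bastion_ssh",
    what="SSH (port 22) is open to 0.0.0.0/0",
    fix="restrict cidr_blocks to a known range, or use a bastion/SSM instead",
)
print(message)
```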

Step-by-step

This walkthrough builds a concrete workflow: generate a Terraform plan, run policy tests against the plan, and block merges when rules are violated. The same pattern works for other IaC inputs (Kubernetes manifests, Helm output, etc.) — you just swap the “thing to evaluate.”

Step 1 — Decide what you’re enforcing (and at which layer)

Don’t start by choosing a tool. Start by choosing rules and enforcement points. Your first goal is coverage of high-risk mistakes, not a perfect policy platform.

A simple policy rollout plan

  • Week 1: Add CI warnings (non-blocking) for obvious issues
  • Week 2: Convert the top 3 rules to blocking (deny)
  • Week 3: Add an exceptions process (time-bound)
  • Week 4: Add deploy-time enforcement for the highest-risk rules

Step 2 — Generate a plan artifact you can test

For Terraform, the most reliable “input” to policy testing is the plan (what Terraform intends to change), not raw HCL. Convert the plan to JSON so a policy engine can evaluate it deterministically.

# 1) Initialize and create a plan file (do not apply)
terraform init -input=false
terraform plan -input=false -out=tfplan

# 2) Convert the plan into machine-readable JSON
terraform show -json tfplan > plan.json

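# 3) Run policy tests against the plan (example: Conftest/OPA).
#    Conftest evaluates the "main" package by default, so name the policy package explicitly.
conftest test plan.json -p policy --namespace terraform.guardrails

Why plan-based testing is powerful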

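A plan includes resolved variable and module values, the action for each resource (create/update/delete), and the before/after attribute values. Values that are only known after apply are flagged as unknown rather than filled in, so a policy should treat a missing value as “cannot verify”, not as “safe”. That still makes policies more accurate than reading raw configuration: you can detect “open ingress will be created” rather than trying to infer it from partial configuration.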
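
To make the plan structure concrete, here is a small sketch that walks `resource_changes` in the plan JSON and lists each real change with its action and any top-level attributes that Terraform reports under `after_unknown` (values known only after apply). The sample plan is hand-written and heavily trimmed; a real `plan.json` carries many more fields.

```python
def summarize(plan: dict) -> list[tuple[str, str, list[str]]]:
    """Return (address, actions, unknown top-level attributes) for each real change."""
    rows = []
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        if change["actions"] == ["no-op"]:
            continue  # nothing will change for this resource
        unknown = sorted(
            key for key, value in (change.get("after_unknown") or {}).items()
            if value is True
        )
        rows.append((rc["address"], "/".join(change["actions"]), unknown))
    return rows


# Trimmed, hand-written sample for demonstration.
plan = {
    "resource_changes": [
        {"address": "aws_db_instance.main",
         "change": {"actions": ["create"],
                    "after": {"storage_encrypted": True},
                    "after_unknown": {"arn": True, "endpoint": True}}},
        {"address": "aws_vpc.core",
         "change": {"actions": ["no-op"], "after": {}, "after_unknown": {}}},
        {"address": "aws_instance.web",
         "change": {"actions": ["delete", "create"],
                    "after": {}, "after_unknown": {"id": True}}},
    ]
}
rows = summarize(plan)
```

A replacement shows up as the two-element action list `["delete", "create"]`, which is why the sketch joins actions instead of assuming a single value.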

Step 3 — Write one policy rule that blocks a real incident

The best starter rule is something your team has tripped over before. A classic example: “Don’t allow public ingress to sensitive ports.” Below is a tiny policy sketch that denies changes introducing public ingress on port 22 (SSH) from 0.0.0.0/0. It only inspects standalone aws_security_group_rule resources; ingress blocks declared inline on a security group need their own check. Adapt the idea to your cloud/provider and your ports.

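package terraform.guardrails

# Opt in to the Rego v1 keywords (`contains`, `if`); this needs a toolchain
# built on OPA 0.59 or newer.
import rego.v1

# Deny when an ingress rule allows 0.0.0.0/0 to reach port 22.
# This is intentionally conservative and meant as a starting point.
# No `default deny` is declared: an empty set already means "no violations",
# and a default would conflict with the set-building rule below.
deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group_rule"
  rc.change.after.type == "ingress"
  rc.change.after.cidr_blocks[_] == "0.0.0.0/0"
  rc.change.after.from_port <= 22
  rc.change.after.to_port >= 22
  lower(rc.change.after.protocol) == "tcp"
  msg := sprintf("Public SSH ingress is not allowed: %s", [rc.address])
}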

Make the rule developer-friendly

  • Error includes what was wrong (public SSH ingress)
  • Error includes where (resource address)
  • Rule is easy to reason about (no hidden magic)
  • Rule has a clear “how to fix” (restrict CIDR / use bastion / SSM)

Improve it over time

  • Handle IPv6 (::/0) if relevant
  • Expand port set (RDP, DB ports) based on your risk model
  • Account for approved jump hosts / corporate IP ranges
  • Add a controlled exception mechanism
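
The first two improvements (IPv6 and a wider port set) can be sketched outside Rego to show the logic. The Python below is an illustration under stated assumptions, not a drop-in policy: it assumes plan JSON for standalone `aws_security_group_rule` resources, the port list is an example, and approved source ranges and exceptions are deliberately left out.

```python
import ipaddress

SENSITIVE_PORTS = {22: "SSH", 3389: "RDP", 5432: "PostgreSQL"}  # example set


def is_world_open(cidr: str) -> bool:
    """True for any /0 network, which covers both 0.0.0.0/0 and ::/0."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def exposed_ports(after: dict) -> list[int]:
    """Sensitive ports covered by the rule's protocol and port range."""
    protocol = str(after.get("protocol", "")).lower()
    if protocol in ("-1", "all"):
        return sorted(SENSITIVE_PORTS)  # all protocols: port range is ignored
    if protocol not in ("tcp", "6"):
        return []
    low, high = after.get("from_port"), after.get("to_port")
    if not isinstance(low, int) or not isinstance(high, int):
        return []
    return sorted(p for p in SENSITIVE_PORTS if low <= p <= high)


def public_ingress_violations(plan: dict) -> list[str]:
    messages = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after")
        if rc.get("type") != "aws_security_group_rule" or not after:
            continue
        if after.get("type") != "ingress":
            continue
        cidrs = (after.get("cidr_blocks") or []) + (after.get("ipv6_cidr_blocks") or [])
        if not any(is_world_open(c) for c in cidrs):
            continue
        for port in exposed_ports(after):
            messages.append(
                f"{rc['address']}: public ingress to {SENSITIVE_PORTS[port]} (port {port})"
            )
    return messages
```

Checking the prefix length instead of comparing strings means the same function covers IPv4 and IPv6 without a second rule.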

Step 4 — Wire policies into CI so unsafe changes can’t merge

The workflow is: run formatting/validation, create a plan, convert to JSON, test policies, and fail if denied. The key guardrail: treat policy test failures like unit test failures — no merge until fixed or approved via exception.

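name: iac-guardrails

on:
  pull_request:

jobs:
  terraform-policy:
    runs-on: ubuntu-latest
    env:
      # Example pin; use a Conftest release you have verified.
      CONFTEST_VERSION: "0.56.0"
    steps:
      - uses: actions/checkout@v4

      # Cloud/backend credentials needed by `terraform init` and `plan` are omitted here.
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.5
          # Disable the wrapper so `terraform show -json` writes clean JSON to stdout.
          terraform_wrapper: false

      # The hosted runner does not ship Conftest, so install the pinned release.
      - name: Install Conftest
        run: |
          curl -fsSL "https://github.com/open-policy-agent/conftest/releases/download/v${CONFTEST_VERSION}/conftest_${CONFTEST_VERSION}_Linux_x86_64.tar.gz" \
            | sudo tar -xz -C /usr/local/bin conftest

      - name: Terraform fmt + validate
        run: |
          terraform fmt -check -recursive
          terraform init -input=false
          terraform validate

      - name: Create plan and convert to JSON
        run: |
          terraform plan -input=false -out=tfplan
          terraform show -json tfplan > plan.json

      - name: Policy tests (fail PR if denied)
        run: |
          conftest test plan.json -p policy --namespace terraform.guardrails

Don’t silently “warn forever”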

Non-blocking warnings are useful for adoption, but they shouldn’t be the final state for high-risk rules. Pick a date to flip critical policies from warn to deny, and communicate it early.
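
One lightweight way to make the flip date explicit is to store it next to the rule and let the date decide the mode. The sketch below is hypothetical: the rule ids and dates are placeholders, and a real setup would load the schedule from a reviewed file in the repo.

```python
from datetime import date

# Hypothetical schedule: rule id -> first day the rule blocks merges.
ENFORCE_FROM = {
    "no-public-db-ingress": date(2025, 3, 1),
    "storage-encryption-required": date(2025, 4, 1),
}


def enforcement_mode(rule_id: str, today: date) -> str:
    """Return "deny" once the rule's enforcement date has arrived, else "warn"."""
    deadline = ENFORCE_FROM.get(rule_id)
    if deadline is not None and today >= deadline:
        return "deny"
    return "warn"


def countdown(rule_id: str, today: date) -> str:
    """Text to append to a warning so the upcoming change is visible in every PR."""
    deadline = ENFORCE_FROM.get(rule_id)
    if deadline is None or today >= deadline:
        return ""
    return f"(becomes blocking in {(deadline - today).days} days)"
```

Rules without a scheduled date stay in warn mode, so adding a new rule never blocks anyone by accident.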

Step 5 — Define an exception process that doesn’t rot

Exceptions are normal. “Permanent exceptions” are how guardrails die. A good exception process is explicit, time-bound, and auditable — and ideally makes the risky decision visible.

A lightweight exception checklist

  • Scope: Which resource(s) and which rule?
  • Reason: Why is this needed right now?
  • Mitigation: What reduces risk (IP allowlist, temporary access, monitoring)?
  • Expiry: When does it end, and who owns the follow-up?
  • Approval: Platform/Security sign-off for high-risk exceptions
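
The checklist maps naturally onto a small record kept in the repo. The following is a sketch with invented field names, ticket ids, and approver values; it shows the two checks that keep exceptions from rotting: a waiver only applies before its expiry, and expired waivers can be listed for cleanup.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Waiver:
    rule: str         # which rule is waived
    address: str      # which resource it applies to
    reason: str
    mitigation: str
    ticket: str       # placeholder ticket reference
    approved_by: str
    expires: date     # last day the waiver is valid


def active_waiver(rule: str, address: str, waivers: list[Waiver], today: date) -> Optional[Waiver]:
    """Return the matching, unexpired waiver, or None if the violation must stand."""
    for w in waivers:
        if w.rule == rule and w.address == address and today <= w.expires:
            return w
    return None


def expired_waivers(waivers: list[Waiver], today: date) -> list[Waiver]:
    """Waivers past their expiry date, for a periodic cleanup review."""
    return [w for w in waivers if today > w.expires]


waivers = [
    Waiver("no-public-ssh", "aws_security_group_rule.vendor_ssh",
           reason="vendor migration window", mitigation="source IP allowlist",
           ticket="TICKET-PLACEHOLDER", approved_by="platform-team",
           expires=date(2025, 6, 30)),
]
```

Matching on both rule and resource address keeps a waiver from silently covering new resources that violate the same rule.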

Step 6 — Add a second enforcement layer for “must not happen” rules

CI reduces risk dramatically, but it’s not absolute. For the highest-risk categories (public exposure, privileged access, unencrypted storage), add deploy-time enforcement where possible. In Kubernetes, this is often admission control. At the cloud organization level, it might be organization policies, service control policies, or similar centralized constraints.

Good candidates for deploy-time deny

  • Public access to sensitive services
  • Disabling encryption controls
  • Privileged container workloads
  • Production resources in unapproved regions

Better as CI-only (usually)

  • Naming conventions
  • Tag completeness
  • Preferred module usage
  • Style and linting rules

Measure success

Track which policies fail most and why. If a rule fails constantly, it might be unclear, too strict, or missing a supported workflow. Guardrails should guide good behavior, not force workarounds.
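
If each CI run records its policy results, ranking the noisiest rules takes a few lines. This sketch assumes a hypothetical record shape (`rule` and `result` keys); how the records are collected is up to your pipeline.

```python
from collections import Counter


def top_failures(records: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Rank rules by how often they failed across recorded CI runs."""
    counts = Counter(r["rule"] for r in records if r["result"] == "fail")
    return counts.most_common(n)


# Invented sample records for demonstration.
records = [
    {"rule": "required-tags", "result": "fail"},
    {"rule": "required-tags", "result": "fail"},
    {"rule": "no-public-ssh", "result": "fail"},
    {"rule": "required-tags", "result": "fail"},
    {"rule": "no-public-ssh", "result": "pass"},
]
ranking = top_failures(records)
```

A rule at the top of this list every month is a prompt to check its error message and the supported way to comply before assuming teams are careless.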

Common mistakes

Guardrails fail for predictable reasons. Avoid these pitfalls and your policy-as-code program will feel like a productivity boost instead of a bureaucracy generator.

Mistake 1 — Starting with 200 rules

Large policy sets overwhelm teams and create endless false positives.

  • Fix: start with a short “top risks” set, then expand slowly.
  • Tip: make the first rules extremely clear and hard to argue with.

Mistake 2 — Policies that are impossible to understand

If developers can’t tell what failed and how to fix it, they’ll bypass the system.

  • Fix: write policy errors like compiler errors: what, where, how to fix.
  • Tip: link policies to internal docs or examples in the repo.

Mistake 3 — No exception path (or unlimited exceptions)

Having no exception path blocks delivery; unlimited exceptions make policies meaningless.

  • Fix: time-bound, reviewed exceptions with clear mitigations.
  • Tip: treat exceptions as debt — they need an owner and a due date.

Mistake 4 — Testing raw IaC instead of planned changes

Raw config can miss computed values and actions; plans show intent and diffs.

  • Fix: run policies against plan output (or rendered manifests).
  • Tip: store plan artifacts for review and audit.

Mistake 5 — “Warn-only forever” for critical risks

Warnings are easy to ignore, especially when they’re noisy.

  • Fix: set a date to enforce (deny) high-risk policies.
  • Tip: phase rollout: warn → deny with clear communication.

Mistake 6 — Tool/version drift in CI

If different environments run different versions, policy results become inconsistent.

  • Fix: pin versions (Terraform, policy engine, scanners).
  • Tip: upgrade versions intentionally via a dedicated PR.
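
A pin is only useful if something notices when it is not honored. As a rough sketch, a pre-flight check can compare the installed Terraform version with the pinned one. `terraform version -json` is a real command that reports a `terraform_version` field; the `PINNED` table and function names are assumptions for this example.

```python
import json
import subprocess
from typing import Optional

PINNED = {"terraform": "1.7.5"}  # assumed pin; keep in sync with your CI config


def installed_terraform_version() -> str:
    """Ask the local terraform binary for its version (requires terraform on PATH)."""
    out = subprocess.run(
        ["terraform", "version", "-json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["terraform_version"]


def pin_mismatch(tool: str, actual: str) -> Optional[str]:
    """Return an error message if the actual version differs from the pin, else None."""
    expected = PINNED.get(tool)
    if expected is None:
        return f"{tool}: no pinned version recorded"
    if actual != expected:
        return f"{tool}: expected {expected}, found {actual}"
    return None
```

In CI this would be called as `pin_mismatch("terraform", installed_terraform_version())`, failing the job when a message comes back.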

Mistake 7 — Guardrails without ownership

Policies are living code. If nobody owns them, they decay: rules get outdated, exceptions pile up, and teams stop trusting the system.

  • Assign a policy owner (platform/security collaboration)
  • Review policy failures monthly (false positives vs real risk)
  • Retire policies that no longer match architecture
  • Keep examples and remediation steps up to date

FAQ

What is policy-as-code in simple terms?

Policy-as-code means writing infrastructure rules as executable code and running those rules automatically on every change. Instead of relying on a checklist or a wiki page, your CI/CD pipeline enforces requirements consistently and produces clear failures when a change violates policy.

Where should guardrails run: pre-commit, CI, or deploy-time?

Ideally, in multiple places. Pre-commit is fastest but easiest to bypass. CI is the best default gate for most teams. Deploy-time enforcement is the “seatbelt” for high-risk rules where bypass is unacceptable. Runtime checks catch drift and out-of-band changes.

Should policies block merges, or just warn?

For low-risk rules (style, naming), warnings can be enough. For high-risk rules (public exposure, encryption, privileged access), blocking is usually the right end state. A good rollout starts with warnings for adoption and flips to blocking once teams have a clear path to comply.

What’s the best way to test Terraform with policy-as-code?

Test the plan. Generate a plan file, convert it to JSON, and evaluate it with your policy engine. Plan-based testing aligns policy with what Terraform intends to do, reducing false positives and catching real risky diffs.

How do we handle legitimate exceptions without weakening security?

Use time-bound, reviewed exceptions with explicit mitigations. Store exceptions in version control, require approvals for high-risk categories, and regularly review expired or unused exceptions. If exceptions keep appearing for the same reason, update the standard workflow or the policy itself.

How do we prevent drift if people change infrastructure manually?

First, limit manual changes by granting least-privilege access and providing clear operational paths. Second, add drift detection: scheduled plan checks, cloud audit alerts, or runtime policy monitoring depending on your platform. Drift isn’t just “messy”: it makes plans unreliable, which makes deployments risky.
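
A scheduled plan check can lean on `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 2 when the plan contains changes, and 1 on error. The sketch below wraps that in Python; the function names are illustrative, and it assumes the job runs against an already-merged branch so that any diff points to drift rather than to pending work.

```python
import subprocess


def classify_plan_exit(exit_code: int) -> str:
    """Map the -detailed-exitcode result to a drift status."""
    if exit_code == 0:
        return "in-sync"
    if exit_code == 2:
        return "drift"
    return "error"


def drift_status(workdir: str) -> str:
    """Run a plan in workdir (terraform must be installed and initialized there)."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    return classify_plan_exit(proc.returncode)
```

Treating exit code 1 separately matters: a failed plan says nothing about drift and should alert as a broken check, not as a clean result.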

Cheatsheet

A scan-fast checklist for policy-as-code and infrastructure guardrails that actually work.

Start here (first week)

  • Pick 6–10 high-impact “never again” rules
  • Run fmt + validate in CI
  • Generate plan artifacts in PRs
  • Run policy tests on the plan (warn-only initially)
  • Write actionable policy error messages

Enforcement (weeks 2–4)

  • Flip top 3 high-risk rules to deny (block merge)
  • Add approvals for production IaC changes
  • Pin tool versions (Terraform, scanners, policy engine)
  • Add a time-bound exception workflow
  • Document remediation steps with examples

Rules that pay off quickly

  • No public ingress to sensitive ports
  • Encryption required for storage and databases
  • No plaintext secrets committed
  • Required tags/labels for ownership and cost
  • Production applies only from CI

Keep policies healthy

  • Assign ownership for policies and exceptions
  • Review top failures monthly (noise vs. real risk)
  • Retire policies that don’t match current architecture
  • Prefer intent-based checks over naming-based checks
  • Add a second enforcement layer for “must not happen” rules

If a policy causes workarounds, it needs iteration

Don’t interpret pushback as “developers hate security.” It usually means your policy is missing a supported path. Improve the workflow, refine the rule, or add a safe, auditable exception mechanism.

Wrap-up

Infrastructure breaks in boring ways: a CIDR that’s too wide, a policy flag that got turned off, a secret that ended up in the wrong place. Policy-as-code and guardrails move those failures to where they belong — the pull request — and make safe infrastructure changes routine.

Your next actions

  • Choose 6–10 “never again” rules based on real incidents and near-misses
  • Implement a CI gate: plan → policy test → merge
  • Make policy failures actionable with remediation steps
  • Define time-bound exceptions so teams can ship without bypassing
  • Add a second enforcement layer for your highest-risk rules

The end goal isn’t to add friction. It’s to create confidence: every merge is safer, every deployment is more predictable, and “security checks” become part of normal engineering hygiene.

Quiz

Quick self-check.

1) What’s the main value of policy-as-code for infrastructure changes?
2) Why is plan-based testing often better than testing raw Terraform configuration?
3) Which rollout approach usually works best for new guardrails?
4) What makes an exception process “healthy” for guardrails?