IAM is the control plane for your cloud. When it’s wrong, “one leaked key” turns into admin access, data exfiltration, or a surprise crypto-mining bill. This post breaks down the IAM mistakes that repeatedly show up in breaches and costly incidents—and the practical fixes you can implement without turning your organization into a ticket factory.
Quickstart
If you only have 30–60 minutes, do these in order. They deliver the biggest risk reduction per unit of effort and reduce the most common “million-dollar” failure modes: over-privileged identities and long-lived credentials.
The 6 fastest wins
- Turn on MFA everywhere (human users, especially admins) and lock down break-glass accounts.
- Stop using long-lived access keys for humans; move to SSO + short-lived role sessions.
- Inventory your principals: users, roles, service accounts, CI identities, and third-party integrations.
- Remove “*:*” policies and split admin from day-to-day operator roles.
- Add guardrails (Org policies / SCPs / conditional access) for “never events” like public storage or disabling logs.
- Centralize audit logs (cloud audit + identity provider logs) and alert on key IAM events.
Red flags to search for today
- Policies with "*" actions or "*" resources without strong conditions
- Access keys older than 90 days (or never rotated)
- Roles that can be assumed by any principal (wildcard trust)
- “Administrator” permissions attached to CI/CD, bots, or service accounts
- Permissions to disable logging, change IAM, or modify network boundaries broadly
- Shared accounts and shared credentials (no identity attribution)
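The first two red flags are easy to triage mechanically over exported policy documents. Here's a minimal sketch in plain Python (no cloud SDK; the sample policy is hypothetical) that flags wildcard actions and unconditioned wildcard resources:

```python
import json

def find_red_flags(policy: dict) -> list[str]:
    """Return human-readable red flags for one IAM policy document."""
    flags = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        sid = stmt.get("Sid", "<no Sid>")
        if any(a == "*" or a.endswith(":*") for a in actions):
            flags.append(f"{sid}: wildcard action {actions}")
        if "*" in resources and not stmt.get("Condition"):
            flags.append(f"{sid}: Resource '*' with no Condition")
    return flags

# Hypothetical over-broad policy for illustration
policy = json.loads("""{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "TooBroad", "Effect": "Allow", "Action": "s3:*", "Resource": "*"}
  ]
}""")
for flag in find_red_flags(policy):
    print(flag)
```

In practice you'd feed this from your provider's policy-export API and treat each flag as a review item, not an automatic failure.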
A strong IAM posture is not only about blocking attackers. It’s also about traceability: every action should map back to a specific identity, a specific session, and a specific justification (role/policy).
Quick decision table
| If you’re doing… | Preferred identity method | Why it’s safer |
|---|---|---|
| Human console/CLI access | SSO + short-lived role sessions | Central control, MFA, and fast revocation |
| CI/CD deployments | Workload identity (OIDC) + role assumption | No stored secrets, scoped to repository/workflow |
| Service-to-service calls | Service identity + role-based permissions | Least privilege with clear ownership and rotation |
| Third-party integrations | Dedicated role + constrained trust policy | Limits blast radius and simplifies offboarding |
Overview
Most “cloud breaches” are not about zero-days. They’re about identity and access: an attacker finds a credential, a token, or a misconfigured trust relationship, then uses it to move laterally and escalate privileges. That’s why IAM mistakes can cost companies millions—sometimes from direct fraud and data loss, sometimes from incident response, regulatory obligations, downtime, and reputational damage.
What this post covers
- The IAM mistakes that show up repeatedly (over-permissioning, trust policy bugs, key sprawl)
- How to fix them with concrete patterns (roles, conditions, guardrails, and auditing)
- How to design IAM so security doesn’t block delivery (self-service with boundaries)
- A scan-fast cheatsheet for ongoing reviews
Who this is for
- Developers shipping cloud workloads and CI/CD pipelines
- Platform/DevOps teams maintaining “shared” infrastructure
- Security teams hardening cloud environments without breaking teams
- Anyone doing post-incident cleanup (and wanting it to stick)
AWS IAM, Azure Entra ID/RBAC, and Google Cloud IAM all implement the same fundamental ideas: principals, roles/permissions, resource scope, and conditions. The patterns below are transferable even if your exact policy syntax changes.
Core concepts
IAM is easier when you think in a simple sentence: “Which principal can do what action on which resource under what conditions?” Every policy system is a variation of this.
Identity vs credentials vs permissions
Identity (principal)
A “who”: user, group, role, service account, workload identity, or external principal (IdP / partner).
- Humans should be unique identities (no sharing)
- Workloads should have dedicated service identities
- Third parties should get isolated identities/roles
Credentials (how you prove “who”)
Passwords, MFA factors, access keys, certificates, OIDC tokens, session tokens.
- Prefer short-lived tokens over long-lived keys
- MFA is mandatory for privileged actions
- Rotate/revoke credentials as an operational routine
Roles and session-based access (the “golden path”)
In mature cloud setups, humans and workloads rarely hold long-lived secrets. Instead they assume roles and receive a time-limited session that carries permissions. This dramatically reduces the blast radius of a leaked credential and makes offboarding simpler.
Two policy planes you must understand
| Plane | Controls | Common bug | What to verify |
|---|---|---|---|
| Permission policy | What actions are allowed/denied | Overbroad actions/resources (wildcards) | Scope, conditions, and explicit denies/guardrails |
| Trust policy (assume role) | Who can become the role | Wildcard principals / weak conditions | Allowed principals, audience, repo/workflow constraints, MFA requirements |
Least privilege (and why teams struggle with it)
Least privilege means granting the minimum permissions needed to perform a task. Teams often fail here because it’s tempting to “just attach admin” to get unblocked. The trick is to create developer-friendly roles that are still bounded: narrow resource scopes, time-limited elevation, and safe self-service.
- Start broad enough to function, but still bounded by resource prefixes/tags.
- Use audit logs to discover the true set of used actions.
- Iterate: reduce permissions, add conditions, and keep a break-glass path.
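The "iterate" step can be mechanized as a set difference: compare the action patterns a policy grants against the actions actually observed in audit logs over a review window. A rough sketch (the action lists are illustrative; real usage data would come from CloudTrail or your provider's equivalent):

```python
from fnmatch import fnmatchcase

def shrink_candidates(allowed_patterns: list[str], observed: set[str]) -> list[str]:
    """Allowed action patterns never exercised during the audit window.

    Patterns may contain wildcards (e.g. "s3:Put*"), matched shell-style
    against concrete action names seen in the logs.
    """
    return [p for p in allowed_patterns
            if not any(fnmatchcase(a, p) for a in observed)]

allowed = ["s3:GetObject", "s3:Put*", "s3:DeleteObject"]
observed = {"s3:GetObject", "s3:PutObject"}  # e.g. mined from 90 days of logs

for action in shrink_candidates(allowed, observed):
    print(f"candidate for removal: {action}")
```

Treat the output as candidates, not certainties: rarely used actions (disaster recovery, year-end jobs) may legitimately be absent from a 90-day window.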
Guardrails: “deny the truly dangerous stuff”
Guardrails (Org policies / SCPs / conditional access / permission boundaries) enforce global rules even if a team accidentally grants too much. This is how you prevent “never events” like disabling audit logs or making sensitive storage public.
Over time, somebody will attach an overly broad allow. Guardrails are your safety net: a small set of explicit denies for high-risk actions (turning off logs, changing IAM broadly, or opening public access) will save you when policy hygiene slips.
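In AWS terms, one way to express such a guardrail is an explicit-deny statement applied org-wide as a service control policy. This sketch builds an SCP-style document denying audit-log tampering; the action list is an illustrative subset, not a complete "never events" catalogue:

```python
import json

# Illustrative SCP-style guardrail: deny tampering with audit logging
# org-wide, regardless of what member-account policies allow.
# The action list is a sketch, not an exhaustive set.
guardrail = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAuditLogTampering",
            "Effect": "Deny",
            "Action": [
                "cloudtrail:StopLogging",
                "cloudtrail:DeleteTrail",
                "cloudtrail:UpdateTrail",
            ],
            "Resource": "*",
        }
    ],
}
print(json.dumps(guardrail, indent=2))
```

Because an explicit deny wins over any allow, this holds even when a team accidentally attaches admin permissions inside a member account.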
Step-by-step
Here’s a practical hardening path you can run as a lightweight “IAM sprint.” It’s designed to be realistic: you’ll improve posture while keeping teams productive. Each step includes what to do, why it works, and what to watch out for.
Step 1 — Inventory principals and entry points
You can’t fix IAM if you don’t know who exists. Start with an inventory you can review monthly.
What to inventory
- Human users (cloud console + IdP)
- Privileged roles (admins, billing, security)
- Workload identities (compute, k8s, serverless)
- CI/CD identities and third-party integrations
- Long-lived credentials (keys, certificates)
What “good” looks like
- Humans use SSO + MFA; no shared accounts
- Workloads use role/session identity (no embedded secrets)
- Each integration has a dedicated role and owner
- Every principal has tags/labels: owner, purpose, environment
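The "tags on every principal" requirement is easy to check mechanically once you have an inventory export. A sketch (the record shape and required tag keys are assumptions for illustration):

```python
REQUIRED_TAGS = {"owner", "purpose", "environment"}

def missing_tags(principals: list[dict]) -> dict[str, set[str]]:
    """Map principal name -> required tag keys it is missing."""
    gaps = {}
    for p in principals:
        missing = REQUIRED_TAGS - set(p.get("tags", {}))
        if missing:
            gaps[p["name"]] = missing
    return gaps

# Hypothetical inventory export
principals = [
    {"name": "role/ci-deploy",
     "tags": {"owner": "platform", "purpose": "deploy", "environment": "prod"}},
    {"name": "user/legacy-bot", "tags": {"owner": "unknown"}},
]
for name, gaps in missing_tags(principals).items():
    print(f"{name} is missing tags: {sorted(gaps)}")
```

Run this monthly: principals with gaps are exactly the ones nobody will know how to clean up during an incident.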
Step 2 — Fix login posture and break-glass access
Before you chase fine-grained permissions, close the obvious doors: missing MFA, shared admin users, and unmanaged “root”/tenant owner access.
Baseline requirements (non-negotiable)
- MFA required for all human accounts; stronger factors for admins
- Break-glass accounts exist, but are locked down and monitored (no daily use)
- Central IdP is the source of truth for access; offboarding is one switch
- Admin actions are limited to dedicated admin roles, not day-to-day accounts
Step 3 — Replace long-lived keys with role/session identity
The fastest way to reduce credential-leak risk is to stop minting permanent access keys. Use SSO for humans and workload identity for automation. This makes secrets scanning less stressful because fewer secrets exist.
Long-lived keys leak through Git commits, CI logs, pastebins, laptops, chat screenshots, and vendor tickets. Short-lived sessions don’t eliminate risk, but they drastically shorten the window and simplify rotation/revocation.
Example 1: A minimal policy (least privilege) for a specific bucket prefix
Don’t hand out “s3:* on *” for a service that only reads from one prefix. Scope actions and resources, and add conditions when you can (tags, prefixes, VPC endpoints, source identity).
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadOnlyFromAppPrefix",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetObjectVersion"
],
"Resource": "arn:aws:s3:::my-company-data/app-a/*"
},
{
"Sid": "ListOnlyWithinAppPrefix",
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::my-company-data",
"Condition": {
"StringLike": {
"s3:prefix": [
"app-a/*"
]
}
}
}
]
}
Step 4 — Lock down trust policies (who can assume roles)
Permission policies get a lot of attention, but trust policies are where attackers slip in. A role with great least-privilege permissions is still dangerous if anyone can assume it.
Trust policy checks
- No wildcard principals for assumption
- Constrain federation (audience, issuer, subject)
- Require MFA for privileged role assumption (for humans)
- Use separate roles per environment (dev/stage/prod)
Common “oops” patterns
- Any GitHub repo can assume the deploy role
- Any workload in the cluster can assume a high-priv role
- Third-party role trust is left open after a pilot
- Role chaining without boundaries (easy escalation)
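The "any GitHub repo can assume the deploy role" oops usually comes from a wildcard in the trust condition. String-match trust conditions evaluate like shell-style globs over the token's claims, which this toy check approximates (plain Python; the claim values are made up). It shows why `repo:my-org/*` is a far weaker constraint than a full repo-and-branch pattern:

```python
from fnmatch import fnmatchcase

def sub_allowed(pattern: str, sub_claim: str) -> bool:
    """Approximate a StringLike trust condition on the OIDC 'sub' claim."""
    return fnmatchcase(sub_claim, pattern)

tight = "repo:my-org/my-repo:ref:refs/heads/main"
loose = "repo:my-org/*"

# A workflow running in some other repo in the same org:
attacker = "repo:my-org/forked-sandbox:ref:refs/heads/main"

assert not sub_allowed(tight, attacker)  # exact pattern rejects other repos
assert sub_allowed(loose, attacker)      # wildcard lets any org repo assume
```

The wildcard matches across path and ref segments too, so `repo:my-org/*` admits every repo and every branch in the org.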
Example 2: Terraform sketch for CI/CD using OIDC (no stored access keys)
This pattern avoids long-lived secrets in CI by letting the workflow exchange an OIDC token for a time-limited role session. The critical part is the trust policy conditions that scope assumption to a specific repo/branch/workflow.
resource "aws_iam_role" "github_actions_deploy" {
name = "github-actions-deploy-prod"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = {
Federated = aws_iam_openid_connect_provider.github.arn
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
}
StringLike = {
# Lock to a repo and branch (tighten further to workflow if you can)
"token.actions.githubusercontent.com:sub" = "repo:my-org/my-repo:ref:refs/heads/main"
}
}
}
]
})
}
resource "aws_iam_role_policy" "deploy_permissions" {
role = aws_iam_role.github_actions_deploy.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowDeployToSpecificResources"
Effect = "Allow"
Action = [
"ecs:UpdateService",
"ecs:DescribeServices",
"iam:PassRole"
]
Resource = [
"arn:aws:ecs:us-east-1:123456789012:service/prod/*",
"arn:aws:iam::123456789012:role/prod-ecs-task-role"
]
}
]
})
}
Step 5 — Implement permission boundaries and “just-in-time” elevation
The best way to avoid chaos is to make the secure thing the easy thing. Instead of forcing every team to become IAM experts, provide: pre-approved roles, permission boundaries, and a temporary elevation path for rare admin tasks.
A workable model for most orgs
- Day-to-day roles: scoped per team/service/environment
- Elevation role: time-limited, requires MFA/approval, heavy logging
- Boundaries/guardrails: deny “never events” regardless of team policies
- Ownership metadata: every role/policy has an owner and purpose tag
Step 6 — Kill key sprawl with rotation, detection, and automation
Some keys will still exist (vendor integrations, legacy systems). Make them safe: reduce scope, rotate regularly, and monitor for abnormal use. “Set and forget” is what turns a small leak into a big incident.
Example 3: Find old access keys and disable unused ones (AWS CLI)
This is a simple starting point for cleanup. In production, pair this with approvals and a safe rollback plan (disable first, then delete). Always coordinate with service owners to avoid breaking workloads.
# List access keys and their creation dates for all IAM users
aws iam list-users --query 'Users[].UserName' --output text | tr '\t' '\n' | while read -r user; do
aws iam list-access-keys --user-name "$user" \
--query 'AccessKeyMetadata[].{User:UserName,KeyId:AccessKeyId,Status:Status,Created:CreateDate}' \
--output table
done
# Check last-used date for a specific key (helps find stale keys)
aws iam get-access-key-last-used --access-key-id AKIAEXAMPLEKEYID
# Disable a key (safer first step than deleting)
aws iam update-access-key --user-name some-user --access-key-id AKIAEXAMPLEKEYID --status Inactive
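If you prefer to script the age check rather than eyeball tables, the same logic is a few lines of plain Python over an exported key list. The records, dates, and 90-day threshold below are illustrative; in practice you'd build the list from `list-access-keys` output:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=90)

def stale_keys(keys: list[dict], now: datetime) -> list[str]:
    """Key IDs older than MAX_AGE: candidates to disable first, delete later."""
    return [k["KeyId"] for k in keys if now - k["Created"] > MAX_AGE]

# Hypothetical export of access-key metadata
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
keys = [
    {"KeyId": "AKIAOLDEXAMPLE", "Created": datetime(2023, 1, 15, tzinfo=timezone.utc)},
    {"KeyId": "AKIAFRESHEXAMPLE", "Created": datetime(2024, 5, 20, tzinfo=timezone.utc)},
]
print(stale_keys(keys, now))
```

Cross-reference the result with last-used data before disabling anything: an old key that is still in daily use needs a migration plan, not a surprise outage.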
Step 7 — Log, alert, and rehearse response
Even with good policies, you need detection. The IAM events you alert on should map to “attack progress” steps: new credentials, privilege escalation, persistence, and log tampering.
High-signal alerts
- New access keys created (especially for privileged users)
- Policy attached/updated with broad privileges
- Role trust policy changed
- MFA disabled, password policy weakened
- Audit logging disabled or modified
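These alerts can start life as a simple event-name match over your audit stream, long before you invest in a full detection pipeline. A sketch (event names follow CloudTrail conventions; the mapping is illustrative, not complete):

```python
# Illustrative mapping of high-signal IAM events to alert categories.
# Event names follow CloudTrail conventions; extend for your environment.
HIGH_SIGNAL = {
    "CreateAccessKey": "new-credentials",
    "AttachUserPolicy": "privilege-change",
    "AttachRolePolicy": "privilege-change",
    "UpdateAssumeRolePolicy": "trust-change",
    "DeactivateMFADevice": "mfa-weakened",
    "StopLogging": "log-tampering",
    "DeleteTrail": "log-tampering",
}

def classify(event_name):
    """Return an alert category for a high-signal event, else None."""
    return HIGH_SIGNAL.get(event_name)

for event in ["CreateAccessKey", "DescribeInstances", "StopLogging"]:
    print(event, "->", classify(event))
```

Even this crude match covers the attacker workflow the section describes: new credentials, privilege escalation, trust changes, and log tampering.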
Incident drill checklist
- Can you revoke sessions quickly (SSO/IdP kill switch)?
- Do you know where logs are centralized and immutable?
- Do you have “break-glass” access that’s monitored?
- Can you rotate keys without downtime (or with an acceptable plan)?
Teams bypass controls when the “right way” is slow. Provide standard roles, templates, and automated review checks (IaC scanning, policy linting), and you’ll reduce both risk and friction.
Common mistakes
These are the IAM mistakes that repeatedly show up in real incidents. For each one, the fix aims to reduce blast radius while keeping teams moving. If you’re doing a quarterly review, start here.
| Mistake | Why it’s expensive | Practical fix |
|---|---|---|
| “Admin for everyone” (wildcard permissions) | One compromised identity becomes full account takeover | Split roles by job function + add guardrails for “never events” |
| Long-lived access keys for humans | Keys leak and remain valid for months | SSO + short-lived role sessions; disable/delete stale keys |
| Overly permissive trust policies | Attackers can assume roles without owning a key | Constrain principals and add OIDC/issuer/sub/aud conditions |
| No MFA for privileged actions | Password reuse/phishing becomes immediate privilege escalation | Require MFA for admins and role assumption; monitor MFA changes |
| Shared accounts / shared credentials | No attribution; hard to revoke safely | Unique identities, group-based access, session tags for ownership |
| CI/CD roles too broad | Compromised pipeline turns into production breach | Per-repo roles, environment separation, minimal deploy actions |
| Permissions to disable logging or security controls | Attackers cover tracks and extend dwell time | Explicit deny/guardrail; central immutable log account/project |
| Orphaned roles and policies | Old privileges become new attack paths | Owner tags, expiry dates, and automated cleanup/attestation |
Mistake: Treating “least privilege” as a one-time project
Teams change. Services evolve. Permissions that were correct six months ago can become either too broad or too narrow.
- Fix: schedule periodic attestations and tie ownership to roles/policies.
- Fix: use usage data (audit logs) to shrink permissions iteratively.
- Fix: keep a documented break-glass path to avoid “attach admin” emergencies.
Mistake: Assuming “resource names” are enough scope
If your environment naming is inconsistent, “scope by ARN” becomes brittle and developers reach for wildcards.
- Fix: standardize naming and add tags/labels to enforce ABAC-style access.
- Fix: use conditions (tags, prefixes, source identity, VPC endpoints) to add safety.
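In AWS policy terms, an ABAC-style condition can require the caller's principal tag to match the resource's tag, so scope follows tags instead of brittle names. This sketch builds such a statement; the `team` tag key and bucket name are assumptions for illustration:

```python
import json

# Sketch of an ABAC-style statement: the caller may touch an object only
# when their principal tag "team" matches the object's "team" tag.
# Tag key and bucket name are illustrative, not a recommendation.
abac_statement = {
    "Sid": "TeamTagMustMatch",
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::shared-data/*",
    "Condition": {
        "StringEquals": {
            "s3:ExistingObjectTag/team": "${aws:PrincipalTag/team}"
        }
    },
}
print(json.dumps(abac_statement, indent=2))
```

The payoff: onboarding a new team means tagging its identities and data, not writing a new policy per team.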
Mistake: Giving vendors broad access “temporarily”
Temporary integrations often become permanent. Attackers love third-party footholds.
- Fix: isolate vendors in dedicated roles with strict trust and resource scope.
- Fix: add expiration dates and require re-approval for extensions.
- Fix: monitor vendor session activity separately with higher scrutiny.
Mistake: Not separating prod from non-prod
Many incidents start in dev/test and escalate because the same role can touch production.
- Fix: separate accounts/projects/subscriptions for prod where possible.
- Fix: require stronger controls for prod assumption (MFA, approvals, time limits).
Many cloud takeovers involve a chain: get any foothold, escalate via IAM, then open network paths and disable logging to persist. Put your tightest controls and monitoring on those permission areas.
FAQ
What does “least privilege” mean in practice?
Least privilege means granting only the actions and resource scope needed for a specific task, ideally with conditions (tags, prefixes, time, network location). In practice, teams implement it iteratively: start with a bounded role, measure actual usage via audit logs, then shrink permissions and add guardrails.
Are long-lived access keys always bad?
They’re high-risk and should be minimized. Some legacy systems and vendor integrations still require long-lived keys, but they should be tightly scoped, rotated routinely, monitored for anomalous use, and stored in a proper secrets manager. For humans and CI/CD, prefer short-lived sessions.
What’s the difference between a permission policy and a trust policy?
A permission policy says what actions are allowed on what resources. A trust policy says who is allowed to assume a role (become that identity). Both matter: a least-privilege role is still dangerous if its trust policy allows unexpected principals to assume it.
How do we secure CI/CD without slowing deployments?
Use workload identity (OIDC) + role assumption instead of stored credentials, create per-repo/per-environment roles, and scope permissions to deployment actions only. This usually improves speed because developers stop requesting manual keys and the pipeline becomes easier to reason about.
What IAM changes should trigger alerts?
Prioritize high-signal events: new access keys, policy changes (especially broad grants), role trust updates, MFA disabled, and any attempt to disable or modify audit logging. These map to attacker workflows and reduce mean time to detect.
How often should we review IAM?
A solid baseline is: continuous checks in CI for IaC, monthly cleanup for keys and unused roles, and quarterly access attestation for privileged roles. If you’ve had an incident, increase cadence temporarily until hygiene is restored.
What’s the best “break-glass” approach?
Keep a small number of emergency accounts/roles with strong MFA, limited access paths, and heavy monitoring. They should not be used for daily operations, and every use should create an incident-style audit trail with explicit approval and post-use review.
Cheatsheet
Use this during design reviews, IaC PRs, vendor onboarding, and incident cleanup. It’s deliberately short and opinionated.
IAM posture checklist
- Humans use SSO + MFA; no shared accounts
- CI/CD uses OIDC workload identity; no stored cloud keys
- Workloads have dedicated service identities per service/environment
- Roles/policies have owner + purpose tags and an expiry/attestation schedule
- Privileged actions require MFA/approval and time-limited elevation
- Guardrails deny never events (disable logs, make sensitive data public, broad IAM changes)
- Audit logs are centralized and protected from deletion/modification
Least privilege “rules of thumb”
- Scope by resource prefix and/or tags (ABAC)
- Prefer explicit allow lists of actions
- Avoid wildcards; if unavoidable, add strong conditions
- Separate read and write roles; separate prod from non-prod
- Keep an emergency path, but make it rare, monitored, and reviewed
| Common problem | What it looks like | Fix pattern |
|---|---|---|
| Overbroad policy | "*" actions/resources; “Administrator” attached widely | Split roles by job; scope resources; add conditions; apply guardrails |
| Key sprawl | Old keys, keys with unknown owners, keys in CI secrets | SSO/OIDC; rotate; disable stale keys; require owner tags |
| Weak trust | Role assumption allowed from broad principals | Constrain issuer/aud/sub; per-repo roles; require MFA for humans |
| No separation of environments | Dev identity can touch prod resources | Separate accounts/projects; separate roles; stronger prod controls |
| Poor visibility | Hard to trace actions back to a person/workload | Centralize logs; session tags; alert on IAM changes |
If a role’s permissions or trust policy can’t be explained in one sentence (“Repo X deploys Service Y to Prod”), it’s probably too broad. Break it up into smaller roles with clear ownership.
Wrap-up
IAM mistakes are expensive because they multiply risk: one leaked credential becomes access to many systems, and weak trust policies create invisible back doors. The fixes aren’t glamorous—roles, conditions, guardrails, key hygiene, and logging—but they’re the controls that consistently prevent “million-dollar” incidents.
Do this next (today)
- Find and remove wildcard policies (or bound them with conditions)
- Disable stale human access keys; migrate humans to SSO
- Review top 10 privileged roles: permissions + trust policies
- Turn on alerts for access key creation and policy/trust changes
Do this next (this week)
- Move CI/CD to OIDC workload identity and eliminate stored cloud keys
- Add org-wide guardrails for “never events” (logging, public exposure, broad IAM changes)
- Create a documented, monitored break-glass path for emergencies
- Start monthly IAM cleanup: ownership tags, stale roles, and key rotation
The best IAM program doesn’t rely on perfect humans. It builds safe defaults, self-serve workflows, and guardrails that catch mistakes. If your policy strategy requires everyone to be a cloud IAM expert, it won’t survive growth.
Want to go deeper? The related posts below cover how apps get hacked, threat modeling templates, and DevSecOps practices that make IAM hygiene part of your delivery pipeline—without chaos.