IAM is the control plane for your cloud. When it’s wrong, “one leaked key” turns into admin access, data exfiltration, or a surprise crypto-mining bill. This post breaks down the IAM mistakes that repeatedly show up in breaches and costly incidents—and the practical fixes you can implement without turning your organization into a ticket factory.
Quickstart
If you only have 30–60 minutes, do these in order. They deliver the biggest risk reduction per unit of effort and reduce the most common “million-dollar” failure modes: over-privileged identities and long-lived credentials.
The 6 fastest wins
- Turn on MFA everywhere (human users, especially admins) and lock down break-glass accounts.
- Stop using long-lived access keys for humans; move to SSO + short-lived role sessions.
- Inventory your principals: users, roles, service accounts, CI identities, and third-party integrations.
- Remove “*:*” policies and split admin from day-to-day operator roles.
- Add guardrails (Org policies / SCPs / conditional access) for “never events” like public storage or disabling logs.
- Centralize audit logs (cloud audit + identity provider logs) and alert on key IAM events.
Red flags to search for today
- Policies with "*" actions or "*" resources without strong conditions
- Access keys older than 90 days (or never rotated)
- Roles that can be assumed by any principal (wildcard trust)
- “Administrator” permissions attached to CI/CD, bots, or service accounts
- Permissions to disable logging, change IAM, or modify network boundaries broadly
- Shared accounts and shared credentials (no identity attribution)
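The first two red flags are easy to triage mechanically over exported policy documents. Here's a minimal sketch in plain Python (no cloud SDK; the sample policy is hypothetical) that flags wildcard actions and unconditioned wildcard resources:

```python
import json

def find_red_flags(policy: dict) -> list[str]:
    """Return human-readable red flags for one IAM policy document."""
    flags = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        sid = stmt.get("Sid", "<no Sid>")
        if any(a == "*" or a.endswith(":*") for a in actions):
            flags.append(f"{sid}: wildcard action {actions}")
        if "*" in resources and not stmt.get("Condition"):
            flags.append(f"{sid}: Resource '*' with no Condition")
    return flags

# Hypothetical over-broad policy for illustration
policy = json.loads("""{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "TooBroad", "Effect": "Allow", "Action": "s3:*", "Resource": "*"}
  ]
}""")
for flag in find_red_flags(policy):
    print(flag)
```

In practice you'd feed this from your provider's policy-export API and treat each flag as a review item, not an automatic failure.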
A strong IAM posture is not only about blocking attackers. It’s also about traceability: every action should map back to a specific identity, a specific session, and a specific justification (role/policy).
Quick decision table
| If you’re doing… | Preferred identity method | Why it’s safer |
|---|---|---|
| Human console/CLI access | SSO + short-lived role sessions | Central control, MFA, and fast revocation |
| CI/CD deployments | Workload identity (OIDC) + role assumption | No stored secrets, scoped to repository/workflow |
| Service-to-service calls | Service identity + role-based permissions | Least privilege with clear ownership and rotation |
| Third-party integrations | Dedicated role + constrained trust policy | Limits blast radius and simplifies offboarding |
Overview
Most “cloud breaches” are not about zero-days. They’re about identity and access: an attacker finds a credential, a token, or a misconfigured trust relationship, then uses it to move laterally and escalate privileges. That’s why IAM mistakes can cost companies millions—sometimes from direct fraud and data loss, sometimes from incident response, regulatory obligations, downtime, and reputational damage.
What this post covers
- The IAM mistakes that show up repeatedly (over-permissioning, trust policy bugs, key sprawl)
- How to fix them with concrete patterns (roles, conditions, guardrails, and auditing)
- How to design IAM so security doesn’t block delivery (self-service with boundaries)
- A scan-fast cheatsheet for ongoing reviews
Who this is for
- Developers shipping cloud workloads and CI/CD pipelines
- Platform/DevOps teams maintaining “shared” infrastructure
- Security teams hardening cloud environments without breaking teams
- Anyone doing post-incident cleanup (and wanting it to stick)
AWS IAM, Azure Entra ID/RBAC, and Google Cloud IAM all implement the same fundamental ideas: principals, roles/permissions, resource scope, and conditions. The patterns below are transferable even if your exact policy syntax changes.
Core concepts
IAM is easier when you think in a simple sentence: “Which principal can do what action on which resource under what conditions?” Every policy system is a variation of this.
Identity vs credentials vs permissions
Identity (principal)
A “who”: user, group, role, service account, workload identity, or external principal (IdP / partner).
- Humans should be unique identities (no sharing)
- Workloads should have dedicated service identities
- Third parties should get isolated identities/roles
Credentials (how you prove “who”)
Passwords, MFA factors, access keys, certificates, OIDC tokens, session tokens.
- Prefer short-lived tokens over long-lived keys
- MFA is mandatory for privileged actions
- Rotate/revoke credentials as an operational routine
Roles and session-based access (the “golden path”)
In mature cloud setups, humans and workloads rarely hold long-lived secrets. Instead they assume roles and receive a time-limited session that carries permissions. This dramatically reduces the blast radius of a leaked credential and makes offboarding simpler.
Two policy planes you must understand
| Plane | Controls | Common bug | What to verify |
|---|---|---|---|
| Permission policy | What actions are allowed/denied | Overbroad actions/resources (wildcards) | Scope, conditions, and explicit denies/guardrails |
| Trust policy (assume role) | Who can become the role | Wildcard principals / weak conditions | Allowed principals, audience, repo/workflow constraints, MFA requirements |
Least privilege (and why teams struggle with it)
Least privilege means granting the minimum permissions needed to perform a task. Teams often fail here because it’s tempting to “just attach admin” to get unblocked. The trick is to create developer-friendly roles that are still bounded: narrow resource scopes, time-limited elevation, and safe self-service.
- Start broad enough to function, but still bounded by resource prefixes/tags.
- Use audit logs to discover the true set of used actions.
- Iterate: reduce permissions, add conditions, and keep a break-glass path.
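The "iterate" step can be mechanized as a set difference: compare the action patterns a policy grants against the actions actually observed in audit logs over a review window. A rough sketch (the action lists are illustrative; real usage data would come from CloudTrail or your provider's equivalent):

```python
from fnmatch import fnmatchcase

def shrink_candidates(allowed_patterns: list[str], observed: set[str]) -> list[str]:
    """Allowed action patterns never exercised during the audit window.

    Patterns may contain wildcards (e.g. "s3:Put*"), matched shell-style
    against concrete action names seen in the logs.
    """
    return [p for p in allowed_patterns
            if not any(fnmatchcase(a, p) for a in observed)]

allowed = ["s3:GetObject", "s3:Put*", "s3:DeleteObject"]
observed = {"s3:GetObject", "s3:PutObject"}  # e.g. mined from 90 days of logs

for action in shrink_candidates(allowed, observed):
    print(f"candidate for removal: {action}")
```

Treat the output as candidates, not certainties: rarely used actions (disaster recovery, year-end jobs) may legitimately be absent from a 90-day window.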
Guardrails: “deny the truly dangerous stuff”
Guardrails (Org policies / SCPs / conditional access / permission boundaries) enforce global rules even if a team accidentally grants too much. This is how you prevent “never events” like disabling audit logs or making sensitive storage public.
Over time, somebody will attach an overly broad allow. Guardrails are your safety net: a small set of explicit denies for high-risk actions (turning off logs, changing IAM broadly, or opening public access) will save you when policy hygiene slips.
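In AWS terms, one way to express such a guardrail is an explicit-deny statement applied org-wide as a service control policy. This sketch builds an SCP-style document denying audit-log tampering; the action list is an illustrative subset, not a complete "never events" catalogue:

```python
import json

# Illustrative SCP-style guardrail: deny tampering with audit logging
# org-wide, regardless of what member-account policies allow.
# The action list is a sketch, not an exhaustive set.
guardrail = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAuditLogTampering",
            "Effect": "Deny",
            "Action": [
                "cloudtrail:StopLogging",
                "cloudtrail:DeleteTrail",
                "cloudtrail:UpdateTrail",
            ],
            "Resource": "*",
        }
    ],
}
print(json.dumps(guardrail, indent=2))
```

Because an explicit deny wins over any allow, this holds even when a team accidentally attaches admin permissions inside a member account.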
Step-by-step
Here’s a practical hardening path you can run as a lightweight “IAM sprint.” It’s designed to be realistic: you’ll improve posture while keeping teams productive. Each step includes what to do, why it works, and what to watch out for.
Step 1 — Inventory principals and entry points
You can’t fix IAM if you don’t know who exists. Start with an inventory you can review monthly.
What to inventory
- Human users (cloud console + IdP)
- Privileged roles (admins, billing, security)
- Workload identities (compute, k8s, serverless)
- CI/CD identities and third-party integrations
- Long-lived credentials (keys, certificates)
What “good” looks like
- Humans use SSO + MFA; no shared accounts
- Workloads use role/session identity (no embedded secrets)
- Each integration has a dedicated role and owner
- Every principal has tags/labels: owner, purpose, environment
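The "tags on every principal" requirement is easy to check mechanically once you have an inventory export. A sketch (the record shape and required tag keys are assumptions for illustration):

```python
REQUIRED_TAGS = {"owner", "purpose", "environment"}

def missing_tags(principals: list[dict]) -> dict[str, set[str]]:
    """Map principal name -> required tag keys it is missing."""
    gaps = {}
    for p in principals:
        missing = REQUIRED_TAGS - set(p.get("tags", {}))
        if missing:
            gaps[p["name"]] = missing
    return gaps

# Hypothetical inventory export
principals = [
    {"name": "role/ci-deploy",
     "tags": {"owner": "platform", "purpose": "deploy", "environment": "prod"}},
    {"name": "user/legacy-bot", "tags": {"owner": "unknown"}},
]
for name, gaps in missing_tags(principals).items():
    print(f"{name} is missing tags: {sorted(gaps)}")
```

Run this monthly: principals with gaps are exactly the ones nobody will know how to clean up during an incident.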
Step 2 — Fix login posture and break-glass access
Before you chase fine-grained permissions, close the obvious doors: missing MFA, shared admin users, and unmanaged “root”/tenant owner access.
Baseline requirements (non-negotiable)
- MFA required for all human accounts; stronger factors for admins
- Break-glass accounts exist, but are locked down and monitored (no daily use)
- Central IdP is the source of truth for access; offboarding is one switch
- Admin actions are limited to dedicated admin roles, not day-to-day accounts
Step 3 — Replace long-lived keys with role/session identity
The fastest way to reduce credential-leak risk is to stop minting permanent access keys. Use SSO for humans and workload identity for automation. This makes secrets scanning less stressful because fewer secrets exist.
Long-lived keys leak through Git commits, CI logs, pastebins, laptops, chat screenshots, and vendor tickets. Short-lived sessions don’t eliminate risk, but they drastically shorten the window and simplify rotation/revocation.
Example 1: A minimal policy (least privilege) for a specific bucket prefix
Don’t hand out “s3:* on *” for a service that only reads from one prefix. Scope actions and resources, and add conditions when you can (tags, prefixes, VPC endpoints, source identity).
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadOnlyFromAppPrefix",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetObjectVersion"
],
"Resource": "arn:aws:s3:::my-company-data/app-a/*"
},
{
"Sid": "ListOnlyWithinAppPrefix",
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::my-company-data",
"Condition": {
"StringLike": {
"s3:prefix": [
"app-a/*"
]
}
}
}
]
}
Step 4 — Lock down trust policies (who can assume roles)
Permission policies get a lot of attention, but trust policies are where attackers slip in. A role with great least-privilege permissions is still dangerous if anyone can assume it.
Trust policy checks
- No wildcard principals for assumption
- Constrain federation (audience, issuer, subject)
- Require MFA for privileged role assumption (for humans)
- Use separate roles per environment (dev/stage/prod)
Common “oops” patterns
- Any GitHub repo can assume the deploy role
- Any workload in the cluster can assume a high-priv role
- Third-party role trust is left open after a pilot
- Role chaining without boundaries (easy escalation)
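The "any GitHub repo can assume the deploy role" oops usually comes from a wildcard in the trust condition. String-match trust conditions evaluate like shell-style globs over the token's claims, which this toy check approximates (plain Python; the claim values are made up). It shows why `repo:my-org/*` is a far weaker constraint than a full repo-and-branch pattern:

```python
from fnmatch import fnmatchcase

def sub_allowed(pattern: str, sub_claim: str) -> bool:
    """Approximate a StringLike trust condition on the OIDC 'sub' claim."""
    return fnmatchcase(sub_claim, pattern)

tight = "repo:my-org/my-repo:ref:refs/heads/main"
loose = "repo:my-org/*"

# A workflow running in some other repo in the same org:
attacker = "repo:my-org/forked-sandbox:ref:refs/heads/main"

assert not sub_allowed(tight, attacker)  # exact pattern rejects other repos
assert sub_allowed(loose, attacker)      # wildcard lets any org repo assume
```

The wildcard matches across path and ref segments too, so `repo:my-org/*` admits every repo and every branch in the org.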
Example 2: Terraform sketch for CI/CD using OIDC (no stored access keys)
This pattern avoids long-lived secrets in CI by letting the workflow exchange an OIDC token for a time-limited role session. The critical part is the trust policy conditions that scope assumption to a specific repo/branch/workflow.
resource "aws_iam_role" "github_actions_deploy" {
name = "github-actions-deploy-prod"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = {
Federated = aws_iam_openid_connect_provider.github.arn
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
}
StringLike = {
# Lock to a repo and branch (tighten further to workflow if you can)
"token.actions.githubusercontent.com:sub" = "repo:my-org/my-repo:ref:refs/heads/main"
}
}
}
]
})
}
resource "aws_iam_role_policy" "deploy_permissions" {
role = aws_iam_role.github_actions_deploy.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowDeployToSpecificResources"
Effect = "Allow"
Action = [
"ecs:UpdateService",
"ecs:DescribeServices",
"iam:PassRole"
]
Resource = [
"arn:aws:ecs:us-east-1:123456789012:service/prod/*",
"arn:aws:iam::123456789012:role/prod-ecs-task-role"
]
}
]
})
}
Step 5 — Implement permission boundaries and “just-in-time” elevation
The best way to avoid chaos is to make the secure thing the easy thing. Instead of forcing every team to become IAM experts, provide: pre-approved roles, permission boundaries, and a temporary elevation path for rare admin tasks.
A workable model for most orgs
- Day-to-day roles: scoped per team/service/environment
- Elevation role: time-limited, requires MFA/approval, heavy logging
- Boundaries/guardrails: deny “never events” regardless of team policies
- Ownership metadata: every role/policy has an owner and purpose tag
Step 6 — Kill key sprawl with rotation, detection, and automation
Some keys will still exist (vendor integrations, legacy systems). Make them safe: reduce scope, rotate regularly, and monitor for abnormal use. “Set and forget” is what turns a small leak into a big incident.
Example 3: Find old access keys and disable unused ones (AWS CLI)
This is a simple starting point for cleanup. In production, pair this with approvals and a safe rollback plan (disable first, then delete). Always coordinate with service owners to avoid breaking workloads.
# List access keys and their creation dates for all IAM users
aws iam list-users --query 'Users[].UserName' --output text | tr '\t' '\n' | while read -r user; do
aws iam list-access-keys --user-name "$user" \
--query 'AccessKeyMetadata[].{User:UserName,KeyId:AccessKeyId,Status:Status,Created:CreateDate}' \
--output table
done
# Check last-used date for a specific key (helps find stale keys)
aws iam get-access-key-last-used --access-key-id AKIAEXAMPLEKEYID
# Disable a key (safer first step than deleting)
aws iam update-access-key --user-name some-user --access-key-id AKIAEXAMPLEKEYID --status Inactive
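If you prefer to script the age check rather than eyeball tables, the same logic is a few lines of plain Python over an exported key list. The records, dates, and 90-day threshold below are illustrative; in practice you'd build the list from `list-access-keys` output:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=90)

def stale_keys(keys: list[dict], now: datetime) -> list[str]:
    """Key IDs older than MAX_AGE: candidates to disable first, delete later."""
    return [k["KeyId"] for k in keys if now - k["Created"] > MAX_AGE]

# Hypothetical export of access-key metadata
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
keys = [
    {"KeyId": "AKIAOLDEXAMPLE", "Created": datetime(2023, 1, 15, tzinfo=timezone.utc)},
    {"KeyId": "AKIAFRESHEXAMPLE", "Created": datetime(2024, 5, 20, tzinfo=timezone.utc)},
]
print(stale_keys(keys, now))
```

Cross-reference the result with last-used data before disabling anything: an old key that is still in daily use needs a migration plan, not a surprise outage.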
Step 7 — Log, alert, and rehearse response
Even with good policies, you need detection. The IAM events you alert on should map to “attack progress” steps: new credentials, privilege escalation, persistence, and log tampering.
High-signal alerts
- New access keys created (especially for privileged users)
- Policy attached/updated with broad privileges
- Role trust policy changed
- MFA disabled, password policy weakened
- Audit logging disabled or modified
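These alerts can start life as a simple event-name match over your audit stream, long before you invest in a full detection pipeline. A sketch (event names follow CloudTrail conventions; the mapping is illustrative, not complete):

```python
# Illustrative mapping of high-signal IAM events to alert categories.
# Event names follow CloudTrail conventions; extend for your environment.
HIGH_SIGNAL = {
    "CreateAccessKey": "new-credentials",
    "AttachUserPolicy": "privilege-change",
    "AttachRolePolicy": "privilege-change",
    "UpdateAssumeRolePolicy": "trust-change",
    "DeactivateMFADevice": "mfa-weakened",
    "StopLogging": "log-tampering",
    "DeleteTrail": "log-tampering",
}

def classify(event_name):
    """Return an alert category for a high-signal event, else None."""
    return HIGH_SIGNAL.get(event_name)

for event in ["CreateAccessKey", "DescribeInstances", "StopLogging"]:
    print(event, "->", classify(event))
```

Even this crude match covers the attacker workflow the section describes: new credentials, privilege escalation, trust changes, and log tampering.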
Incident drill checklist
- Can you revoke sessions quickly (SSO/IdP kill switch)?
- Do you know where logs are centralized and immutable?
- Do you have “break-glass” access that’s monitored?
- Can you rotate keys without downtime (or with an acceptable plan)?
Teams bypass controls when the “right way” is slow. Provide standard roles, templates, and automated review checks (IaC scanning, policy linting), and you’ll reduce both risk and friction.
Common mistakes
These are the IAM mistakes that repeatedly show up in real incidents. For each one, the fix aims to reduce blast radius while keeping teams moving. If you’re doing a quarterly review, start here.
| Mistake | Why it’s expensive | Practical fix |
|---|---|---|
| “Admin for everyone” (wildcard permissions) | One compromised identity becomes full account takeover | Split roles by job function + add guardrails for “never events” |
| Long-lived access keys for humans | Keys leak and remain valid for months | SSO + short-lived role sessions; disable/delete stale keys |
| Overly permissive trust policies | Attackers can assume roles without owning a key | Constrain principals and add OIDC/issuer/sub/aud conditions |
| No MFA for privileged actions | Password reuse/phishing becomes immediate privilege escalation | Require MFA for admins and role assumption; monitor MFA changes |
| Shared accounts / shared credentials | No attribution; hard to revoke safely | Unique identities, group-based access, session tags for ownership |
| CI/CD roles too broad | Compromised pipeline turns into production breach | Per-repo roles, environment separation, minimal deploy actions |
| Permissions to disable logging or security controls | Attackers cover tracks and extend dwell time | Explicit deny/guardrail; central immutable log account/project |
| Orphaned roles and policies | Old privileges become new attack paths | Owner tags, expiry dates, and automated cleanup/attestation |
Mistake: Treating “least privilege” as a one-time project
Teams change. Services evolve. Permissions that were correct six months ago can become either too broad or too narrow.
- Fix: schedule periodic attestations and tie ownership to roles/policies.
- Fix: use usage data (audit logs) to shrink permissions iteratively.
- Fix: keep a documented break-glass path to avoid “attach admin” emergencies.
Mistake: Assuming “resource names” are enough scope
If your environment naming is inconsistent, “scope by ARN” becomes brittle and developers reach for wildcards.
- Fix: standardize naming and add tags/labels to enforce ABAC-style access.
- Fix: use conditions (tags, prefixes, source identity, VPC endpoints) to add safety.
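In AWS policy terms, an ABAC-style condition can require the caller's principal tag to match the resource's tag, so scope follows tags instead of brittle names. This sketch builds such a statement; the `team` tag key and bucket name are assumptions for illustration:

```python
import json

# Sketch of an ABAC-style statement: the caller may touch an object only
# when their principal tag "team" matches the object's "team" tag.
# Tag key and bucket name are illustrative, not a recommendation.
abac_statement = {
    "Sid": "TeamTagMustMatch",
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::shared-data/*",
    "Condition": {
        "StringEquals": {
            "s3:ExistingObjectTag/team": "${aws:PrincipalTag/team}"
        }
    },
}
print(json.dumps(abac_statement, indent=2))
```

The payoff: onboarding a new team means tagging its identities and data, not writing a new policy per team.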
Mistake: Giving vendors broad access “temporarily”
Temporary integrations often become permanent. Attackers love third-party footholds.
- Fix: isolate vendors in dedicated roles with strict trust and resource scope.
- Fix: add expiration dates and require re-approval for extensions.
- Fix: monitor vendor session activity separately with higher scrutiny.
Mistake: Not separating prod from non-prod
Many incidents start in dev/test and escalate because the same role can touch production.
- Fix: separate accounts/projects/subscriptions for prod where possible.
- Fix: require stronger controls for prod assumption (MFA, approvals, time limits).
Many cloud takeovers involve a chain: get any foothold, escalate via IAM, then open network paths and disable logging to persist. Put your tightest controls and monitoring on those permission areas.
FAQ
What does “least privilege” mean in practice?
Least privilege means granting only the actions and resource scope needed for a specific task, ideally with conditions (tags, prefixes, time, network location). In practice, teams implement it iteratively: start with a bounded role, measure actual usage via audit logs, then shrink permissions and add guardrails.
Are long-lived access keys always bad?
They’re high-risk and should be minimized. Some legacy systems and vendor integrations still require long-lived keys, but they should be tightly scoped, rotated routinely, monitored for anomalous use, and stored in a proper secrets manager. For humans and CI/CD, prefer short-lived sessions.
What’s the difference between a permission policy and a trust policy?
A permission policy says what actions are allowed on what resources. A trust policy says who is allowed to assume a role (become that identity). Both matter: a least-privilege role is still dangerous if its trust policy allows unexpected principals to assume it.
How do we secure CI/CD without slowing deployments?
Use workload identity (OIDC) + role assumption instead of stored credentials, create per-repo/per-environment roles, and scope permissions to deployment actions only. This usually improves speed because developers stop requesting manual keys and the pipeline becomes easier to reason about.
What IAM changes should trigger alerts?
Prioritize high-signal events: new access keys, policy changes (especially broad grants), role trust updates, MFA disabled, and any attempt to disable or modify audit logging. These map to attacker workflows and reduce mean time to detect.
How often should we review IAM?
A solid baseline is: continuous checks in CI for IaC, monthly cleanup for keys and unused roles, and quarterly access attestation for privileged roles. If you’ve had an incident, increase cadence temporarily until hygiene is restored.
What’s the best “break-glass” approach?
Keep a small number of emergency accounts/roles with strong MFA, limited access paths, and heavy monitoring. They should not be used for daily operations, and every use should create an incident-style audit trail with explicit approval and post-use review.
Cheatsheet
Use this during design reviews, IaC PRs, vendor onboarding, and incident cleanup. It’s deliberately short and opinionated.
IAM posture checklist
- Humans use SSO + MFA; no shared accounts
- CI/CD uses OIDC workload identity; no stored cloud keys
- Workloads have dedicated service identities per service/environment
- Roles/policies have owner + purpose tags and an expiry/attestation schedule
- Privileged actions require MFA/approval and time-limited elevation
- Guardrails deny never events (disable logs, make sensitive data public, broad IAM changes)
- Audit logs are centralized and protected from deletion/modification
Least privilege “rules of thumb”
- Scope by resource prefix and/or tags (ABAC)
- Prefer explicit allow lists of actions
- Avoid wildcards; if unavoidable, add strong conditions
- Separate read and write roles; separate prod from non-prod
- Keep an emergency path, but make it rare, monitored, and reviewed
| Common problem | What it looks like | Fix pattern |
|---|---|---|
| Overbroad policy | "*" actions/resources; “Administrator” attached widely | Split roles by job; scope resources; add conditions; apply guardrails |
| Key sprawl | Old keys, keys with unknown owners, keys in CI secrets | SSO/OIDC; rotate; disable stale keys; require owner tags |
| Weak trust | Role assumption allowed from broad principals | Constrain issuer/aud/sub; per-repo roles; require MFA for humans |
| No separation of environments | Dev identity can touch prod resources | Separate accounts/projects; separate roles; stronger prod controls |
| Poor visibility | Hard to trace actions back to a person/workload | Centralize logs; session tags; alert on IAM changes |
If a role’s permissions or trust policy can’t be explained in one sentence (“Repo X deploys Service Y to Prod”), it’s probably too broad. Break it up into smaller roles with clear ownership.
Wrap-up
IAM mistakes are expensive because they multiply risk: one leaked credential becomes access to many systems, and weak trust policies create invisible back doors. The fixes aren’t glamorous—roles, conditions, guardrails, key hygiene, and logging—but they’re the controls that consistently prevent “million-dollar” incidents.
Do this next (today)
- Find and remove wildcard policies (or bound them with conditions)
- Disable stale human access keys; migrate humans to SSO
- Review top 10 privileged roles: permissions + trust policies
- Turn on alerts for access key creation and policy/trust changes
Do this next (this week)
- Move CI/CD to OIDC workload identity and eliminate stored cloud keys
- Add org-wide guardrails for “never events” (logging, public exposure, broad IAM changes)
- Create a documented, monitored break-glass path for emergencies
- Start monthly IAM cleanup: ownership tags, stale roles, and key rotation
The best IAM program doesn’t rely on perfect humans. It builds safe defaults, self-serve workflows, and guardrails that catch mistakes. If your policy strategy requires everyone to be a cloud IAM expert, it won’t survive growth.
Want to go deeper? The related posts below cover how apps get hacked, threat modeling templates, and DevSecOps practices that make IAM hygiene part of your delivery pipeline—without chaos.