Infrastructure Drift: How to Detect It and What to Do About It

The Drift Problem

You've got beautiful Terraform code, well-organized modules, everything documented. Then someone makes a "quick fix" in the AWS console, and suddenly your code doesn't match reality.

That's drift. It starts small and grows until you have no idea what's actually running.

Why Drift Happens

The Usual Suspects

Emergency fixes: Production is down, someone fixes it manually
Console convenience: It's faster to click than to write code
Automated processes: Auto-scaling changes capacity settings that Terraform also tracks
Service integrations: AWS services create resources on your behalf
Lack of access control: Too many people with console access

The Cost of Drift

Security gaps: Manually added rules (for example, an open security group) bypass IaC review
Outages: The next terraform apply reverts or replaces manually changed resources
Compliance failures: Auditors find undocumented changes
Lost time: Engineers debugging why environments differ

Detecting Drift

Terraform Plan

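Run terraform plan -detailed-exitcode regularly. Exit code 0 means no changes, 1 means the plan failed, and 2 means the plan succeeded with a non-empty diff. A diff is not always drift: unapplied code changes also produce exit code 2, so run the check against a commit that has already been applied.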
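A minimal sketch of how a script might interpret those exit codes. The function names, the status labels, and the workdir parameter are illustrative and not part of Terraform; only the plan subcommand and the -detailed-exitcode, -input=false, and -no-color flags come from the Terraform CLI.

```python
import subprocess

# Meaning of `terraform plan -detailed-exitcode` exit codes.
# The labels on the right are our own naming, not Terraform's.
PLAN_STATUS = {
    0: "clean",    # plan succeeded, no changes
    1: "error",    # plan failed
    2: "changes",  # plan succeeded with a non-empty diff
}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a status label."""
    return PLAN_STATUS.get(code, "unknown")


def run_plan(workdir: str) -> str:
    """Run terraform plan in workdir and return the status label."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Keeping the classification separate from the subprocess call makes the logic easy to test without Terraform installed.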

Automated Drift Detection

Set up a scheduled pipeline that runs terraform plan -detailed-exitcode every few hours against the applied branch and alerts when the exit code is 2.
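One way a scheduled job could turn a plan into an alert is to save the plan (terraform plan -out=tfplan), render it with terraform show -json tfplan, and summarize the resource_changes array. The sketch below assumes that JSON layout; summarize_drift, build_alert, and the message format are hypothetical names chosen for illustration.

```python
import json
from typing import List, Optional


def summarize_drift(plan_json: str) -> List[str]:
    """Return one line per resource whose planned actions are not a no-op."""
    plan = json.loads(plan_json)
    lines = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions and actions != ["no-op"]:
            lines.append("%s: %s" % (rc["address"], "/".join(actions)))
    return lines


def build_alert(plan_json: str) -> Optional[str]:
    """Build an alert message, or return None when nothing changed."""
    lines = summarize_drift(plan_json)
    if not lines:
        return None
    header = "Drift detected in %d resource(s):" % len(lines)
    return "\n".join([header] + lines)
```

The returned string can be handed to whatever notifier the team already uses (chat webhook, email, ticket); that part is left out because it depends on the environment.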

Third-Party Solutions

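Tools like driftctl, Firefly, env0, and Spacelift offer scheduled drift detection and reporting. Some, such as driftctl, also report resources that exist in the cloud account but are not in Terraform state, which terraform plan cannot see. Note that driftctl is now in maintenance mode, so check its status before adopting it.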

Remediation Strategies

Option 1: Update Infrastructure to Match Code

Run terraform apply to revert manual changes. Warning: Review the plan first. If reverting a change forces a resource to be replaced, the apply can cause downtime.
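As a guard before reverting, a pipeline could refuse to auto-apply when the plan deletes or replaces anything. This is a sketch under the assumption that the plan was rendered with terraform show -json; destructive_changes is a hypothetical helper name.

```python
import json
from typing import List


def destructive_changes(plan_json: str) -> List[str]:
    """List addresses whose planned actions include a delete.

    A replacement appears as ["delete", "create"] or ["create", "delete"],
    so checking for "delete" catches both deletes and replacements.
    """
    plan = json.loads(plan_json)
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]
```

If the list is non-empty, the revert is a candidate for a maintenance window and a human review rather than an unattended apply.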

Option 2: Update Code to Match Infrastructure

Update your Terraform code to include the change, then run terraform plan and confirm it reports no changes.

Option 3: Import Unmanaged Resources

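Write the resource block, run terraform import with the resource address and its provider-specific ID, then adjust the block until terraform plan shows no changes. Terraform 1.5 and later also support import blocks in configuration as an alternative to the CLI command.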

Option 4: Remove from State

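Use terraform state rm to stop managing resources that should be managed elsewhere. This only removes the state entry; the real resource is untouched. Delete the matching resource block as well, or the next plan will propose creating the resource again.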
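Options 3 and 4 both change state rather than infrastructure, so it can help to wrap them in a dry-run helper that prints the command before anything is executed. The sketch below is illustrative only: import_cmd, state_rm_cmd, and run are hypothetical helpers, and the address and bucket name in the usage note are made up.

```python
import subprocess
from typing import List


def import_cmd(address: str, resource_id: str) -> List[str]:
    """argv for `terraform import ADDRESS ID`."""
    return ["terraform", "import", address, resource_id]


def state_rm_cmd(address: str) -> List[str]:
    """argv for `terraform state rm ADDRESS` (the real resource is untouched)."""
    return ["terraform", "state", "rm", address]


def run(cmd: List[str], workdir: str, dry_run: bool = True) -> int:
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if dry_run:
        return 0
    return subprocess.run(cmd, cwd=workdir).returncode
```

For example, run(import_cmd("aws_s3_bucket.logs", "example-log-bucket"), ".") prints the import command without touching state; pass dry_run=False once the resource block has been written and reviewed.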

Preventing Future Drift

Technical Controls

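Restrict console access using IAM policies (read-only by default, write access only through break-glass roles)
Enforce tags that identify IaC-managed resources
Use Service Control Policies (SCPs) to deny manual changes to critical resources across accounts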
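The tagging control is easy to audit once resources carry a marker tag. The sketch below assumes a hypothetical inventory format (a list of dicts with "id" and "tags" keys) and a hypothetical tag convention, ManagedBy=terraform; neither comes from Terraform or AWS, so substitute whatever your inventory export and tagging standard actually use.

```python
from typing import Dict, List


def unmanaged_resources(
    resources: List[Dict],
    tag_key: str = "ManagedBy",      # assumed tag convention
    tag_value: str = "terraform",    # assumed tag convention
) -> List[str]:
    """Return IDs of resources that lack the IaC marker tag."""
    return [
        r["id"]
        for r in resources
        if (r.get("tags") or {}).get(tag_key) != tag_value
    ]
```

Anything this returns was either created outside Terraform or is missing the tag, and both cases are worth a look during a drift audit.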

Process Controls

Document break-glass procedures
Require PR review for all changes
Conduct regular drift audits
Train the team on why IaC matters

Drift Response Playbook

Assess: Is this expected?
Document: Who, what, when, why
Decide: Update code or infrastructure?
Remediate: Make the fix
Verify: Confirm drift resolved
Prevent: How do we stop this recurring?

Key Takeaways

Drift is inevitable; detecting it quickly is what matters
Automated scanning should run at least daily
Prevention through access control is better than detection
Document every drift incident and how it was resolved

Zero drift is unrealistic. Quick detection and consistent remediation? That's achievable.