AWS DynamoDB US-East-1 Outage Explained

October 27, 2025

Overview

On October 19–20, 2025, AWS’s Northern Virginia (US-East-1) region experienced a cascading outage that affected numerous services, starting with DynamoDB and spreading to EC2, Lambda, NLB, and others. This post breaks down the technical causes and explores the underlying concepts.

Official summary

Timeline of Events

Oct 19, 11:48 PM PDT - DynamoDB DNS race condition triggers failure

Oct 20, 2:25 AM PDT - DynamoDB DNS state manually restored

Oct 20, 2:25-2:40 AM PDT - Clients recover as DNS caches expire

Oct 20, 2:25-5:28 AM PDT - DWFM congestive collapse
                           (Selective restarts at 4:14 AM)
                           (Leases re-established by 5:28 AM)

Oct 20, 6:21-10:36 AM PDT - Network Manager propagation delays
                            (New instances launch but lack connectivity)

Oct 20, 5:30 AM-2:09 PM PDT - NLB health check failures
                              (Automatic AZ failover disabled at 9:36 AM)

Oct 20, 1:50 PM PDT - Full EC2 recovery

Root Cause: DNS Race Condition in DynamoDB

The Problem

A race condition in DynamoDB’s DNS management system resulted in an incorrect empty DNS record for the regional endpoint (dynamodb.us-east-1.amazonaws.com), making DynamoDB completely unreachable. The automated recovery system failed to repair this invalid state, requiring manual intervention.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    DynamoDB DNS System                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐                ┌──────────────┐           │
│  │ DNS Planner  │───────────────▶│ DNS Enactor  │           │
│  │              │  Creates Plans │ (Instance 1) │           │
│  │ Monitors     │                └──────────────┘           │
│  │ Load Bal     │                          │                │
│  │ Health &     │                          │                │
│  │ Capacity     │                ┌──────────────┐           │
│  └──────────────┘                │ DNS Enactor  │           │
│           │                      │ (Instance 2) │           │
│           │                      └──────────────┘           │
│           │                                │                │
│           ▼                                ▼                │
│  ┌──────────────────────────────────────────────────┐       │
│  │   Route 53 (Updates DNS Records for Endpoints)   │       │
│  │         dynamodb.us-east-1.amazonaws.com         │       │
│  └──────────────────────────────────────────────────┘       │
│                                                             │
│  Note: Single regional DNS plan                             │
│        DNS Enactors run redundantly across multiple AZs     │
└─────────────────────────────────────────────────────────────┘

How the Race Condition Occurred

Note: This is a conceptual illustration of the race condition based on the AWS post-mortem. The exact implementation details are AWS-internal.

Time    DNS Enactor 1 (Delayed)         DNS Enactor 2 (Fast Path)
─────────────────────────────────────────────────────────────────────
T0      Working on plan #10             Waiting
        Gets delayed on one endpoint...

T1      Still delayed...                Picks up newer plan #15

T2      Still delayed...                Rapidly processes all endpoints
                                        ✓ Endpoint A
                                        ✓ Endpoint B
                                        ✓ Endpoint C
                                        ✓ ... all complete
                                        Invokes cleanup process
                                        (deletes plans < #12)

T3      Finally completes its delayed   Cleanup: Deletes plan #10
        work from plan #10 based on     (appears to be old/unused)
        stale view of state

T4      - Interleaving causes Route 53 to end up with
           incorrect empty DNS record
        - Automation fails to self-repair
        - DynamoDB unreachable, requires manual intervention

Code Example: Conceptual Implementation

Note: This is simplified pseudocode to illustrate the concept, not actual AWS implementation.

// Conceptual DNS Enactor Logic
class DNSEnactor {
	async applyPlan(
		planId: number,
		endpoint: string
	) {
		// Check if this plan is newer than the current one
		const currentPlan = await this.getCurrentPlan(
			endpoint
		);
		if (planId <= currentPlan.version) {
			console.warn("Plan is older, skipping");
			return;
		}

		// Apply the plan to Route 53
		// This can take time and might be delayed
		// Problem: No verification that the plan is still valid when we commit
		await this.updateRoute53(endpoint, planId);
	}

	async cleanupOldPlans(latestPlanId: number) {
		// Find and delete plans that are significantly older
		const plans = await this.getAllPlans();
		for (const plan of plans) {
			if (plan.version < latestPlanId - 3) {
				await this.deletePlan(plan.version);
				// Problem: This can delete a plan that's being applied by a delayed enactor
			}
		}
	}
}

// The Bug: Race Between Apply and Cleanup
// Enactor 1: Reads plan #10, starts applying (gets delayed)
// Enactor 2: Reads plan #15, applies quickly, runs cleanup
// Enactor 2: Deletes plan #10 (seems too old)
// Enactor 1: Finally commits plan #10 based on stale view
// Result: Route 53 ends up with inconsistent/empty record

// The Fix: Compare-and-Set (CAS) semantics
class FixedDNSEnactor {
	async applyPlan(
		planId: number,
		endpoint: string
	) {
		const currentPlan = await this.getCurrentPlan(
			endpoint
		);
		if (planId <= currentPlan.version) {
			return;
		}

		// Use atomic compare-and-set when committing to Route 53
		const success = await this.updateRoute53(
			endpoint,
			planId,
			{
				// Only succeed if version unchanged
				ifCurrentVersionIs: currentPlan.version,
			}
		);

		if (!success) {
			console.warn(
				"Plan version changed during apply, aborting"
			);
		}
	}

	async cleanupOldPlans(latestPlanId: number) {
		const plans = await this.getAllPlans();
		for (const plan of plans) {
			if (plan.version < latestPlanId - 3) {
				// Only delete if not actively being applied
				const isActive =
					await this.checkIfBeingApplied(
						plan.version
					);
				if (!isActive) {
					await this.deletePlan(plan.version);
				}
			}
		}
	}
}

Concept Deep Dive: Leases

Technical context

To understand what happened, we need to share some information about a few subsystems that are used for the management of EC2 instance launches, as well as for configuring network connectivity for newly launched EC2 instances.

The first subsystem is DropletWorkflow Manager (DWFM), which is responsible for the management of all the underlying physical servers that are used by EC2 for the hosting of EC2 instances – we call these servers “droplets”.

The second subsystem is Network Manager, which is responsible for the management and propagation of network state to all EC2 instances and network appliances. Each DWFM manages a set of droplets within each Availability Zone and maintains a lease for each droplet currently under management. This lease allows DWFM to track the droplet state, ensuring that all actions from the EC2 API or within the EC2 instance itself, such as shutdown or reboot operations originating from the EC2 instance operating system, result in the correct state changes within the broader EC2 systems. As part of maintaining this lease, each DWFM host has to check in and complete a state check with each droplet that it manages every few minutes.

What is a Lease?

A lease is a temporary ownership claim or lock on a resource. In AWS EC2’s case, it’s the agreement between DropletWorkflow Manager (DWFM) and each physical server (“droplet”) that tracks server state and availability.

Visual Representation

┌──────────────────────────────────────────────────────────┐
│         DropletWorkflow Manager (DWFM) Fleet             │
│      [Multiple DWFM hosts per AZ, each managing          │
│       a set of droplets within that AZ]                  │
└─────────────────┬────────────────────────────────────────┘
                  │ Lease Check (every few minutes)
                  │ "Are you still alive? Ready for work?"
                  │ State checks depend on DynamoDB


┌─────────────────────────────────────────────────────────┐
│              Physical Servers (Droplets)                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐ │
│  │ Droplet 1│  │ Droplet 2│  │ Droplet 3│  │ Droplet N│ │
│  │  Lease:  │  │  Lease:  │  │  Lease:  │  │  Lease:  │ │
│  │  Active  │  │  Active  │  │ Expired  │  │  Active  │ │
│  │          │  │          │  │          │  │          │ │
│  │ Status:  │  │ Status:  │  │ Unusable │  │ Status:  │ │
│  │ Ready    │  │ In-Use   │  │ for new  │  │ Ready    │ │
│  │          │  │ (2 VMs)  │  │ launches │  │          │ │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘ │
└─────────────────────────────────────────────────────────┘

How Leases Work

// Conceptual Lease Mechanism
interface Lease {
	status: "active" | "degraded";
	expiryTime: number;
}

const LEASE_TIMEOUT = 5 * 60 * 1000; // lease must be renewed within e.g. 5 minutes

class DropletWorkflowManager {
	private leases = new Map<string, Lease>(); // droplet_id → lease

	// Periodic heartbeat to renew leases
	async renewLease(
		dropletId: string
	): Promise<void> {
		try {
			// Check if droplet is healthy
			const health =
				await this.checkDropletHealth(dropletId);

			// DWFM's state checks depend on DynamoDB
			// During the outage, this call failed due to DNS issues
			await this.completeStateCheckWithDynamoDB(
				dropletId,
				{
					status: health ? "active" : "degraded",
					lastCheckIn: Date.now(),
					expiryTime: Date.now() + LEASE_TIMEOUT, // e.g., 5 minutes
				}
			);

			this.leases.set(dropletId, {
				status: "active",
				expiryTime: Date.now() + LEASE_TIMEOUT,
			});
		} catch (error) {
			console.error(
				`Failed to renew lease for ${dropletId}`
			);
			// Lease will expire if not renewed in time
		}
	}

	// Check if droplet is available for new EC2 launches
	isAvailableForLaunch(
		dropletId: string
	): boolean {
		const lease = this.leases.get(dropletId);
		if (!lease) return false;

		// Lease must be active and not expired
		return (
			lease.status === "active" &&
			lease.expiryTime > Date.now()
		);
	}

	// Handle expired leases
	onLeaseExpired(dropletId: string): void {
		console.warn(
			`Lease expired for ${dropletId}`
		);
		// Mark droplet as unavailable for new instance launches
		this.markUnavailable(dropletId);
	}
}

Why Leases Matter

  1. State Coordination: Ensures DWFM knows which servers are available
  2. Failure Detection: Expired leases indicate unhealthy or unreachable servers
  3. Resource Management: Only droplets with active leases can host new EC2 instances (see the usage sketch below)
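
To tie these points together, here is a minimal usage sketch built on the conceptual DropletWorkflowManager class above. The placeInstance helper and droplet IDs are hypothetical, not AWS’s API; the point is simply that placement of new instances is gated on an active, unexpired lease.

// Hypothetical placement helper using the conceptual class above
const dwfm = new DropletWorkflowManager();

async function placeInstance(
	candidateDroplets: string[]
): Promise<string> {
	// Keep leases fresh so droplets stay eligible for placement
	for (const dropletId of candidateDroplets) {
		await dwfm.renewLease(dropletId);
	}

	// Only droplets with an active, unexpired lease may host new instances
	const eligible = candidateDroplets.filter((id) =>
		dwfm.isAvailableForLaunch(id)
	);

	if (eligible.length === 0) {
		// Effectively the "insufficient capacity" error customers
		// saw while leases were expired during the outage
		throw new Error(
			"Insufficient capacity: no droplets with active leases"
		);
	}

	return eligible[0]; // simplified: pick the first eligible droplet
}

During the outage, renewLease kept failing because its DynamoDB-backed state check was unreachable, so the filter above would eventually return an empty list even though the physical servers themselves were healthy.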

The Outage Scenario

Normal Operation:
┌──────────────────────────────────────────────────────┐
│ DWFM → Renew Lease → DynamoDB → Success              │
│ (every few minutes)                                  │
│ Droplets available for launches                      │
└──────────────────────────────────────────────────────┘

During Outage:
┌──────────────────────────────────────────────────────┐
│ DWFM → Renew Lease → DynamoDB → DNS Failure          │
│ Cannot update lease state                            │
│ Lease expires → Droplets marked unavailable          │
│ "Insufficient capacity" errors for new launches      │
└──────────────────────────────────────────────────────┘

After DynamoDB Recovery (Thundering Herd):
┌──────────────────────────────────────────────────────┐
│ DWFM attempts to renew 100,000+ leases simultaneously│
│ Queue overload → Processing delays                   │
│ Renewal attempts timeout before completing           │
│ Leases re-expire → More work queued                  │
│ Retries pile up faster than completions              │
│ → Congestive collapse                                │
│ Solution: Selective DWFM restarts + throttling       │
│           Cleared queues and restored progress       │
└──────────────────────────────────────────────────────┘

Cascading Failures

Service Dependency Chain

DynamoDB Root Cause and Chain of Failures
┌────────────────────────────────────────────────────────────┐
│                   Layer 0: Root Cause                      │
│  ┌──────────────────────────────────────────────────┐      │
│  │         DynamoDB DNS Race Condition              │      │
│  │    Incorrect empty DNS record published          │      │
│  │    dynamodb.us-east-1.amazonaws.com unreachable  │      │
│  └──────────────┬───────────────────────────────────┘      │
└─────────────────┼──────────────────────────────────────────┘


┌────────────────────────────────────────────────────────────┐
│               Layer 1: Direct Dependencies                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │    EC2      │  │   Lambda    │  │   Redshift  │         │
│  │ DWFM leases │  │ Function ops│  │  Query ops  │         │
│  │   expire    │  │   blocked   │  │   blocked   │         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘         │
└─────────┼────────────────┼────────────────┼────────────────┘
          │                │                │
          ▼                ▼                ▼
┌────────────────────────────────────────────────────────────┐
│         Layer 2: Secondary Effects (Thundering Herd)       │
│  ┌──────────────────────────────────────────────────┐      │
│  │ EC2 DWFM: Massive backlog of lease renewals      │      │
│  │ → Congestive collapse                            │      │
│  │ → Blocks new instance launches                   │      │
│  │ → Network Manager backlog (6:21-10:36 AM)        │      │
│  │    New instances launch but lack connectivity    │      │
│  └──────────────┬───────────────────────────────────┘      │
└─────────────────┼──────────────────────────────────────────┘


┌────────────────────────────────────────────────────────────┐
│               Layer 3: Dependent Services                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │     NLB     │  │    ECS      │  │   Lambda    │         │
│  │ Health check│  │ Container   │  │  Execution  │         │
│  │   failures  │  │   launches  │  │  throttled  │         │
│  │  (network   │  │   blocked   │  │  (priority  │         │
│  │   delays)   │  │             │  │   to sync)  │         │
│  │ Mitigated:  │  │             │  │             │         │
│  │  Disabled   │  │             │  │             │         │
│  │  auto AZ    │  │             │  │             │         │
│  │  failover   │  │             │  │             │         │
│  │  @ 9:36 AM  │  │             │  │             │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
└────────────────────────────────────────────────────────────┘

Key Concepts Explained

1. Thundering Herd Problem

When a resource becomes available after being unavailable, all waiting processes try to access it simultaneously, overwhelming the system.

// Example: What happened in DWFM after DynamoDB recovered
class ThunderingHerdExample {
	async demonstrate() {
		// Simulate the leases that expired during the DynamoDB outage
		const expiredLeases: { id: string }[] = [];
		for (let i = 0; i < 100000; i++) {
			expiredLeases.push({ id: `droplet-${i}` });
		}

		// DynamoDB comes back online...

		// BAD: fire every renewal at once
		// await Promise.all(
		// 	expiredLeases.map((lease) => this.renewLease(lease.id))
		// );
		// Result: system overload, timeouts, even more queued work

		// GOOD: rate-limited batches with jitter and backoff
		const BATCH_SIZE = 100;
		let retryDelay = 100; // start with 100ms

		for (
			let i = 0;
			i < expiredLeases.length;
			i += BATCH_SIZE
		) {
			const batch = expiredLeases.slice(
				i,
				i + BATCH_SIZE
			);

			// Add jitter to prevent synchronized retries
			const jitter = Math.random() * 50; // 0-50ms random delay

			await Promise.all(
				batch.map((lease) =>
					this.renewLease(lease.id)
				)
			);

			// Queue-size-based rate limiting (AWS's planned fix)
			const queueSize = await this.getQueueSize();
			if (queueSize > 10000) {
				// Exponential backoff while the queue is large
				retryDelay = Math.min(
					retryDelay * 1.5,
					5000
				); // cap at 5s
			} else {
				retryDelay = 100; // reset when the queue drains
			}

			await this.sleep(retryDelay + jitter);
		}
	}

	// Stubs standing in for the real DWFM and queue integrations
	private async renewLease(_dropletId: string): Promise<void> {}

	private async getQueueSize(): Promise<number> {
		return 0;
	}

	private sleep(ms: number): Promise<void> {
		return new Promise((resolve) => setTimeout(resolve, ms));
	}
}

2. Congestive Collapse

When a system tries to process more work than it can handle, processing delays increase, leading to more work being queued, which further exacerbates the problem.

What happened with DWFM: Retries piled up because processing took so long that in-flight leases re-expired before completion, creating more work than could be drained.

Normal: [Request] → [Process] → [Done] (5ms)
        [Request] → [Process] → [Done] (5ms)

Congested:
[Requests] → [Queue grows into the 1000s] → [Processing at max capacity]
     ↑                                                         │
     └───────────────── More requests queued ──────────────────┘
                        (leases re-expire during processing delays,
                         creating work faster than it can be drained)

Result: The queue grows without bound and response times explode
Solution: Restart DWFM hosts to clear queues + throttle incoming work
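
To make this feedback loop concrete, here is a toy queue simulation. It is not AWS code: the single queue, the tick-based model, and all the numbers are illustrative assumptions. The key rule is that a renewal which waits longer than the lease timeout is wasted work, because the lease has already re-expired and a retry goes back on the queue.

// Toy congestive-collapse model: stale completions turn into retries
function simulateCollapse(
	initialBacklog: number, // renewals queued up during the outage
	arrivalsPerTick: number, // new renewals arriving each tick
	serviceRatePerTick: number, // renewals DWFM can process each tick
	leaseTimeoutTicks: number, // max wait before a renewal goes stale
	ticks: number
): number[] {
	const queue: { enqueuedAt: number }[] = [];
	for (let i = 0; i < initialBacklog; i++) {
		queue.push({ enqueuedAt: 0 });
	}

	const depths: number[] = [];
	for (let t = 0; t < ticks; t++) {
		// New renewals arrive
		for (let i = 0; i < arrivalsPerTick; i++) {
			queue.push({ enqueuedAt: t });
		}

		// Process up to serviceRatePerTick renewals from the front
		const served = queue.splice(0, serviceRatePerTick);
		for (const item of served) {
			if (t - item.enqueuedAt >= leaseTimeoutTicks) {
				// Waited too long: the lease re-expired mid-flight,
				// so the completion is wasted and a retry is queued
				queue.push({ enqueuedAt: t });
			}
			// Otherwise the renewal succeeded and the item is done
		}

		depths.push(queue.length);
	}
	return depths;
}

// Service capacity (220/tick) exceeds steady-state arrivals (200/tick),
// yet after a brief dip the outage backlog pushes waits past the timeout,
// every completion turns into a retry, and the depth climbs each tick:
console.log(simulateCollapse(5000, 200, 220, 3, 30));

Run it with initialBacklog set to 0 and the same system is perfectly stable; it is the backlog accumulated during the DynamoDB outage that tips it over, which is why clearing the queues (the selective DWFM restarts) plus throttling was the way out.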

3. DNS Management Systems

Large distributed systems use DNS for service discovery and load balancing. Managing thousands of load balancer IPs requires sophisticated automation.

Traditional Service:
┌─────────┐
│ Client  │──────▶ [dynamodb.us-east-1.amazonaws.com]
└─────────┘       ┌──────────────────────────────────┐
                  │ Route 53 DNS Resolver            │
                  │ Returns single IP: 52.94.1.1     │
                  └──────────────────────────────────┘

AWS's Approach (Hundreds of IPs):
┌─────────┐
│ Client  │──────▶ [dynamodb.us-east-1.amazonaws.com]
└─────────┘       ┌──────────────────────────────────┐
                  │ Route 53 DNS Resolver            │
                  │ Returns 50+ IPs for load sharing │
                  │ 52.94.1.1, 52.94.1.2, 52.94.1.3  │
                  │ 52.94.1.4, 52.94.1.5, ...        │
                  │    ... 52.94.1.50                │
                  └──────────────────────────────────┘

                  ┌──────────────────────────────────┐
                  │ Load Balancer Pool               │
                  │ (Hundreds of NLBs across AZs)    │
                  └──────────────────────────────────┘

Note: In practice, Route 53 answers and LB pools are more complex
      (weighted, health-checked, versioned). Numbers simplified for clarity.
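
You can see the shape of these answers yourself with Node’s built-in resolver. A small sketch, with the caveat that the exact number of A records returned for the endpoint varies by resolver, location, and moment in time:

// List the A records behind the regional DynamoDB endpoint.
// The "single" hostname fans out to a pool of load balancer IPs.
import { Resolver } from "node:dns/promises";

async function showEndpointIPs(hostname: string): Promise<void> {
	const resolver = new Resolver();
	const addresses = await resolver.resolve4(hostname);
	console.log(`${hostname} -> ${addresses.length} A record(s)`);
	for (const ip of addresses) {
		console.log(`  ${ip}`);
	}
}

showEndpointIPs("dynamodb.us-east-1.amazonaws.com").catch(
	console.error
);

During the incident window, this lookup effectively had nothing to return: the regional record was empty, so clients resolving the endpoint got a failed lookup rather than a pool of healthy load balancer IPs.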

AWS’s Planned Fixes

1. DynamoDB DNS System

// Before (Race condition bug)
class DNSEnactor {
	async cleanupOldPlans(latestPlanId: number) {
		// Deletes any plan significantly older
		// Doesn't check if plan is currently being applied
		const plans = await this.getAllPlans();
		for (const plan of plans) {
			if (plan.version < latestPlanId - 3) {
				await this.deletePlan(plan.version);
			}
		}
	}
}

// After (Partial Fix - still has a subtle race condition!)
class FixedDNSEnactor {
	async cleanupOldPlans(latestPlanId: number) {
		const plans = await this.getAllPlans();
		for (const plan of plans) {
			if (plan.version < latestPlanId - 3) {
				// Check if not actively being applied
				const isActive =
					await this.checkIfBeingApplied(
						plan.version
					);
				// RACE CONDITION: Between this check and the delete below,
				// another enactor could start applying this plan!
				// This is a TOCTOU (Time-of-Check to Time-of-Use) bug
				if (!isActive) {
					await this.deletePlan(plan.version);
				}
			}
		}
	}
}

// Better Fix: Atomic check-and-delete operation
class BetterDNSEnactor {
	async cleanupOldPlans(latestPlanId: number) {
		const plans = await this.getAllPlans();
		for (const plan of plans) {
			if (plan.version < latestPlanId - 3) {
				// Atomically delete only if not in use
				// This combines the check and delete into a single atomic operation
				const deleted =
					await this.deleteIfNotInUse(
						plan.version
					);
				if (deleted) {
					console.log(
						`Cleaned up plan ${plan.version}`
					);
				}
			}
		}
	}

	// Atomic operation: check and delete in one transaction
	async deleteIfNotInUse(
		planVersion: number
	): Promise<boolean> {
		// This would be implemented as a single database transaction or
		// using conditional delete with DynamoDB's ConditionExpression
		// Example: DELETE WHERE version = X AND inUse = false
		return await this.db.conditionalDelete({
			version: planVersion,
			condition: { inUse: false },
		});
	}
}

Important: Even the “partial fix” above has a subtle TOCTOU (Time-of-Check to Time-of-Use) race condition between checking if a plan is active and deleting it. In a multi-threaded environment with multiple DNS Enactors, another enactor could start applying a plan in the tiny window between the check and the delete.

The better approach is to use atomic operations that combine the check and delete into a single transaction. In AWS’s case, this might use DynamoDB’s conditional writes, distributed locks (like with Redis or ZooKeeper), or database transactions that guarantee the check and delete happen atomically.

This is a great example of how fixing one race condition can introduce another if you’re not careful. Distributed systems are hard!
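
As one concrete shape for that atomic operation, DynamoDB’s own conditional writes can express “delete this plan only if nothing marks it as in use” as a single server-side request. The table layout, names, and the inUse flag below are hypothetical, purely to illustrate the pattern; a ConditionalCheckFailedException just means another enactor got there first and the plan is left alone.

// Hypothetical: conditional delete of a plan record using DynamoDB itself
import {
	DynamoDBClient,
	DeleteItemCommand,
	ConditionalCheckFailedException,
} from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({ region: "us-east-1" });

async function deletePlanIfNotInUse(
	planVersion: number
): Promise<boolean> {
	try {
		await client.send(
			new DeleteItemCommand({
				TableName: "dns-plans", // hypothetical table
				Key: { version: { N: planVersion.toString() } },
				// The check and the delete happen atomically on the server:
				// the item is removed only if no enactor has marked it in use
				ConditionExpression:
					"attribute_not_exists(inUse) OR inUse = :notInUse",
				ExpressionAttributeValues: {
					":notInUse": { BOOL: false },
				},
			})
		);
		return true; // deleted
	} catch (err) {
		if (err instanceof ConditionalCheckFailedException) {
			return false; // plan is being applied; leave it for a later pass
		}
		throw err;
	}
}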

2. NLB Velocity Controls

Limit how quickly health check failures can remove capacity to prevent overreaction to temporary issues.
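
A sketch of what such a velocity control might look like (the budget, window, and names here are invented, not AWS’s implementation): capacity removal is rationed per time window, so a burst of failing health checks can only take a bounded fraction of targets out of service before a slower control loop or a human gets involved.

// Conceptual velocity control: cap how much capacity health-check
// failures may remove per time window. Thresholds are illustrative.
class CapacityRemovalGovernor {
	private removedInWindow = 0;
	private windowStart = Date.now();

	constructor(
		private totalTargets: number,
		private maxRemovalFraction = 0.2, // at most 20% per window (assumed)
		private windowMs = 5 * 60 * 1000 // 5-minute window (assumed)
	) {}

	// Called when health checks want to take a target out of service
	tryRemoveTarget(targetId: string): boolean {
		const now = Date.now();
		if (now - this.windowStart > this.windowMs) {
			// New window: reset the removal budget
			this.windowStart = now;
			this.removedInWindow = 0;
		}

		const budget = Math.floor(
			this.totalTargets * this.maxRemovalFraction
		);
		if (this.removedInWindow >= budget) {
			// Budget exhausted: keep serving from this target and
			// escalate instead of shrinking capacity further
			console.warn(
				`Velocity control: not removing ${targetId}, ` +
					`${this.removedInWindow}/${budget} removals this window`
			);
			return false;
		}

		this.removedInWindow++;
		return true; // proceed with removal
	}
}

During the outage the mitigation was blunter (automatic AZ failover was simply disabled at 9:36 AM); a velocity control like this aims to make that kind of dampening automatic and proportionate.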

3. EC2 Improvements

  • Enhanced testing for DWFM recovery workflows
  • Better throttling mechanisms for data propagation queues
  • Queue-size-based rate limiting (see the sketch below)
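
The queue-size-based idea was already sketched from the client side in the thundering herd example above; here is the complementary view from the queue itself, where new work is refused once a high-water mark is hit so the backlog stays bounded. The class, mark, and delays are assumptions for illustration, not AWS internals.

// Conceptual queue-size-based throttling for a propagation queue
class PropagationQueue<T> {
	private items: T[] = [];

	constructor(private highWaterMark = 10000) {}

	// Producers must check the result and back off when refused
	offer(item: T): boolean {
		if (this.items.length >= this.highWaterMark) {
			// Shed load instead of letting the backlog grow without bound
			return false;
		}
		this.items.push(item);
		return true;
	}

	poll(): T | undefined {
		return this.items.shift();
	}

	get depth(): number {
		return this.items.length;
	}
}

// Producer side: retry with capped exponential backoff plus jitter
async function enqueueWithBackoff<T>(
	queue: PropagationQueue<T>,
	item: T
): Promise<void> {
	let delay = 100;
	while (!queue.offer(item)) {
		await new Promise((resolve) =>
			setTimeout(resolve, delay + Math.random() * 50)
		);
		delay = Math.min(delay * 2, 5000); // cap at 5s
	}
}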

Lessons Learned

Note: All examples are written in TypeScript / Node.js for learning purposes only. Do not use this code in production without understanding atomic operations.

Most of the race conditions, thundering herd effects, and other cascading issues described above only become visible in high-scale systems, and even more so at AWS’s scale.

I’ve only seen the thundering herd problem in production once, when I was on-call at DAZN. Maybe that’s a blog post for another day!

What have we learned?

Race conditions. You can build the most sophisticated system in the world, but if two pieces of code can step on each other’s toes at the wrong time, it can cause unexpected and unknown issues. The DynamoDB DNS issue was a subtle timing bug that only appeared under specific conditions.

The thundering herd problem. When DynamoDB came back online, DWFM tried to renew 100,000+ leases all at once. The system collapsed under its own recovery attempts. Jitter, backoff, and queue-size-based throttling are common techniques to counter this.

Not everything can be automated (can it?). The automation failed to fix the DNS issue. Engineers had to step in, disable systems, restart services, and manually restore things.

Cascading failures. DynamoDB broke, which broke EC2 leases, which triggered a thundering herd, which overloaded Network Manager, which broke NLB health checks. Each failure created the conditions for the next one.

Complexity. AWS runs a large-scale, complex DNS management system with planners and enactors running across multiple Availability Zones for resilience. Even so, the race condition created an invalid state that the system couldn’t recover from.