On October 20, 2025, LaunchDarkly experienced a widespread service disruption that impacted feature flag management, feature flag delivery, guarded releases, observability, experimentation, analytics, and other data features.
The incident began as a result of a major AWS us-east-1 outage and was compounded by internal cascading failures that extended LaunchDarkly’s recovery period.
At LaunchDarkly, our mission states:
Software powers the world. LaunchDarkly powers teams to release, measure, and control software safely and at scale.
During this disruption, we were unable to fully uphold our mission to help teams control their software safely at scale. We take that responsibility seriously and are committed to preventing a recurrence. Here is an overview of what happened during this disruption and what actions we’re taking to improve.
All times in this post are in Pacific Time (PT).
What Happened and What Caused It
The incident unfolded in two phases.
The first phase (October 19, 11:50 PM to October 20, 11:40 AM) began late on October 19, when AWS us-east-1 experienced a major outage.
Throughout that window, AWS services such as EC2, Lambda, DynamoDB, and the Route 53 control plane were degraded or unavailable.
LaunchDarkly depends on these services, and many of our own capabilities were affected:
- The LaunchDarkly web application and API in our US commercial environment became increasingly unstable and unable to scale out in response to increasing traffic, leading to gradual degradation and eventual unavailability. During this time, impacted customers could not log in to the web application or use our APIs to manage feature flags.
- Flag delivery updates were delayed or unavailable. Client-side streaming SDKs in our US region experienced the biggest impact and were unable to receive flag updates. Client-side streaming SDKs in the EU and APAC regions, server-side streaming SDKs globally, and all polling SDKs globally were operational with error rates at typical levels.
- Event ingestion gradually degraded, resulting in data loss for both the US and EU environments until 3:00 PM, when ingestion recovered and data loss stopped.
By 11:40 AM, as AWS’s own recovery progressed, LaunchDarkly’s web application and API returned to normal operation.
However, a second phase of disruption (October 20, 11:40 AM to October 21, 12:05 AM) began shortly after. At 11:48 AM, while we were stabilizing our systems, an internal change intended to reduce load on our web application triggered an unexpected failure in our flag delivery network, which distributes flag updates to SDKs. Specifically, we reverted to a legacy routing path with cold caches, which caused excessive retries from SDKs. The resulting traffic overwhelmed the streaming service and its load balancer, which then became unresponsive. Because ongoing EC2 provisioning issues prevented our infrastructure from scaling out, the outage was prolonged.
As a result, server-side SDKs across all regions experienced connection errors, with error rates reaching ~99% globally. The EU region recovered quickly, and APAC followed by mid-afternoon, but US-based streaming remained unavailable until late that night. By 12:05 AM on October 21, all streaming services had fully recovered, and by 2:45 AM we had resolved the public incident on our status page.
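To make the retry dynamic concrete, here is a generic sketch (in Python, not LaunchDarkly SDK code) of the capped exponential backoff with full jitter that reconnecting clients typically use: without this kind of spreading, a large fleet of clients retrying in lockstep against cold caches can overwhelm a recovering service and its load balancer.

```python
# Generic illustration only; this is not LaunchDarkly SDK code.
import random
import time


def reconnect_with_backoff(connect, base_delay=1.0, max_delay=60.0, max_attempts=10):
    """Retry `connect()` with capped exponential backoff and full jitter.

    Spreading retries over time keeps a "thundering herd" of clients from
    hammering a service that is still warming its caches.
    """
    for attempt in range(max_attempts):
        try:
            return connect()  # e.g., open a streaming connection
        except ConnectionError:
            # Sleep a random amount between 0 and the capped exponential delay.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    raise RuntimeError(f"could not reconnect after {max_attempts} attempts")
```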
Overall, across both phases, our US commercial environment was most affected, with degradation or unavailability of the main application, flag delivery failures, and event data loss. Our Federal environment experienced only moderate delays in flag delivery and some data loss that was later recovered. Our EU environment remained unaffected.
In summary, the AWS outage caused a widespread initial failure, and our own system design and recovery actions extended the duration of impact.
How We’re Improving
We’ve already made substantial improvements to our systems, and more are underway. Our goal is clear: to help ensure that a regional cloud outage cannot cause the same kind of prolonged disruption again.
1. Making Flag Delivery More Resilient
We have decoupled our Flag Delivery Network from the feature management application so that disabling update notifications will not impact SDK connections.
We’ve also scaled out load balancers in all regions for our flag delivery network to help ensure additional capacity.
Finally, we’re accelerating our ongoing migration to a new fault-tolerant delivery architecture, which kept server-side streaming stable during the initial AWS outage, before we reverted to the legacy routing path.
2. Improving SDK Behavior
In upcoming releases, LaunchDarkly SDKs and the Relay Proxy will support automatic failover from streaming to polling mode if streaming becomes unavailable. This means applications will be able to continue evaluating feature flags seamlessly during outages.
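Until that failover ships, polling can already be enabled explicitly at SDK initialization. The following is a minimal sketch using the Python server-side SDK; the `stream` and `poll_interval` options shown here may be named differently in other SDKs or versions, so treat this as illustrative and check your SDK's documentation.

```python
# Illustrative configuration: opt a server-side SDK into polling mode.
# Option names vary by SDK and version; consult the SDK docs for yours.
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config(
    sdk_key="YOUR_SDK_KEY",
    stream=False,      # disable streaming and fetch flag payloads by polling
    poll_interval=30,  # seconds between polls
))
client = ldclient.get()

# Flag evaluation is unchanged; only the transport for flag updates differs.
context = Context.builder("example-user-key").build()
enabled = client.variation("example-flag-key", context, False)
```

Polling trades update latency for fewer long-lived connections, which is why automatic streaming-to-polling failover is best suited as a degradation path rather than a permanent default.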
3. Strengthening Multi-Region Availability and Disaster Recovery
We are relocating all disaster recovery orchestration systems out of us-east-1.
We are investigating multi-region improvements across our most critical services as well as vendor services.
We are improving our quarterly DR testing to incorporate more gameday realism.
4. Communicating More Effectively During Incidents
While we helped many customers remain unimpacted or recover quickly, we’re updating our communications process to surface and share potential customer workarounds earlier.
Closing
We know that many of you rely on LaunchDarkly as critical infrastructure for your software delivery pipelines. If you were impacted, we deeply regret the disruption this event caused to your teams and customers.
Our engineering teams have already implemented multiple resiliency improvements and have additional work underway to prevent a recurrence. We’re continuing to address this by hardening our flag delivery network, expanding regional redundancy across critical services, and evolving our architecture to withstand regional failures in critical areas of our stack. We are committed to earning and maintaining your trust through transparency, continuous learning, and measurable reliability improvements.
Thank you for your patience and partnership.

