On September 12, 2025, Cloudflare experienced a significant self-inflicted outage that resembled a distributed denial-of-service (DDoS) attack, caused by a bug in its own dashboard code. The root cause was a coding error involving a React useEffect hook in the dashboard frontend. Engineers mistakenly included a problematic object in the hook's dependency array; because the object was recreated on every render, React's reference comparison treated it as always new, causing the effect to run repeatedly during a single dashboard render instead of once.
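The following is a minimal sketch of this class of bug, not Cloudflare's actual code; the component, endpoint, and parameter names are hypothetical. It shows how an object literal placed in a useEffect dependency array defeats React's reference-equality check and causes the effect, and its API call, to fire on every render.

```tsx
import { useEffect, useState } from "react";

// Hypothetical component illustrating the bug class described above.
function TenantPermissions({ accountId }: { accountId: string }) {
  const [permissions, setPermissions] = useState<string[]>([]);

  // BUG: this object is recreated on every render, so it is never
  // reference-equal to the value from the previous render.
  const params = { accountId, scope: "dashboard" };

  useEffect(() => {
    // Each spurious re-run of the effect issues another backend API call.
    fetch(`/api/tenants/${params.accountId}/permissions?scope=${params.scope}`)
      .then((res) => res.json())
      .then((data: string[]) => setPermissions(data)); // triggers a re-render
  }, [params]); // Object dependency is compared by reference: always "new".

  return <ul>{permissions.map((p) => <li key={p}>{p}</li>)}</ul>;
}
```

Because the state update triggers another render, which recreates the object, which re-runs the effect, the component issues API calls in a loop. The standard fix is to depend on primitive values (here, `[accountId]`) or to memoize the object with useMemo so its reference stays stable between renders.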
This flaw led to a flood of repeated API calls to Cloudflare's Tenant Service API, a critical backend component responsible for authorizing API requests. The surge overwhelmed the Tenant Service, causing it to fail, and the failure rippled out to other APIs and to the dashboard itself. With the Tenant Service down, authorization for API requests could not be evaluated, resulting in widespread 5xx server errors.
The timeline of the outage began at 16:32 UTC, when a new version of the Cloudflare Dashboard containing the bug was released. At 17:50 UTC, a new version of the Tenant Service API was deployed amid the dashboard issues. By 17:57 UTC, the Tenant Service was overwhelmed and dashboard availability dropped sharply. A mitigation at 18:17 UTC that added resources improved API availability to 98%, but the dashboard remained down. At 18:58 UTC, a faulty update intended to fix the errors made the outage worse. That change was rolled back at 19:12 UTC, restoring full dashboard functionality.
Cloudflare engineers implemented emergency measures, including rate limiting and scaling up the Kubernetes pods running the Tenant Service, to manage the traffic load, but found these insufficient on their own to resolve the problem. Because the outage was confined to the control plane, core Cloudflare network services such as content delivery and security were unaffected; the impact was limited to customers using the dashboard and configuration APIs.
Cloudflare's post-incident plans include migrating the Tenant Service to Argo Rollouts to enable automated detection of deployment errors and automatic rollback, which could have limited how long the faulty update worsened the outage. Other planned improvements include introducing random delays (jitter) into dashboard retry attempts to avoid thundering-herd effects, adding Tenant Service capacity to absorb spikes, and improving monitoring and alerting to catch similar incidents earlier.
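Cloudflare has not published its retry policy, but the jitter idea can be illustrated with a generic pattern: exponential backoff with a randomized delay, so clients retrying after a shared failure do not all hit the backend at the same instant. The function below is a hypothetical sketch, not Cloudflare's implementation.

```ts
// Retry a request with exponential backoff plus "full jitter": each wait is
// a random duration between 0 and an exponentially growing cap, which spreads
// retries out over time and avoids a thundering herd.
async function fetchWithJitter(
  url: string,
  maxAttempts = 5,
  baseDelayMs = 250
): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url);
    if (res.ok) return res;

    // Cap grows exponentially: 250 ms, 500 ms, 1 s, 2 s, ...
    const cap = baseDelayMs * 2 ** attempt;
    const delay = Math.random() * cap; // random point within the cap
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error(`Request to ${url} failed after ${maxAttempts} attempts`);
}
```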
The company is also changing how API call metadata is sent from the dashboard so that retries can be distinguished from new requests, which should make future issues faster to diagnose. Cloudflare apologized for the disruption and committed to thorough investigation and system improvements to prevent a recurrence.
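One common way to carry such metadata, shown here purely as an assumption since Cloudflare has not described its format, is to tag every attempt with a stable request ID and an attempt counter so the backend can tell a retry apart from a brand-new request. The header names below are illustrative, not Cloudflare's actual API.

```ts
// Hypothetical sketch: correlate retries of one logical request via headers.
async function fetchWithRetryMetadata(
  url: string,
  maxAttempts = 3
): Promise<Response> {
  const requestId = crypto.randomUUID(); // same ID reused across all attempts

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetch(url, {
      headers: {
        "x-client-request-id": requestId, // ties retries to one logical request
        "x-retry-attempt": String(attempt), // 1 = original request, >1 = retry
      },
    });
    if (res.ok) return res;
  }
  throw new Error(`Request ${requestId} failed after ${maxAttempts} attempts`);
}
```

With this kind of tagging, backend metrics can separate genuinely new traffic from retry amplification, which is exactly the distinction that would speed up diagnosing an incident like this one.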
Overall, this incident highlights the complex interplay between frontend bugs and backend service stability, illustrating how a seemingly small client-side coding error can cascade into large-scale API and service disruption in a critical web infrastructure provider.