Website Slowness and API Request Latency
Partial outage

Incident Root Cause Analysis: ElevenLabs Service Degradation (2026.02.25)

Summary of Incident

On February 25th, 2026, between 13:55 UTC and 18:25 UTC, ElevenAgents and ElevenCreative users experienced service degradation, characterized by elevated error rates and general slowness. This primarily affected website requests and a subset of API routes. While Text-to-Speech (TTS), Speech-to-Text (STT), and ElevenAgents conversations remained functional, users experienced intermittent failures when trying to initiate new ElevenAgents conversations via WebRTC.

Revision March 2, 2026: A subset of dubs submitted during the affected period may have been (a) rejected on submission, or (b) reported as failed with a 'timed out' status. All affected dubs have been refunded.

Detection and Initial Investigation

The issue was initially flagged by a noticeable increase in error rates. The reported errors pointed to database timeouts originating from our cloud provider, leading the on-call engineer to open a high-priority ticket with the provider and escalate the investigation internally. In parallel, our engineers worked on identifying the root cause by reviewing telemetry and recent changes.



Root Cause

The database timeouts proved to be a red herring. The actual cause was a long-standing latent bug: resource-intensive analytics dashboard queries could intermittently block the backend from processing other requests. These queries were normally fast and had caused no prior impact, but during the incident they slowed dramatically as database resources became insufficient. The increased blocking ultimately produced the observed system slowness and the subsequent database timeouts.
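The failure mode can be illustrated with a small sketch (pool size, timings, and function names are hypothetical, not our actual backend): a handful of slow queries occupying a fixed-size worker pool is enough to make unrelated, otherwise-fast requests time out.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutTimeout

POOL_SIZE = 4  # hypothetical fixed-size backend worker pool

def slow_analytics_query():
    time.sleep(2.0)   # degraded analytics query holding a worker
    return "analytics"

def fast_request():
    time.sleep(0.01)  # an ordinary, fast backend request
    return "ok"

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

# Slow analytics queries claim every worker in the pool...
blockers = [pool.submit(slow_analytics_query) for _ in range(POOL_SIZE)]

# ...so an unrelated request queues behind them and its caller times out,
# which is what surfaced downstream as "database timeouts".
future = pool.submit(fast_request)
try:
    result = future.result(timeout=0.5)
except FutTimeout:
    result = "timed out"

print(result)  # -> timed out
```

The request itself is cheap; it fails only because every worker is held by analytics work, which is why the timeouts initially pointed away from the real cause.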



Remediation and Mitigation

Immediate Actions:

  • The underlying bug was fixed so that the analytics queries no longer block other backend requests.

  • Database resources were immediately increased to handle the query load.
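To picture the effect of the fix, here is an illustrative sketch only: the real fix removed the blocking behaviour in the query code itself, while this model achieves the same outcome by isolating analytics work on its own bounded pool (pool sizes and timings are invented for the example).

```python
import time
from concurrent.futures import ThreadPoolExecutor

request_pool = ThreadPoolExecutor(max_workers=4)    # serves user traffic
analytics_pool = ThreadPoolExecutor(max_workers=2)  # bounded analytics lane

def slow_analytics_query():
    time.sleep(0.5)   # still slow, but now contained
    return "analytics"

def fast_request():
    time.sleep(0.01)
    return "ok"

# A burst of slow analytics queries saturates only its own pool...
for _ in range(4):
    analytics_pool.submit(slow_analytics_query)

# ...while ordinary requests keep completing promptly.
result = request_pool.submit(fast_request).result(timeout=0.5)
print(result)  # -> ok
```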


Short-Term Monitoring:

  • Additional monitoring was implemented to provide alerts on both request blocking and database resource utilization for these specific queries.
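As a rough sketch of the kind of alert rule added (the function, thresholds, and percentile choice below are hypothetical, not our production values): fire when the tail latency of these queries or database utilization crosses a limit.

```python
# Hypothetical alert rule: flag high p95 latency for the analytics
# queries, or high database resource utilization. Thresholds are
# illustrative only.
def should_alert(latencies_ms, db_util_pct,
                 p95_limit_ms=500, util_limit_pct=80):
    ordered = sorted(latencies_ms)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    p95 = ordered[idx]
    return p95 > p95_limit_ms or db_util_pct > util_limit_pct

print(should_alert([20, 25, 30, 40, 900], 45))  # slow tail -> True
print(should_alert([20, 25, 30, 40, 45], 45))   # healthy   -> False
```

Alerting on the tail rather than the mean matters here, because the incident began with a small number of pathological queries while most traffic still looked healthy.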


Medium-Term Planning:

  • Engineering teams are currently exploring architectural changes to prevent this type of resource-contention issue from occurring in the future.


Incident Timeline (all times UTC, February 25, 2026)

13:55 - Internal monitoring detected an increase in query latency within backend services. Resource-intensive analytics queries began blocking other backend requests, triggering cascading timeouts across dependent services.

14:15 - Automated alerts fired as alerting thresholds were crossed. The on-call engineering team began investigating.

14:22 - A formal incident was opened and internal escalation procedures were initiated.

14:38 - Based on initial error signatures pointing to database timeouts, a high-priority case was opened with our infrastructure provider. Additional engineering resources were engaged to investigate in parallel.

15:04 - The incident was initially attributed to an infrastructure provider issue and the status page was updated accordingly. This was later determined to be incorrect; the root cause was internal.

15:37 - Engineering attempted multiple remediations, none of which improved latency or error rates. In parallel, our infrastructure provider continued to investigate.

16:40 - With the provider investigation inconclusive, our team continued pursuing internal causes in parallel. Engineers systematically explored multiple remediation paths, including resource scaling, rollbacks, and replica adjustments, while narrowing in on the true source of the degradation.

17:54 - The root cause was identified: resource-intensive analytics queries were blocking backend request processing, causing the cascading timeouts observed across services.

18:07 - A fix was implemented to stop the blocking queries from impacting backend services. Improvements began appearing within minutes.

18:25 - 18:47 - Services showed sustained recovery. The team continued to identify and address all affected query paths.

19:45 - Full service recovery was confirmed and the incident was resolved on the status page.