Write-up
TTS Elevated Error Rate
Incident Report - 6th August 2025 

Incident Details:

A networking-layer incident at our cloud provider caused errors and increased latency for ElevenLabs API requests between 02:36 UTC and 04:02 UTC.

The network connectivity issues caused some API requests to fail because of packet loss. They also caused part of our caching layer to become unhealthy, as it was overloaded by a large volume of retries of failed requests.

The increased error rate triggered our automated alerts, and our engineering team began investigating. They traced the cause to the cloud provider outage and identified the impacted cache.

Our team partially remediated the issue at 03:30 UTC by restoring functionality to the cache that had been impacted by the network issues, which reduced the error rate from 41% to 8%. The elevated error rate was fully resolved once our cloud provider restored network services.

Timeline:

  • 02:36 UTC - Error rate starts to increase for TTS routes

  • 02:41 UTC - Automated alerts notified our engineering team

  • 03:11 UTC - Status page updated with communications to users

  • 03:33 UTC - Remedial action on our caches reduces error rate from 41% to 8%

  • 04:02 UTC - Incident fully resolved and TTS error rates back to normal

Learnings:

  • We are modifying our incident management processes so that we can communicate issues through our status page in a more timely fashion.

  • We have already improved the resiliency of our caching layer by modifying our retry strategy, to reduce the impact of such events in the future.

  • We are pushing our cloud service provider for better visibility and communication during outage periods.
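
The report does not describe the modified retry strategy in detail. As an illustration only, one common way to stop retries from overloading a dependency such as a cache is to cap the number of attempts and wait a randomized, exponentially growing delay between them. The sketch below assumes that approach; the names TransientError, backoff_delay and call_with_retries and all default values are hypothetical and are not taken from our production code.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Return a "full jitter" delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    op: Callable[[], T],
    max_attempts: int = 4,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op(), retrying on TransientError up to max_attempts times in total.

    The attempt limit bounds the extra load a single request can generate,
    and the jittered delay spreads retries out so that many clients do not
    retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the error to the caller
            sleep(backoff_delay(attempt, base, cap))
    raise ValueError("max_attempts must be at least 1")
```

A caller would wrap a cache lookup as call_with_retries(lambda: fetch_from_cache(key)), where fetch_from_cache is a placeholder for whatever function performs the lookup. The sleep parameter exists only so that the delay can be replaced in tests.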