An incident in our cloud provider's networking layer resulted in errors and increased latency for ElevenLabs API requests between 02:36 UTC and 04:02 UTC.
The network connectivity issues caused some API requests to fail due to packet loss, and part of our caching layer became unhealthy after being overloaded by a surge of retries from the failing requests.
Our automated alerts were triggered by the increased error rate, and our engineering team began investigating. They traced the cause to the cloud provider outage and identified the impacted cache.
Our team was able to partially remediate the issue at 03:33 UTC, reducing error rates significantly from 41% to 8%, by restoring functionality to the cache that had been impacted by the network issues. The elevated error rates were fully resolved once our cloud provider restored network services.
Timeline:
02:36 UTC - Error rate starts to increase for text-to-speech (TTS) routes
02:41 UTC - Automated alerts paged our engineering team
03:11 UTC - Status page updated with communications to users
03:33 UTC - Remediative action on our caches significantly improves error rate from 41% to 8%
04:02 UTC - Incident resolved fully and TTS error rates back to normal
Learnings:
We are modifying our incident management processes to ensure we can communicate issues through our status page more quickly.
We have already made resiliency improvements to our caching layer by modifying our retry strategy, reducing the impact of similar events in the future.
We are pushing our cloud service provider for better visibility and communication during outage periods.
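The report does not specify the exact retry changes, but a common way to stop failed requests from overloading a backend is capped exponential backoff with jitter and a bounded attempt count. A minimal sketch in Python (the function names and parameters here are illustrative, not our actual implementation):

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a failing call with capped exponential backoff and full jitter.

    Bounding the number of attempts and randomizing the delay prevents a
    synchronized retry storm from overwhelming an already-degraded cache.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Full jitter: sleep a random duration up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))


# Example: a flaky call that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "ok"

print(call_with_backoff(flaky))  # prints "ok" after two retried failures
```

With full jitter, concurrent clients spread their retries across the backoff window instead of hammering the service in lockstep, which is what turned this network blip into a cache overload.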