Incident Details:
Customers using isolated environments (EU or IN) together with consolidated billing (i.e., customers with a second workspace, either in the US or in the same region, with billing linked across the workspaces) who had usage between 13:05 UTC and 14:05 UTC were unable to access the ElevenLabs platform and received 500 errors. The cause was a bug introduced in the way usage is reported across environments.
We identified the issue at 13:29 UTC and began investigating. At 14:05 UTC, we reverted the change that had introduced the bug, which limited the number of impacted users. We then began remediating affected customers. Remediation ran from 14:50 UTC to 15:10 UTC, incrementally returning users to normal service, and all customers had full service restored by 15:10 UTC.
This incident did not impact any customer without the combination of consolidated billing and isolated data residency workspaces (EU + IN). Customers using only our US environment (elevenlabs.io), and customers with an environment in only one of the isolated data residency regions, were not impacted.
Timeline
Wed Aug 27 13:05 UTC Error released to isolated environments
Wed Aug 27 13:29 UTC ElevenLabs identified the issue and began investigating
Wed Aug 27 14:05 UTC Bug reverted, limiting impact to only those customers with usage in the preceding hour
Wed Aug 27 14:50 UTC Remediation begun to correct impacted workspaces
Wed Aug 27 15:10 UTC All workspaces across all regions remediated; monitoring continued for additional impact
Wed Aug 27 15:21 UTC Incident declared resolved
Learnings and Improvements:
We are auditing our automated alerts on the isolated environments (EU + IN) to ensure that threshold values are better tuned to the expected usage in these regions, so that we can identify issues more quickly.
We are improving our code review process for tests. Although we have end-to-end tests for the impacted billing flow, an oversight in our review process meant that changes made to this test to account for new functionality left a gap in our coverage.