Write-up
Increased Text-to-Speech Latency

Incident Details:

This afternoon, an infrastructure issue in our storage caching layer increased latency in our TTS service. All TTS calls made between 12:40 UTC and 16:40 UTC were affected, with latency increasing significantly for Turbo requests.

Timeline:

  • 12:40 UTC Storage layer began to exhibit issues and TTS latency started to increase slowly

  • 13:20 UTC TTS latency continued to increase and triggered our automated alert, which paged our on-call engineer to start the investigation

  • 13:41 UTC Latest release reverted in an attempt to remediate

  • 13:44 UTC Status page updated with incident, notifying customers

  • 14:59 UTC Suspected issue identified and hotfix released

  • 15:21 UTC Despite initial improvements, it was determined that the hotfix did not fully remediate the issue

  • 15:40 UTC Issue identified in our storage layer and remediation actions started

  • 16:40 UTC Remediation completed and normal service resumed

Learnings:

  • We are reviewing our automated alerting thresholds to ensure our engineers begin investigations more promptly

  • We are reviewing our incident management process to ensure we can communicate issues in a more timely fashion through our status page

  • We are reviewing our internal testing and validation for similar releases affecting the storage caching system