Incident Details:
This afternoon an infrastructure issue within our Storage caching layer led to an increase in latency from our TTS service. This will have impacted all TTS calls made between 12:40 UTC and 16:40 UTC, increasing latency significantly for Turbo requests.
Timeline:
12:40 UTC Storage layer starts to exhibit issues, TTS latency begins to slowly increase
13:20 UTC TTS latency continues to increase further and triggers our automated alert, paging our on call engineer, starting the investigation
13:41 UTC Latest release reverted in an attempt to remediate
13:44 UTC Status page updated with incident, notifying customers.
14:59 UTC Suspected issue identified and hotfix released
15:21 UTC Despite initial improvements, it was determined that the hotfix didn’t fully remediate
15:40 UTC Identified issue in our storage layer and remediation actions started
16:40 UTC Remediation completed and normal service resumed
Learnings:
We are reviewing our automated alerting thresholds to ensure our engineers begin investigations more promptly
We are reviewing our incident management process to ensure we can communicate issues in a more timely fashion through our status page.
We are reviewing our internal testing and validation for similar releases affecting or impacting the storage caching system.