As part of our work to improve platform reliability, we have been distributing our systems across multiple cloud provider zones and regions. To ensure that users are not disrupted, we have been testing these changes across a variety of scenarios so that the work can be carried out without downtime.
One of the reliability changes involves increasing the availability of replicas of certain systems, including the caching layers and the priority queueing system. While we were increasing the availability of these two systems, an unintentional change disrupted the existing replicas and caused an outage.
To prevent similar incidents from recurring, we're working on the following changes:
Reduced dependency on the caching layers, so that cache misses only increase latency without disrupting availability
Redundancy in the priority queueing system to avoid outages when one of the queues goes down
Additional checks to our infrastructure configuration to avoid misconfigurations
Additional testing of configurations across different environment types
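As an illustration of the first item, a read path can treat an unreachable cache the same way as a cache miss and fall back to the source of truth. The sketch below is a minimal example under stated assumptions, not a description of our actual implementation: the `Cache` interface, the `CacheUnavailable` error, and the `read_through` function are all hypothetical names introduced for this example.

```python
from typing import Callable, Optional, Protocol


class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache cannot be reached."""


class Cache(Protocol):
    """Hypothetical minimal cache interface assumed for this sketch."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


def read_through(
    key: str,
    cache: Cache,
    load_from_source: Callable[[str], str],
) -> str:
    """Return the value for key, using the cache only as an optimization."""
    try:
        cached = cache.get(key)
    except CacheUnavailable:
        # A cache outage is handled like a miss: slower, but still available.
        cached = None

    if cached is not None:
        return cached

    # Miss or outage: fall back to the source of truth.
    value = load_from_source(key)

    try:
        cache.set(key, value)
    except CacheUnavailable:
        # Failing to repopulate the cache must not fail the request.
        pass

    return value
```

With this shape, a failing cache replica costs one extra lookup against the source per request instead of an error returned to the user. It does assume that the source can absorb the additional load while the cache is down.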