Cast Cloud Incident Summary – Oct 2017
As a result of outages with Cast's file host, between September 20th 2017 and October 14th 2017, a small number of Cast user recordings were not mixable or downloadable, and between October 4th and October 7th we were forced to place Cast into maintenance mode.
While we can't prevent outages at the third-party infrastructure vendors we rely upon to run Cast, we are deeply sorry about this outage, and we want to reiterate our commitment to making Cast a reliable service. In response, we are studying Cast's points of failure and working to reduce the impact and severity of any similar outage in the future.
What happened?
On September 20th 2017, our automatic monitoring systems detected that a small number of Cast user mixes and file downloads were failing to complete. We investigated the issue, determined that it was a problem with our file hosting provider, and reached out to them for more information. They indicated that the problem was an unintended side-effect of a data migration they had underway, and reported that they expected it to be resolved quickly. While the majority of the affected files recovered over the subsequent days, we continued to monitor a small number that remained affected.
On October 4th 2017, we detected a major disruption affecting all files stored with our hosting provider, and we put Cast into maintenance mode as a result. Despite initial indications from our hosting provider that the issue would be resolved shortly, their platform did not stabilize until October 7th, at which point we were able to take Cast out of maintenance mode and resume regular service. A small number of files from the initial and second outages remained unreachable, and we monitored them as they became reachable again over the following days.
How did we respond?
Following the initial outage on September 20th, we increased our system monitoring and worked around the clock to manually restore backup copies of user recordings where we had them. Cast runs periodic backups in which we take snapshots of recordings and store them with an additional hosting provider, and in nearly all cases we were able to restore unreachable files from these backups. In a few cases, user recordings had not yet been backed up at the time of the outage, and these specific recordings could not be mixed or downloaded until our file host completed its restoration.
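For the technically curious, the sketch below shows the general shape of a periodic backup pass like the one described above: find recordings that exist in the primary store but not in the off-site store, and copy them across. The client library (boto3 against S3-compatible storage), bucket names, and endpoint are illustrative assumptions, not details of Cast's actual setup.

```python
# Minimal sketch of a periodic off-site backup pass, assuming two
# S3-compatible object stores accessed via boto3. All names and
# endpoints below are hypothetical.
import boto3

primary = boto3.client("s3")  # primary file host
backup = boto3.client("s3", endpoint_url="https://backup-provider.example.com")

PRIMARY_BUCKET = "cast-recordings"         # hypothetical
BACKUP_BUCKET = "cast-recordings-backup"   # hypothetical


def list_keys(client, bucket):
    """Return the set of object keys currently in a bucket."""
    keys = set()
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys


def backup_new_recordings():
    """Copy recordings present in the primary store but missing from the
    backup store over to the backup provider."""
    already_backed_up = list_keys(backup, BACKUP_BUCKET)
    for key in list_keys(primary, PRIMARY_BUCKET) - already_backed_up:
        body = primary.get_object(Bucket=PRIMARY_BUCKET, Key=key)["Body"]
        backup.put_object(Bucket=BACKUP_BUCKET, Key=key, Body=body.read())


if __name__ == "__main__":
    backup_new_recordings()
```

Because a pass like this runs on a schedule, any recording created after the most recent pass exists only at the primary host until the next pass completes, which is the gap described in the mitigation section below.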
Following the acute outage on October 4th, we put Cast into maintenance mode to prevent data loss for new recordings. During this time we maintained frequent contact with our file host and ran extensive monitoring on their systems. Throughout the outage we set a target of providing multiple daily updates, whether or not we had new information to share. We also increased both the number of people providing customer support and our support hours, with our support desk running nearly around the clock through the week and the weekend.
What will we do to mitigate problems like this in the future?
While Cast generates off-site backups of recordings, these backups are captured periodically and don't represent the totality of Cast's file store. These backups were, instead, intended as insurance against things like an accidental recording deletion. Further, the periodic nature of these backups means that, at any given time, there exist new recordings that have not yet been backed up off-site.
Going forward, we will work to reduce the latency between recording completion and backup, reduce our reliance on a single file host, and build automatic redundancy into our file store. Further, we aim to build a hot standby for our file storage system, meaning that we can respond rapidly to an outage of this nature and resume regular service much more quickly, without relying on any specific vendor for service restoration.
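As a rough illustration of the "automatic redundancy" idea, the sketch below writes each completed recording to a primary store and a standby store at upload time, which removes the window where a new recording exists in only one place. It reuses the same hypothetical S3-compatible clients as the backup sketch above and is not a description of Cast's actual design.

```python
# Illustrative dual-write on recording completion, using hypothetical
# boto3 clients and bucket names. Not Cast's actual implementation.
import boto3

primary = boto3.client("s3")  # primary file host
standby = boto3.client("s3", endpoint_url="https://standby-provider.example.com")

PRIMARY_BUCKET = "cast-recordings"          # hypothetical
STANDBY_BUCKET = "cast-recordings-standby"  # hypothetical


def store_recording(key: str, data: bytes) -> None:
    """Persist a completed recording to the primary store and to a standby
    store, so a copy exists at two providers from the moment of upload."""
    primary.put_object(Bucket=PRIMARY_BUCKET, Key=key, Body=data)
    try:
        standby.put_object(Bucket=STANDBY_BUCKET, Key=key, Body=data)
    except Exception:
        # The primary copy succeeded; hand the key to whatever retry
        # mechanism (queue, table, cron job) handles deferred copies.
        queue_standby_retry(key)


def queue_standby_retry(key: str) -> None:
    """Placeholder for a deferred-retry mechanism."""
    print(f"standby write failed for {key}; will retry later")
```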