Cast Cloud Incident (Oct 2-7, 2017)
We have written an incident summary on this outage, and have published it here.
Incident Update (Oct 17; 5pm ET): Our file host's error rates and performance continue to be within our expected ranges, but we're still waiting on word that they've completed their filesystem audit, which we understand is a lengthy process. Until that point, we'll continue our intensive monitoring of their systems. We're also working on our post-outage analysis, which we'll post once it's complete.
Incident Update (Oct 14; 2pm ET): While our file host's audit continues, our monitoring systems are now showing that there are no more outstanding files to be restored. We'll continue to monitor and update this issue until our host indicates that their audit has completed.
Incident Update (Oct 13; 3pm ET): Our file host continues running their post-migration audit, which is finding and moving into place any remaining files that cannot be reached. While we'd clearly hoped to have this completely concluded by now, we are seeing progress restoring the remaining few files, and are happy with the ongoing stability of the host.
Incident Update (Oct 12; 1pm ET): According to our most recent debrief with our file host, their migration process is complete and they are currently running an audit on the entire file store. We are monitoring the remaining handful of files that will be restored by this audit, as well as the overall health of the system, which has remained in our acceptable ranges for stability.
Incident Update (Oct 11; 1pm ET): While we've seen the restoration of nearly every unreachable file, we've yet to receive the green light from our file host that their repair operation and verification is complete. As always, we've asked for more information and are waiting on that.
Incident Update (Oct 10; 2pm ET): We're continuing to see those final files trickle back in, but – in keeping with everything else here – that process is taking longer than we'd hoped. That said, it does appear that we're in the home stretch here. For those asking: we will certainly provide a thorough debrief in the coming weeks on everything that transpired here, including what actions we'll be taking in future to avoid anything like this again.
Incident Update (Oct 9; 1pm ET): While we're indeed seeing the handful of outstanding files return, this process is taking longer than our file host had estimated. We've requested an updated timeline and more info about the delay.
Incident Update (Oct 8; 2pm ET): Cast has been out of maintenance mode and running well for 24hrs now. We are still waiting on our host's restoration of those last few files, but otherwise everything's been running as expected.
Incident Update (Oct 7; 5pm ET): Cast has been out of maintenance mode for a few hours now and generally seems to be running well. We're seeing slightly higher error rates than normal on our file host, but this is to be expected until their rebalancing and repair process concludes.
Incident Update (Oct 7; 2pm ET): We're currently taking the Cast site out of maintenance mode, which will see it return to nearly full functionality shortly. New recordings and the vast majority of preexisting recordings will function normally, but, for a small number of preexisting recordings, users will be unable to mix or download their files until our host completes restoration. They've estimated that this last fraction of data may take another day to fully restore. Please be gentle as the site comes back up.
Incident Update (Oct 6; 11:30pm ET): We're continuing to observe file reachability & stability return to normal, but we're not yet at 100% and don't want to declare victory prematurely. We're standing by, ready to turn the lights back on as soon as we get the green light.
Incident Update (Oct 6; 7pm ET): Continued monitoring throughout the day has shown that file reachability has indeed been returning, and the stability of the file store also seems to be returning to normal. We'll continue to monitor the progress of the restore.
Incident Update (Oct 6; 2pm ET): We've just debriefed again with our file host – they're seeing speedy progress in restoring file reachability, and similar progress improving stability. They believe the process is nearing completion but (unsurprisingly) have not provided an ETA.
Incident Update (Oct 6; 11:30am ET): Thanks, everybody, for your messages of support. We've yet to receive another update from our host, but not for lack of trying. We do know that they're installing additional hardware today to better handle the load throughout this repair process and to speed the process along. As soon as we know more, we'll provide another update.
Incident Update (Oct 6; 12am ET): Our file host is reporting that their tool has significantly increased the rate at which they're able to process and restore reachable data, but has still not given us an ETA for completion. Suffice it to say, we're not satisfied with that, and have made that clear, and we await a further update.
Incident Update (Oct 5; 6:30pm ET): We've debriefed again with our file host on the status of their fix: they have developed a tool that is identifying unreachable data and fixing it, and that tool is currently operational. We have asked for an estimate for when its work will complete.
Incident Update (Oct 5; 3pm ET): We continue to press our file host for information regarding their fix and an ETA for when Cast's file store will return to service.
Incident Update (Oct 5; 11am ET): Our file host has developed a fix that will allow them to manually prioritize access to Cast's data, and will be deploying it shortly if tests succeed.
Incident Update (Oct 5; 12am ET): We've just debriefed with our cloud storage provider; they'll be working through the night tonight to restore the affected systems – specifically a solution that will identify all Cast data and allow it to bypass the larger outage. We'll update again as soon as we know more.
Incident Update (Oct 4; 8pm ET): We have not received another update from our host, despite frequent requests for info. Again, we can't apologize deeply enough for this – this is our worst nightmare, and we're doing everything we can to get back up and running as quickly as possible. Until the file host comes back, however, our hands are unfortunately tied.
Incident Update (Oct 4; 5pm ET): Our file host continues to work on resolving the issue, but we're still without an ETA.
Incident Update (Oct 4; 3pm ET): We're seeing affected systems returning to normal, and hope to restore site functionality shortly once we verify that it's all working as expected.
Incident Update (Oct 4; 1pm ET): Our file host has indicated to us that they're aware of the issue and working toward a resolution. Unfortunately we have not been given an ETA at this time, but will continue to monitor the situation.
Incident Update (Oct 4; 11am ET): Our monitors have detected a renewed service interruption at our host, which has forced us to put the site into maintenance mode. We are frustrated beyond belief with this, and have requested an urgent update from our hosting provider.
Incident Update (Oct 3; 11pm ET): We're continuing to see systems return to normal, but we're not at 100% yet and have not received another ETA from our file host. We'll continue to monitor the situation and update here again when we have more info.
Incident Update (Oct 3; 2pm ET): The latest word from our file host is that they expect the majority of their systems to return to normal over the course of the day today. We'll continue to monitor the service and update again when we know more.
Incident Update (Oct 3; 1pm ET): Unfortunately our file host has not yet resolved the disruption. Until they do, mixes and recording downloads will often fail to complete, and it will be difficult to start new recordings. We're deeply sorry about this disruption, and doing everything we can to have things up and running as quickly as possible. As soon as we have another time estimate, we will update this document.
Service Disruption (Oct 2; 5pm ET): We're seeing an ongoing service disruption with our file host, which is resulting in mixes and recording downloads often failing. We're working with the host to help them resolve this issue as quickly as possible, and have been advised to expect it to be fully resolved by tomorrow morning. Sorry about the inconvenience!