A number of users were affected by a large outage on one of our servers over the weekend. This FAQ is designed to answer users' questions about what happened and what we’re doing in the future.
Q1. Why isn’t there any redundancy that would have avoided this?
A1. There is! We use a RAID storage system that ensures that individual disk failures do not result in any downtime. There are different levels of RAID, and we use one of the newest and most advanced, called RAID 6. RAID 6 is designed to continue normally even if 2 hard drives fail simultaneously. In fact, we’ve had several disk failures over the last couple of years, and all of these have gone entirely unnoticed by users thanks to the RAID system.
Q2. So what did happen?
A2. The drives we use have a guaranteed lifetime of 3 years and were only 15 months old. RAID 6 can tolerate up to 2 drives in an array failing, and even 2 drives failing at the same time is extremely rare. However, in this particular case, 3 drives all failed within a remarkably short period of time! At that moment, we had effectively lost access to all data on the unit, and had to resort to our disaster recovery plan: our daily incremental backups.
Q3. How long did it take to replace the first failed drive?
A3. We had the first failed drive replaced within about 30 minutes. As soon as we replaced the drive, the array began “rebuilding” the new drive, which is the standard RAID process for copying a redundant version of the data onto the drive. However, this process may take over 24 hours, depending on the load on the array. The rebuild was not complete when the 2 other drives then failed.
Q4. Why did you do the largest accounts last?
A4. Most of the accounts on the server were paid accounts, so after we calculated that the entire restore would take about 2-3 days, we decided to try to get as many users back up and running as quickly as possible. Because of the large skew in account sizes, if we had done the largest first, only 10% would have been restored after 2 days. By doing the smallest first, we were able to restore 90% of accounts within 24 hours or so of the failure.
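To illustrate why the restore order matters so much when account sizes are skewed, here is a minimal Python sketch. The account sizes are made up for illustration (they are not the real figures from the affected server), and the sketch assumes accounts are restored one after another at a constant rate.

```python
# Hypothetical account sizes in GB: many small accounts and a few very large ones.
# These numbers are invented for illustration; they are not the real data.
sizes_gb = [1] * 90 + [100] * 10


def accounts_restored(order, budget_gb):
    """Count accounts fully restored, in the given order, once budget_gb has been copied."""
    copied = 0
    done = 0
    for size in order:
        if copied + size > budget_gb:
            break  # the next account is still only partially restored
        copied += size
        done += 1
    return done


total_gb = sum(sizes_gb)
budget_gb = total_gb // 3  # data copied roughly a third of the way through the restore

smallest_first = accounts_restored(sorted(sizes_gb), budget_gb)
largest_first = accounts_restored(sorted(sizes_gb, reverse=True), budget_gb)

print(f"smallest first: {smallest_first} of {len(sizes_gb)} accounts restored")
print(f"largest first:  {largest_first} of {len(sizes_gb)} accounts restored")
```

With these assumed sizes, the same amount of copying restores 92 accounts when the smallest go first, but only 3 when the largest go first.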
Q5. Why didn’t you deliver email and restore at the same time?
A5. We did some experimentation and discovered that running both restores and deliveries at the same time seemed to vastly slow down both actions. Something that would take 2 days (restore) + 6 hours (delivery) separately was going to take > 5 days when run simultaneously. Thus we decided to perform the restores as quickly as possible, and as soon as that was done, deliver the queued email.
Q6. Why was email delivery delayed?
A6. During the restore, we restored users to a number of different servers rather than just back to the original server, to speed things up. Because of the above-mentioned problem with restoring and delivering email simultaneously, we suspended all email deliveries for a short time over the weekend.
Q7. Why did it take so long?
A7. Basically, the large volume of data. We had to restore > 1000 GB of data, and that just takes a while, even when restoring to multiple servers simultaneously. We knew that restoring from backups would probably take a while, but we regarded this as acceptable because it would only be required in the unexpected case of a total array failure, which would require either complete hardware failure or >2 drives failing at once, something that we’d calculated as extremely unlikely.
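As a rough illustration of why an estimate like this comes out so low, the Python sketch below computes the chance of 3 or more drives failing in the same window if every drive fails independently. It is not the calculation that was actually performed: the array size and per-drive failure probability are assumptions chosen for the example. The independence assumption is the weak point, since drives of the same model, batch and age can fail close together, which makes the real risk higher than this figure suggests.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that at least k of n drives fail, each independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical figures, for illustration only:
n_drives = 12      # drives in one array (assumed)
p_window = 0.001   # chance that a given drive fails during one rebuild window (assumed)

# RAID 6 survives 2 failed drives, so the array is lost on the 3rd failure.
p_array_lost = prob_at_least(3, n_drives, p_window)
print(f"P(3 or more failures in one window) = {p_array_lost:.2e}")
```

Under these assumptions the result is on the order of 1 in several million per window, which is why a triple failure looks negligible on paper.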
Q8. Did anything go right at all?
A8. Yes. We did actually have a "disaster" recovery plan in case an entire RAID array and/or file system became destroyed or corrupted, and when this very unlikely event happened, we were ready for it. Our nightly incremental backups of all email worked and were all there, and we were able to restore all users from these backups. We had multiple spare drives ready to go, and so could quickly bring a fresh RAID volume back online to restore to. We were able to get the entire restore done over the weekend and be ready in time for the working week.
Q9. Was any email bounced?
A9. No. All email was queued and correctly delivered to accounts once they were restored.
Q10. What are you doing to make sure it never happens again?
A10. While our RAID infrastructure has worked extremely well for over 5 years, this recent failure has made us look again at alternatives. The most likely system would be a replicated infrastructure, where all email and all actions are replicated across multiple servers simultaneously. Such a setup has recently become feasible because of some updates to our email server software, and thus we will now seriously look at this option and what would be required to make it happen. In such a scenario, any one system can fail entirely and we can still work just by switching all incoming connections and email to the working system.
Also, one of the core problems was the sheer amount of time it took to recover over 1000GB of data. A replicated solution would help in that even a wholesale system crash or corruption would just result in the alternate replica taking over, but to improve things even further we would reduce the total size of volumes we build, and instead use a number of smaller volumes so even disaster recovery restores would be quicker.
We will also look at using drives from different manufacturers and batches, to reduce the chance of several drives failing at around the same time, so that restores from backup are needed less often as well as being faster when they are.
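To give a feel for how volume size affects recovery time, here is a small Python sketch. The throughput is only an approximation derived from the figures in this FAQ (roughly 1000 GB in about 2 days, across all the servers used), and the smaller volume sizes are hypothetical, not a committed design.

```python
def restore_hours(volume_gb, rate_gb_per_hour):
    """Estimated hours to restore one volume at a constant throughput."""
    return volume_gb / rate_gb_per_hour


# Throughput implied by the figures in this FAQ: roughly 1000 GB in about 2 days.
# This is an approximation, not a measured value.
rate = 1000 / 48  # about 20.8 GB per hour

for volume_gb in (1000, 500, 250):  # the smaller sizes are hypothetical
    hours = restore_hours(volume_gb, rate)
    print(f"{volume_gb} GB volume: about {hours:.0f} hours")
```

At a constant rate the restore time scales directly with volume size, so halving a volume halves the worst-case restore for the users on it.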
Q11. Are you going to compensate users for the outage?
A11. We are going to give all affected Full and Enhanced users an additional month on their subscription. This will be occurring shortly.