On Friday we had one of the worst outages we’ve had in over 3 years. For at least 3 hours, all accounts were inaccessible (web, IMAP and POP), and for a few users, it was several hours longer than that. For some other users, there were additional mailbox and quota problems after that.
Obviously this is something we never want to happen, and over the years we’ve set up many systems to avoid outages like this occurring.
A small “trivial” change to a configuration program that was rolled out caused a cascading series of events that resulted in some important files being corrupted. We had to take down all services and rebuild the corrupted files from the last backup and add in any changes since the backup. Once rebuilt, we were able to bring back up all services. A separate corruption issue that affected a few users caused some longer outages and quota issues until we fixed those mailboxes.
We’ve identified the chain of events that caused the problems. Because it’s actually a chain of events, we’ve identified at least 5 separate issues to fix, so that this problem, or something similar to it, won’t happen again, and we’ll be fixing those over the next week.
Of course we can never be 100% sure that there aren’t other cascade event paths that will cause outages, but by learning from past mistakes, fixing the known problems, continuously enhancing our test infrastructure, and being more aware of possible consequential errors in the future, we’re always aiming to minimise the chance of them occurring and to provide the highest reliability possible.
Although it’s draining and frustrating to see a large problem like this occur that so badly affects our users, it’s been fascinating to actually investigate this problem. What we end up seeing is how a set of little mistakes, bad timing, decisions made long ago, and human error (all of which are wonderfully obvious in hindsight) end up causing a much bigger problem than the initial trigger would ever suggest.
The domino effect
In this case, the problem stemmed from a cascading sequence of issues that started with a single misplaced comma.
- Cyrus configuration file error. The underlying trigger of the cascade was a single misplaced comma in a configuration file. In this case the error was detected by the developer during testing, who fixed the problem, but unfortunately pushed the fix to a different branch.
- Core dump behaviour of fatal(). The effect of the broken configuration file was that immediately after forking a new imapd process, it would try to parse the configuration file, fail to do so, and call the fatal() function. Normally that would just cause the process to exit. However our branch of cyrus has a patch we added that means all calls to the fatal() function dump core instead of just exiting; this is normally very useful for debugging and quality control.
- Kernel configured to add pid to core files. We also configure our kernels with the sysctl kernel.core_uses_pid=1 which ensures that each separate process crash/abort() generates a separate core file on disk rather than overwriting the previous one. Again this is very useful for debugging.
- Cyrus master process doesn’t rate limit forking of child processes. The cyrus master process that forks child processes doesn’t do enough sanity checking. Specifically, if an imapd exits immediately after forking, the master process will happily immediately fork another imapd, despite there being zero chance that the new imapd will do any better. At the very least this leads to a CPU-chewing loop (as well as a non-functional imapd) as each forked imapd process immediately exits and the master creates a new one.
- Core files end up in cyrus meta directory. Cyrus supports the concept of separating meta-data files from email data. This is very useful as it allows us to place the small but "hot" meta data files on fast (but small) drives (eg. 10k/15k RPM drives, or SSD drives in new machines), and place the email data files on slower and much larger disks. The "cores" directory where core dumps end up is located on the same path as the meta data directory.
- Cyrus skiplist database format can corrupt in disk full conditions. Cyrus stores some important data in an internal key/value database format called skiplist. The most important data is a list of all mailboxes on the server. This database format works very well for the way cyrus accesses data, and it’s been very fast and robust. However it turns out the code doesn’t handle the situation where a disk fills up and writes only partially succeed, causing database corruption.
Putting all of the above together creates the disaster. A small configuration change was rolled out. Every new incoming IMAP connection would cause a new imapd to be forked and immediately abort and dump core. Each core file would end up with a separate filename. This very quickly caused the cyrus meta partitions to fill up. They reached 100% full before we fully realised what was happening. This caused changes to the mailboxes databases to only partially write, causing a lot of them to become corrupted.
When we realised this is what had happened, we quickly stopped all services, undid the change, and tried to recover the corrupted databases. Fortunately the databases are backed up each half hour, and there are replicas as well. Using some libraries, we were able to quickly put together code that pulled any still valid records, records from the backup, and records from the replicas, and combined them to rebuild the mailboxes databases. We then started everything back up.
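The rebuild step described above can be sketched roughly as follows. This is only an illustration, not the code that was actually used: the function name `merge_records`, the sample keys, and in particular the precedence order (surviving records first, then replicas, then the backup) are all assumptions, since the precedence actually used is not stated here.

```python
from typing import Dict, Iterable, Mapping


def merge_records(sources: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Combine key/value records from several sources.

    Sources are given in priority order: the first source that
    contains a key supplies its value. (Assumed policy.)
    """
    merged: Dict[str, str] = {}
    for source in sources:
        for key, value in source.items():
            # setdefault keeps the value from the higher-priority source.
            merged.setdefault(key, value)
    return merged


# Hypothetical sample data standing in for mailbox list entries.
surviving = {"user.alice": "partition1"}
replica = {"user.bob": "partition2", "user.carol": "partition3"}
backup = {"user.alice": "partition1-old", "user.bob": "partition2"}

rebuilt = merge_records([surviving, replica, backup])
```

In this sketch a record that survived the corruption wins over an older copy from the backup, and records missing from the damaged database are filled in from the other sources.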
Adding insult to injury
Fortunately for most people, the mess stopped there. Unfortunately for a few users, there were some additional problems as well.
As well as the mailboxes database corruptions, it was discovered that the code that maintains the cyrus.header and cyrus.index files also didn’t like the partial writes that disk full conditions generate. This caused a small number of mailboxes to be completely inaccessible (Inbox select failed).
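One general way to keep a partial write from leaving a half-written record behind is to remember the file size before appending and truncate back to it if the write fails. The sketch below illustrates that idea only; it is not how cyrus handles its files, and `append_record` and its injectable `write` parameter are hypothetical names introduced for the example.

```python
import os


def append_record(path: str, record: bytes, write=os.write) -> None:
    """Append a record; if any write fails, roll the file back.

    `write` is injectable so that a failing disk can be simulated.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        start = os.fstat(fd).st_size  # last known good length
        try:
            view = memoryview(record)
            while view:
                written = write(fd, view)
                if written == 0:
                    raise OSError("write made no progress")
                view = view[written:]  # handle short writes
            os.fsync(fd)
        except OSError:
            # Drop whatever part of the record reached the file.
            os.ftruncate(fd, start)
            raise
    finally:
        os.close(fd)
```

This assumes a single writer per file; with concurrent writers the saved length could be stale, and locking would be needed as well.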
Fortunately cyrus has a simple utility to fix corruptions of this form called "reconstruct", so we ran that to fix up any broken mailboxes. Fixing up a mailbox with reconstruct, however, doesn’t fix up quota calculations; that is handled by a separate utility, "quota", which you can run with the -f flag to fix quotas. We ran that on users to make sure all quota calculations were correct.
Unfortunately there’s a bug in the quota fix code that in some edge cases can double the apparent quota usage of users. This caused a number of accounts to have an incorrect quota usage set, and in some cases pushed them over their quota, causing new messages to be delayed or bounced.
The little test that cried wolf
However the story wouldn’t be complete without the additional human errors that let this happen as well. Thanks to help from the My Opera developer Cosimo, we internally have a Jenkins continuous integration (CI) server set up. This means that on every code/configuration commit, the following steps occur:
- Roll back a virtual machine instance to a known state
- Start the virtual machine
- git pull the latest repository code
- Re-install all the configuration files
- Start all services
- Run a series of code unit tests
- Run a series of functional tests that test logging into the web interface, sending email, receiving the email, reading the email, and much more. There’s also a series of email delivery tests to check that all the core aspects of spam, virus and backscatter checking work as expected.
This test should have picked up the problem. So what happened? Well a day or so before the problem commit, another change occurred that altered the structure of branches on the primary repository. This caused the git pull the CI server does to fail, and thus the CI tests to fail.
While we were in the process of working out what was wrong and fixing this on the CI server (only pulling the "master" branch turned out to be the easiest fix), the problematic commit went in. So once we fixed up the git pull issue, we then found that the very first test on the CI server was failing with some strange IMAP connection failure error. Rather than believing this was a real problem, we assumed it was due to something else with the CI tests being broken after the branch changes, and resolved to look at it on Monday. Of course the test really was broken due to a bad commit, and as Murphy’s Law would dictate, someone else would do a rollout on Friday Norway time of the broken commit.
Putting it all together
A combination of software, human and process errors all came together to create an outage that affected all Fastmail users for several hours at least, and some users even more.
Obviously we want to ensure this particular problem doesn’t happen again, and more importantly, that processes are in place where possible to avoid other cascade type failures in the future. We already have tickets to fix the particular code and configuration related issues in this case:
- link cyrus cores and log directories to another partition
- make skiplist more robust in case of a disk full condition
- make cyrus.header/index/cache more robust in case of a disk full condition
- cyrus master respawning should back off in the case of immediate crashes
- fix cyrus quota -f to avoid random quota usage doubling
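To illustrate the respawn item in the list above, here is a minimal sketch of exponential back-off for a process supervisor. It is not the cyrus master code or the eventual fix; the class name `RespawnBackoff` and the thresholds (5 seconds minimum uptime, 1 second base delay, 60 second cap) are assumptions chosen for the example.

```python
class RespawnBackoff:
    """Decide how long to wait before forking a replacement child."""

    def __init__(self, min_uptime: float = 5.0,
                 base_delay: float = 1.0, max_delay: float = 60.0):
        self.min_uptime = min_uptime  # shorter lifetimes count as crashes
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failures = 0  # consecutive immediate exits

    def delay_after_exit(self, uptime: float) -> float:
        """Return the number of seconds to wait after a child exits."""
        if uptime >= self.min_uptime:
            # The child ran for a while, so treat this as a normal exit.
            self.failures = 0
            return 0.0
        self.failures += 1
        # Double the delay for each consecutive immediate exit, capped.
        return min(self.max_delay,
                   self.base_delay * 2 ** (self.failures - 1))
```

With these values, a child that keeps dying immediately is retried after 1, 2, 4, 8 seconds and so on up to the cap, rather than in a tight loop that produces a new core file on every iteration.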
We’ve also witnessed how important it is to keep the CI tests clean and to track down every failure. We’ve immediately added new tests to sanity check database, imap and smtp connections as a very first step before any other functional tests are run. If any of them fail, we tail the appropriate log files and list the contents of the core directories, so the CI failure emails that are sent to all developers will make it very clear that there’s a serious problem that needs immediate investigation.
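A first-step connection check of this kind could look something like the sketch below. It is a simplified stand-in, not the actual CI test: the function names are hypothetical, and it only confirms that a TCP connection is accepted, where a real check would probably also read the service greeting.

```python
import socket
from typing import Dict, List, Tuple


def port_accepts_connections(host: str, port: int,
                             timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable and timed-out connections all land here.
        return False


def failed_services(services: Dict[str, Tuple[str, int]]) -> List[str]:
    """Return the names of services that do not accept connections."""
    return [name for name, (host, port) in services.items()
            if not port_accepts_connections(host, port)]
```

A CI job could call `failed_services({"imap": ("localhost", 143), "smtp": ("localhost", 25)})` (host and ports assumed here) and stop before the functional tests if the returned list is not empty.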