Improved default search behaviour in classic interface

When we introduced the current interface, one of the features we were really happy with was our vastly improved searching. Basically we implemented a full text index that allowed you to search the headers and content of all your email in all folders for any words in a few seconds (in IMAP parlance, this uses the FUZZY SEARCH extension)

At the time, we decided to leave the search on the classic interface as it was (by default, search from/to/cc/subject headers but not the message content and search on substrings rather than whole terms).

However the general consensus from classic users is that they’d really like the improved search that the current interface comes with. So today we’ve rolled out a change that better unifies the search syntax on both the classic and current interfaces.

So now on both interfaces if you do a search:

dinner john

It will do a fast indexed search of the from, to, cc, bcc and subject headers as well as the message body content for messages that contain both “dinner” and “john”. This search is done on words/terms with stemming where possible, not sub-strings. This searches the current folder by default on classic, and across all folders by default on the current interface. On classic, you can check the “All” checkbox to search across all folders.

If you want to revert to the historical sub-string searching of headers, you can use the substr: modifier. Some more examples:

  • example – fuzzy search from/to/cc/bcc/subject headers and message bodies for the word “example”
  • body:example – fuzzy search message bodies for the word “example”
  • to:example – fuzzy search to/cc/bcc headers for the word “example”
  • onlycc:example – fuzzy search cc header for the word “example”
  • substr:example – search from/to/cc/subject headers (but not body content) for the substring “example”
  • substr:(dinner john) – search from/to/cc/subject headers for both substrings “dinner” and “john”
  • substr:(body:example) – search message body content for the substring “example” (warning: likely very slow!)
  • substr:(to:example) -  search to/cc/bcc headers for the substring “example”
  • substr:(onlycc:example) -  search cc header for the substring “example”

A complete list of all the search options can be found on our mailbox searching help page.

Posted in News. Comments Off

Cleaning up from an IMAP server failure

This blog post is highly technical. I cover details about how our email storage system works and how it was impacted by a complex server corruption and failure. I explain why our normal procedures failed in this instance, the ways in which our system helped us to track down and restore almost all the impacted emails, and some improvements we are making for our system to be more resilient in future.

Background

To show how things fit together, I’m going to start with an overview of our infrastructure and internal terminology. Some details have been shared before, but the exact setup changes over time. If you want to get straight on with the story, skip to the heading “The failure” below.

A slot

Each mail server is a standalone 2U or 4U storage server. Storage on each server is split into multiple slots (currently 15, 16 or 40 per machine depending on hardware model).

All current slots are called “teraslots”, consisting of:

  • A 1Tb partition for long-term email storage.
  • Space on a shared high-speed SSD for indexes, caches, recently delivered email, etc.
  • Space on a separate partition for long-term search indexes.
  • Space on a RAM filesystem for search indexes of recently delivered email.

All partitions are encrypted, either with LUKS or directly on the hardware if supported.

Every slot runs an entirely separate instance of the Cyrus IMAP server, complete with its own configuration files and internal IP address. Instances can be started and stopped independently. Configuration files are generated by a template system driven from a master configuration file, for consistency and ease of management.

A store

A store is the logical “mail server” on which a user’s email is stored. A store is made up of a number of slots, using Cyrus’ built-in replication to keep them synchronised. At any time, one slot is the master for a store, and it replicates changes to all the other slots.

A normal store currently consists of three slots, two in New York and one in Iceland. To move a slot between machines, it’s easiest to configure a new replica slot, wait until all data is replicated and the new slot is up to date, and then remove the unwanted replica slot. In the case where we’re moving things around, a store may consist of more slots – there’s no real limit to how many replicas you can run at once.

We spread the related slots of different stores such that no one machine has too many “pairs” to another machine. This means that if a single server fails, the load spreads to many machines, rather than doubling the load on one other box.

For each store, one slot is marked in our database as the “master slot”. In the past we used to bind an IP address to the master slot and use magic IP failover, but no more. Instead, all access is either via a perl library (which knows which slot is the master) or via the nginx frontend proxy, which selects a backend using the same perl library during login.

Replication

The cyrus replication system doesn’t record the actual changes to mailboxes: it just writes “something changed in mailbox X” to a log file (or in our multi-replica configuration, to a separate log file per replica).

A separate process, sync_client, runs on the master slot. It takes a working copy of the replication log files that other cyrus processes create. If the sync_client process fails or is interrupted, it always starts from the last log file it was processing and re-runs all the mailboxes. This means that a repeating failure stops all new replication for that replica, which becomes relevant later.

The sync_client process combines duplicate events, then connects to a sync_server on the other end and requests some information about each mailbox named in the log. It compares the state on the replica with the state on the master, and determines which commands need to be run to bring the replica into sync, including renaming mailboxes. Mailboxes have a unique identifier as well as their name, and that identifier persists through renames.

We have extensively tested this replication process. Over the years we have fixed many bugs and added several features, including the start of progress towards a true master-master replication system. We still run a task every week which compares the state of all folders on the master and replica, to ensure that replication is working correctly.

Failover

Our failover process is fairly seamless. If everything goes cleanly, the only thing users see is a disconnection of their IMAP/POP connection, followed by a slow reconnection. There’s no connection error and no visible downtime. Ideally it takes under 10 seconds. In practice it’s sometimes up to 30 seconds, because we give long-running commands 10 seconds to complete, and replication can take a moment to catch up as well.

Failover works like this (failover from slot A to slot B):

  1. Check the size of log files in replica-channel directories on slot A. If any are more than a couple of KB in size, abort. We know that applying a log file of a few KB usually only takes a few seconds. If they’re much bigger than that, replication has fallen behind and we should wait until it catches up, or work out what’s wrong. It’s possible to override this with a “FORCE” flag. We also check that there’s a valid target slot to move to, and a bunch of other sanity stuff.
  2. Mark the store as “moving” in the database.
  3. Wait 2 seconds. Every proxy re-reads the store status every second from a local status daemon which gets updates from the database, so within 1 second, all connection attempts to the store from our web UI or from an external IMAP/POP connection will be replaced by a sleep loop. The sleep loop just holds off responding until the moving flag gets cleared, and it can get a connection to the new backend.
  4. Shut down master slot A. At this point, all existing IMAP/POP connections are dropped. It will wait for up to 10 seconds to let them shut down cleanly before force closing the remaining processes.
  5. Inspect all channel directories for log files on slot A again. Run sync_client on each log file to ensure that they sync completely. If there are any failures, bail out by restarting with the master still on slot A.
  6. Shut down the target slot B.
  7. Change the database to label the new slot B as the master for this store.
  8. Start up slot A as a replica.
  9. Start up slot B as the master (this also marks the store as “up”).

Within a second, the proxies know that the store is available again and they continue the login process for waiting connections.

Unclean failover

In the case of a clean failover, all the log files are run, and the replica is 100% up-to-date.

If for some reason we need to do a forced failover (say, a machine has failed completely and we need to get new hardware in), then we can have what’s called a “split brain”, where changes were written to one machine, but have not been seen by the other machine, so it makes its own changes without knowledge of what has already happened.

Split-brain is a common problem in distributed systems. It’s particularly nasty with IMAP because there are many counters with very strict semantics. There’s the UID number, which must increase without gaps (at least in theory) and there’s the MODSEQ number, which similarly must increase without changes ever being made “in the past”, otherwise clients will be confused.

Recovering from split brain without breaking any of the guarantees to clients was a major goal of the rewrite of both the replication protocol and mailbox storage which I did in 2008-2009. These changes eventually lead to Cyrus version 2.4.

Anti-corruption measures

We also want to be safe against random corruption. We have found that corruption occurs a couple of times per year across all our servers (the cause of this is hard to say, most likely hard drive or RAID controller issues from what we’ve seen) and we were bitten hard in the past by a particularly nasty linux kernel bug which wrote a few zeros into our metadata files during repack.

Since then we have added checksums everywhere. A separate crc32 for each record in a cyrus index, cache or db, and a sha1 of each message file. Cyrus also sends a crc32 of the “sync state” along with every mailbox, allowing it to determine if the set of changes did actually create the same mailbox state during a sync. A sync_crc mismatch triggers a full comparison of all data in that mailbox, allowing sync_client to repair the error and resync the mailbox.

And now you know enough about our architecture to understand what happened!

The failure

On Thursday 27th February, at 4:30am Melbourne time, Rob N (the on-call admin for the night) was paged by our monitoring system because one of our servers (imap21) was not responding. You can read his initial blog post about the incident as well.

He initially thought it was just a crashed server – there are various reasons why complex systems crash, and it wasn’t obvious which one it was. Our normal procedure is to restart the server and restart all the IMAP slots, replicate any remaining changes off the machine, then fail over to replicas while the search indexes catch up.

Unfortunately, things were worse than that. It started up with significant corruption on the operating system partition, and one of the 15 IMAP partitions failed to mount. At this point he force-failed-over (meaning moving the IMAP server master to its New York replica without doing a full sync first) and went back to bed to get some sleep and deal with things in the morning.

Initial cleanup

I was away during the first couple of days of this, on school camp with my daughter in one of the few places in the world where you can’t even get phone reception, but I have notes from the others and log files to go on.

On Thursday morning, there were a number of support tickets from people who were missing a few days of recent email. A quick look at the logs showed that one of the 11 master slots (of the 15 on this server, two were empty and two were replicas) had not been replicating for a couple of days.

The cause of this was an edge case issue with our replication system. If there is a loop in mailbox renames (i.e. folder ‘A’ was renamed to ‘B’ at the same time that folder ‘B’ was renamed to ‘A’, or a more complex circle of names), then the replication system bails out and tells us it needs to be fixed.

Normally I would notice this, and the others knew about the possibility as well, but over the previous few weeks I had been moving slots around in Iceland as part of our efforts to split entirely from the Opera network while retaining full offsite redundancy. There had been a lot of cyrus noise in the notification channel, and nobody picked up that this noise pointed to a more significant issue.

As mentioned above, this single issue in the working log file means that no new events can be processed. Interestingly, changes to other mailboxes in the same file might get replicated, because it will re-visit them over and over, but other newly-changed mailboxes will not be noticed.

To make things worse, the failed channel was also the “new user server”. Every week a task runs , calculates which store is least loaded, and sets a database field to direct newly created users to that store. So there were brand new users, and some of them had used our import facility to bring large amounts of email in.

Re-replication

The usual fix in this sort of case is to re-run the log files, even though the slot is now a replica. Because we keep a sha1 of the content of each message, the replication system can detect different files at the different ends. It can then fix up mailboxes by re-injecting BOTH messages with new UIDs, higher than any yet used, and then expunging the original UID at both ends. This means that no matter which one a client saw first, it now gets a repaired view of the mailbox with both messages in it. I wrote a long justification for this logic, and why we do it, which you can read at https://lists.andrew.cmu.edu/pipermail/cyrus-devel/2010-July/001500.html.

We also use the same method in reconstruct to repair from a damaged mailbox on disk, but I’m about to re-visit that based on the below.

So, for messages within an existing mailbox, almost all were restored by this, except where things were broken for other reasons.

Bogus reconstructs

Because sync wasn’t working, the team in Melbourne started running reconstructs to repair broken folders. In theory, this was a great idea – it’s what I would have recommended.

In practice though, the data on the machine was badly broken in subtle ways, and this just made things worse. Looking through the log files, I see numerous cases where reconstruct determined that the stored sha1 of the message was not the same as the sha1 of the file on disk, and so injected the message again with a new UID. This should never happen unless the file is actually corrupted. Looking at the files on disk now, their sha1 is actually the old value, not the new one that it calculated during the reconstruct.

So the machine either has a faulty CPU, faulty memory, dodgy RAID card, intermittently faulty disk… whatever it is, it’s pretty horrible! We will be able to test it more once we’re happy that all the salvageable data is copied off.

Due to the faulty hardware, the attempts to make things better actually made things worse!

The worst cases were where reconstruct decided that a folder didn’t exist, and wiped it from the mailboxes database. The replication engine then wiped the copy on the replica. Likewise, where mailboxes didn’t exist at the other end due to not being replicated yet (see above), they got wiped from the original copy. These are the cases where, for 49 users, we can see that some emails were lost entirely.

Thankfully, repeated checksum mismatches caused replication to bail out in many places, saving us from a lot worse pain.

Time to cut our losses

On the morning of Friday 28th when I got back into phone range, pulled out my laptop on the bus home and inspected the damage, sync was still bailing out everywhere.

First thing I said was “turn off all the replication and reconstructs – let’s stop making things worse”.

I had an empty machine (imap14) which I had just finished clearing out in New York (see above about all the noise – lots of reconfiguring to get everything into consistent teraslots). I configured new replica pairs on imap14 for all the slots that used to be paired with imap21. It took about 4 days to get everything fully replicated again (moving that much data around, including sanity checking and building new search databases takes time), but we’re now fully replicated again, nothing fewer than 3 slots for each store, one offsite.

I left imap21 up, but with nothing talking to it or trusting it any more.

Examining the mess

The first job was to determine the extent of the damage. Thankfully, the log partition on imap21 wasn’t damaged, so we had backup copies of all the log files from everywhere on our log server by the time I got home. These aren’t the ‘sync log’ files from above; they’re the general syslog used to monitor and audit the everyday operations of the system. We keep them around for a few months to help track down the history of problems.

The first step was to identify what had happened, and locate any messages that hadn’t been replicated. We syslog important actions within Cyrus, in a consistent format with the label ‘auditlog:’.

Here’s an example of a recent message appended to my mailbox and then moved to Trash:

2014-03-10T03:11:04.544356-04:00 imap20 sloti20t12/lmtp[23349]: auditlog: append sessionid=<sloti20t12-23349-1394435464-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.374972-04:00 imap20 sloti20t12/imap[22916]: auditlog: append sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.390532-04:00 imap20 sloti20t12/imap[22916]: auditlog: expunge sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>

(The sessionid field can be used to track the exact login time, IP address, etc – and for deliveries it can link through the logs to find the server which sent it to us originally).

It was appended to my INBOX via LMTP, then appended to my Trash folder and expunged from my Inbox. The ‘guid’ field is the sha1 of the underlying message.

Over on my New York replica slot:

2014-03-10T03:11:04.544356-04:00 imap20 sloti20t12/lmtp[23349]: auditlog: append sessionid=<sloti20t12-23349-1394435464-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.374972-04:00 imap20 sloti20t12/imap[22916]: auditlog: append sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.390532-04:00 imap20 sloti20t12/imap[22916]: auditlog: expunge sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>

And of course, the same in Iceland:

2014-03-10T03:11:06.127098-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: append sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:33.575200-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: expunge sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:33.833070-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: append sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>

Anyway, we found something else interesting in the log files. For nearly 24 hours before the hard crash, there had been random transient failures in the replication logs. It would attempt to replicate a change, get a checksum error, try to replicate the exact same change, and the second time it would work! The machine was obviously intermittently failing well before it finally crashed.

We also found that two of the other master slots had encountered replication errors before the hard crash. The errors had caused replication to repeatedly bail out and they were also falling behind – this was enough noise that anyone would have noticed something was wrong – but it started about 2am our time, and everyone was asleep.

Based on user tickets and the state of the machine before the crash, we knew that there was still email missing. It was time to try to identify it. We determined the following algorithm to find messages that needed to be recovered.

  1. If there was an ‘append’ on imap21 via something that was NOT sync_client or syncserver, but no matching append on the replica with the same guid, that’s bad.
  2. If there was an ‘expunge’ on the replica by a sync program that can’t be matched to an expunge by an imap or pop3 process, then that was a message which shouldn’t have been wiped (but see above about the re-UIDing process; if we find another append to the same folder with the same GUID in either case, then the lower-UID one isn’t interesting any more).
  3. If a folder was deleted by a sync tool without a matching intended delete, then we have a problem.

Replication – the bad bits

I discovered, to my horror, that when replicating an expunged event from the master, the sync protocol ignored the sha1 of the message. It should either have checked that it was the same message file that was expunged on the replica as well, or just bailed out. This meant it would expunge a never-seen message from the replica if the same UID had been seen and then deleted on the master. Ouch.

This is also a flaw with folder deletes. When a deleted folder is replicated, it just removes the folder and message files from the disk on the replica. That’s wrong and should never be allowed. To ensure we can always recover accidentally deleted mail, any deleted folders are supposed to be moved to a special DELETED namespace. They’re only permanently deleted a week later by a cleanup process.

The end result of all this: some new folders created after the failover from imap21 were irreversibly deleted when we tried to sync over the missing messages from imap21.

Recovery of messages

In either case, an append that never synced or an expunge that wasn’t justifiable, the fix is the same. Find a file with the same sha1 in any of that user’s mailboxes, or their backup, or on disk on imap21, and that’s your message. We did that for nearly half a million emails in total, over half of those belonging to 3 users who had just imported a large amount of email from external services.

After that, we were left trying to recover a few folders which got wiped. The fix in that case is to recover from backup (hopefully most of the messages) and then search for any appends since the highest UID in the backup, and append those as well (unless a justifiable expunge can be found for them).

The great thing about the log file containing a GUID which is the sha1, is that you can determine with cryptographic certainty that you’ve found the right file. We pulled files from imap21 (don’t have to care about corruption, the sha1 will check it), files from the replicas, and even the backup server for the user to look for matching messages.

Some statistics

  • 370 affected users
  • 458471 found messages
  • 29837 lost messages (no sha1 file able to be found)
  • 32 lost folders (deleted, nothing in backup)
  • 49 users with SOMETHING lost

We have emailed all affected users with details of exactly what was lost and recovered for their account.

Making sure this does not happen again

We take our uptime and reliability very seriously. Losing anything is a major failure for us, and we are determined to take steps to ensure this situation can not happen again.

Operational changes

  • We now page the on-duty if replication falls more than 5 minutes behind to all replicas.
  • We alert (non-emergency) if ANY replica falls more than 5 minutes behind. This alert is made to look different from the Cyrus notices that can happen during regular maintenance, so everybody will be aware that there is an issue.
  • We will no longer attempt to reconstruct or keep replication going if we suspect a faulty server (at least until replication is super-safe). We’ll be much quicker to just declare the machine faulty and recover messages independently.

Software changes

  • The immediate fix is to do the GUID test even if the message was expunged on the master. This avoids the nasty case of expunging an unseen message during split brain.
  • The larger fix is two-pronged. First make reconstruct a lot less likely to damage existing mailboxes in case of corruption. There’s no point re-injecting a corrupt message file, nobody wants a file full of random blocks off disk. If the sha1 doesn’t match, then abort and get the admin to fix the permission problems or check the disk first.
  • The second prong is changes to the replication system, so that it’s impossible to make the replica delete message files immediately via the replication protocol. The most you will be able to do is make it move them aside for later removal on the scheduled rotation (we run on Saturdays and delete things that have been marked for deletion for at least one week at that point – so between 1 and 2 weeks).
  • Finally, we want to integrate the ability to fetch files from a replica to repair a corrupt local copy. We already have a separate magic perl script that can do this, but it runs outside Cyrus, with its own ugly locking tricks to force Cyrus to accept what it’s doing. It would be much neater to have this integrated into reconstruct, so that it’s replication-aware.

Finally

This has been a long saga (2 weeks from failure to restoring everything we could), and a learning experience for everyone involved. Our detailed logging and checksums meant we were able to recover the vast majority of messages affected by the corruption, but we are obviously unhappy that we lost anything, and we will be taking the steps outlined above to prevent this issue happening again.

We apologise to the 370 users who have been without some email for over a week while we took our time making sure we fully understood what had happened and what needed to be recovered. We’re especially sorry for small number of people for whom we lost emails irretrievably. We have contacted these people individually.

We’re proud of our reliability track record at FastMail and we respect the trust that people place in us to store their email. We are working hard to restore and maintain that trust, both by being open about exactly what happened with this incident and by updating the system and processes we have in place to improve our resilience in the future.

Posted in Technical. Comments Off

Diary of an outage

As some of you are no doubt aware, yesterday we had a fairly serious outage. It only affected a small number of users, but for them it meant some some 4-6 hours with partial or no access to mail. In this technical post I’ll be explaining exactly what happened and what we are doing to fix the problem and ensure it doesn’t happen again in the future.

For the non-technical readers, you can skip to the last paragraph.

Early morning page

At around 4:30am Melbourne time (around 17:30 UTC) I was paged by our automated monitoring tools. We have extensive monitoring of most aspects of our infrastructure, and the system has permission to wake people up if it notices a problem. In this case, it noticed that one of our many backend mail servers, imap21, was no longer reeachable. Unlike many other service providers we don’t do automatic failovers because if the software makes the wrong decision it can lead to worse problems than might otherwise occur. We prefer to put a human into the loop to make sure things look sane before taking any action.

A machine failure of some sort is not as uncommon as you might think. Usually it’s a single disk failure, but we’ve variously had disk controllers fail, power supplies fail, and kernel or other bugs cause machines to crash. In the event of a crash, the machine usually reboots. The normal procedure for a night-time failure is to failover all mail stores on the machine to replicas on other machines. That’s anywhere between 15 and 40 stores depending on hardware configuration. We prefer to return service to the original machine if possible so we can maintain the replicas and thus keep our redundancy level, but that’s usually not possible in the case of a hardware fault – the machine is no longer fit for service.

The failover process involves shutting down both the current “master” slot and the target replica slot, reconfiguring them such that the master becomes the replica and the replica the master, and then starting them both up. The system configuration database is also updated so that all other services (web client, IMAP/POP frontends, mail delivery) know where to find the stores now. This is all done by a scripted process that can be initiated with just one command. In the event that the current master slot is unavailable, that part is skipped.

This morning imap21 failed in a fairly serious way. It was not available on the network. The management console showed that it tried to reboot, found corruption on the operating system disk, tried to repair it and couldn’t and was waiting for manual intervention. That immediately told me this was a serious problem and the best course of action was not to try and repair it, but instead failover its 15 slots to replicas. I did this, checked our monitoring to confirm all systems were operating normally, and went back to bed. I knew that full repairs would potentially be a big job, and I wanted enough sleep to be able to do it without messing things up further. As far as I was able to tell service had been fully restored to users, which is the single most important thing.

Replication logs

Fast forward to 9am. I arrived at the office after a bit more sleep at home and on the train (you take what you can get!). On arrival I was told that a number of users had been reporting either missing mail or errors on login. The first thing I thought of turned out to correct: at least one of the replicas I had failed over to were not up to date. But first we had to prove it!

The Cyrus replication model is an “eventually consistent” one, fairly common among database server software (which is what Cyrus is, though optimised for email storage). When some action is performed (delivering a message, creating a folder, moving a message, etc), the operation is performed on the local data store, and a record of the operation is written to a replication log. A separate pair of programs (sync_client and sync_server) use that log to perform the same actions on the replicas. Typically that happens in near realtime, but if replication was in fact behind, then we should see lots of unprocessed lines in the replication log.

Getting to the replication log was something of a problem because as noted above, imap21 refused to boot. A filesystem check was not pretty – literally thousands of errors. That’s quite worriesome as this is a root disk which generally shouldn’t have any corruption on it (it almost never recieves writes), but figuring out why the crash happened is something of a secondary concern at this point. So I reinstalled the OS into the root partition. That’s something we do all the time and it’s fast and accurate. Twenty minutes later it was fully reinstalled and the machine was up and running, and we got to inspect the replication logs.

The good news was that 14 of the 15 stores on this machine had fully up-to-date replicas, so there was only one mail store to deal with. The bad news is that the remaining store had some 2-3 days worth of unreplicated events in it’s log. To add more pain to the situation that store was coincidentally the current “new user” store, where new users are created. That means that any users that had signed up in the last 2-3 days did not exist at all on the replica, thus the reported login errors. A horrible first experience!

Examining the log, we discovered that right at the top there was a complex series of folder renames within a single replication event. This is not a particularly unusual operation. This time it tripped a known, rare bug in the way renames are replicated that caused the replication process (sync_client) to abort. The Cyrus master daemon starts it up again, but then it hits the same point and dies again, over and over. Replication stops.

Fixing this is on our list of things to do. Because it’s a fairly rare thing to happen and usually gets dealt with quickly, it hasn’t quite made it far enough up our list to deal with. Obviously that has now changed, but lets talk about why it wasn’t dealt with this time.

Monitoring the monitors

As noted, we have a lot of system monitoring running, and we place a huge amount of faith in it. It’s almost by definition that if there’s no problems being reported, then there are no problems, at least in overall system health. Of course individual per-user problems can appear from time to time, which is unfortunate. A support ticket is the usual way to get to the bottom of individual problems like that.

Something we monitor is replication lag, which is almost entirely based on the size of the replication logs. If they grow “too large”, a low-level warning is produced. A warning of this type results in a message being posted to our IRC channel and an email being sent, but it will never generate a page. It’s the kind of thing that we look at and action every now and again, and as noted it’s a fairly rare event under normal operation.

There is however one time where that warning can occur yet not be a problem, and that’s when users are being moved around a lot. User moves are done via the replication system, and when you’re doing a lot of them at once it can generate a lot of replication log traffic, sometimes causing replication to lag significantly for short periods of time. Something we’ve been doing for the last few months is redesigning our disk layout into something much easier to reason about and work with, and that has required a lot of moves. This particular warning has not been as uncommon as it should have been recently, leading to the situation where we’ve started ignoring it.

Obviously this is a dangerous place to be. I recognise how bad that sounds – “system warnings were ignored” – but this is what happens when you have a warning that doesn’t quite match up with the importance of the situation. Think of it like the fuel light on your car. It needs to be calibrated to come on at just the right time. Too late, and you run out of fuel before you can refill. Too early, and you start to learn that it isn’t really a problem; you won’t run out of fuel for ages. At that point you might as well not have the light at all. It’s not a perfect analogy, but it’s instructive.

So far we haven’t figured out the original cause of the filesystem corruption (we suspect a hardware failure that isn’t visible to our normal tools) that led to the crash. But that’s not quite the point. Had appropriate action been taken when the replication lag was first noticed, we would have had fully up-to-date replicas at failover, and this entire situation would have been little more then a ten-minute outage and me being a little tired for the day.

Now what?

Obviously, we have a few things to do out of this!

  • The replication lag monitor needs to be able to send a page when things are getting bad. Our current thinking is that it should page if it doesn’t have a least one replica less than five minutes behind.
  • It also needs to understand that there are other things that can cause lag and compensate, which means it needs to know what user move operations are in progress. That said, replication falling behind is always a little dangerous, so we might be better to somehow change the way moves happen so that we always have at least one viable replica at all times. We’ll need to consider this in more detail.
  • We need to make the replication system able to cope with hard replication failures. One idea we’re currently considering is to put failed replication events into another logfile, and come back to them once the main log is empty. This needs a bit of thought, particularly around the consequences of operations being applied out-of-order.
  • We need to fix the renaming bug. Having a last-ditch protection against replication problems is great, but even better than handling problems is avoiding them in the first place.

Conclusion

It’s now 13 hours since the problem was fully understood and recovery began. We believe that most if not all missing mail and broken accounts have been repaired. If you were unfortunate enough to be affected and you’re still seeing problems, please contact support.

We are very sorry for this outage. We understand how important your email is and how much it affects you when it’s unavailable. We’re proud of our track record on reliabilty but we know we’ve dropped the ball on this one. To our new users, who are the ones most likely affected by this outage, we understand very well how this makes us look. We’re working hard to get the situation resolved and to restore your trust. Thank you all for your patience and understanding.

Posted in Technical. Comments Off

Content Security Policy now on Beta

At FastMail, we’re always looking to increase security for our users. Cross-site scripting (XSS) attacks are one of the dangers that all websites must take care to mitigate against. HTML email is the highest risk for all webmail providers. Before embedding it into a page, it must be carefully checked and any potentially malicious content removed. In particular, all scripting content must be removed otherwise an attacker could gain access to your account and email.

Due to the complex nature of HTML parsing and encoding, there are many ways that a malicious email might try to sneak through scripting content. That’s why we fully parse the HTML first on the server and sanitise it against a white-list of known-good tags and attributes. This ensures that any scripting content is stripped, and other ambiguous content is properly escaped and encoded.

We’re very careful, and we have lots of tests to ensure we protect against all known techniques for trying to embed scripts. However, there’s always a possibility of bugs in any software, and Content Security Policy, also known as CSP, is a new HTTP header that provides an extra layer of defence against these types of attacks.

With CSP, we can instruct all modern browsers to only ever load scripts from our own website. Any references to remote scripts or "inline" scripts will be blocked. This means if a malicious email somehow slips through our filters, the browser still stops it from doing anything dangerous.

We’ve just rolled this out on our beta server (https://beta.fastmail.fm) for testing. We hope to roll it out everywhere soon. If you use our beta server, please let us know of any new issues you notice by emailing betafeedback@fastmail.fm. Some browsers may have issues with extensions. These should be allowed to run according to the spec, but some browsers
may prevent them from doing so as a violation of the content security policy. If you have a problem with an extension at FastMail, please first try updating to the latest version. If the issue still persists, please let us know so we can contact the extension authors.

Posted in Technical. Comments Off

Secure SSL/TLS access to LDAP and DAV now mandatory

Over the last few years we’ve been phasing in mandating SSL/TLS encryption on all connections between user machines and our servers, ensuring that no one can eaves drop on your username or password to steal your login credentials.

We’re continuing with that process today by disabling non-SSL/TLS access to all LDAP and DAV services. We emailed everyone we believe that was using these services a few weeks ago to inform them of the upcoming change.

This means if you use LDAP to access your address book, you must use port 636 with SSL/TLS enabled.

If you use DAV to access files in your file storage, you must use https://dav.messagingengine.com, not http://dav.messagingengine.com (note the additional “s” in https://).

Posted in News. Comments Off

Calendar now available on beta.fastmail.fm for testing

We’re very excited to have released the web UI for our new calendar on to our beta server for public testing at https://beta.fastmail.fm.

To access, simply log in to your FastMail account at https://beta.fastmail.fm and select "Calendar" from the menu in the top left.

Note: The calendar is only available when you log in to the new user interface (the default). If you use the “Classic” user interface (by explicitly selecting it at login time or because you have an older browser like IE6/7 which doesn’t support the new interface) the calendar will not be available. We currently have no plans to port the calendar to the classic user interface.

Tips and tricks

  • Open the settings and select the new Calendar panel to enable a few advanced features, create new calendars and make sure your time zone is set correctly
  • You can drag and drop events to move them (or hold down alt whilst dragging to copy).
  • There are keyboard shortcuts for navigating (try j/k, or hit g), and also for the buttons in the action bar at the top (hover over the button to get a tooltip with the shortcut).

Sync to mobile/calendar software (CalDAV)

You can also sync your calendar with your mobile device as long as your device supports the CalDAV protocol (iOS supports it natively, Android requires a separate program, CalDAV-Sync works well, it costs around US$2). The required details are:

Most clients should correctly auto-discover your calendars. If that fails, you might need to setup the full CalDAV path in your client which is:

Make sure you replace fullfastmailusername@domain with your full FastMail username and domain.

Access restrictions

Not all service levels have access to the Calendar features.

Web UI

Available to all levels except legacy Guest and Member accounts (at the moment, the link is shown but events will not save; we will improve the UI here to make it clear that the calendar is not available to these service levels)

CalDAV

Available to:

  • Enhanced and Premier level accounts (Personal/Family)
  • Standard, Professional and Enterprise level accounts (Business)

That is, these levels do not have CalDAV access: Guest, Member (legacy). Lite, Full (Personal/Family). Basic (Business)

Current known issues and missing features

We’re actively working on all these issues.

  • There are currently some layout issues in older browsers. Supported browsers for now are: Chrome 21+, Firefox 22+, Opera 12.1+, Safari 6.1+, IE9+.
  • Alerts do not work yet (you can set them in the event editor, but no alert will be shown when they are triggered).
  • Email reminders do not work yet (you will not be sent an email).
  • Emails are not yet sent out to people invited to an event.
  • An email is not currently sent when you respond to an invitation in the calendar view.
  • Integration with the mail part of our web UI (save attached event to calendar etc.)
  • Support for files attached to events.
  • Support for calendar sharing between users in a family/business
  • Support for subscribing to public iCal files and showing them in your calendar.
  • An easy way to import/export calendar data.

Please post bug reports, feature suggestions or other comments either in the forums or email us at betafeedback@fastmail.fm. We can’t respond to everything, but we do read it all.

Posted in News. Comments Off

Increased storage quotas and other service level changes

we’ve made some changes to our service levels. These changes simplify pricing, unify personal and family service levels, make it easier to migrate from other services to FastMail and give the vast majority of users an increase in storage at no extra cost.

All accounts

All prices have been rounded to the nearest dollar. The existing $x.95 pricing on all accounts made comparing prices more difficult for users, so we’ve changed to just using whole dollar pricing on all accounts to make comparisons simpler and clearer.

Personal accounts

  • Ad Free has been renamed to Lite. Now that we don’t have free Guest accounts, the name is an anachronism since no accounts have advertising.
  • All Lite (previously Ad Free) and Enhanced accounts have increased email and file storage quotas. Full accounts have increased file storage quotas.
  • The price of the Lite account is now $10/year. We will email all existing Ad Free users about this change shortly with information about how to lock in the existing pricing for some years.

The complete list of new quotas and names are:

    Old quota New quota
Lite
(was Ad free)
Email:
Files:
100 MB
2 MB
250 MB
100 MB
Full Email:
Files:
1 GB
100 MB
1 GB
1 GB
Enhanced Email:
Files:
10 GB
2 GB
15 GB
5 GB
Premier Email:
Files:
60 GB
30 GB
60 GB
30 GB

Family accounts

We’ve renamed all the family service levels to be the same as the personal service levels, and also have the same quotas as the corresponding personal service level.

This means that if you manage more than one account, it’s much easier to switch to a family account. There’s no concern about having slightly different quotas and having to deal with all the different service level names, it’s a straight forward conversion from a personal account to the corresponding family account.

So now the main differences with family accounts compared to personal accounts are:

  • A single billing cycle and credit card for all accounts in the family
  • Add/change/delete accounts in the family from your management accounts at any time
  • Ability to have account names in your own domain (e.g. john@yourdomain.com)
  • Ability to have your own login screen at http://mail.yourdomain.com
  • It’s an extra $5/year for the family "container"

The complete list of new quotas and names are:

    Old quota New quota
Lite Email:
Files:
200 MB
6 MB
250 MB
100 MB
Full
(was Everyday)
Email:
Files:
800 MB
600 MB
1 GB
1 GB
Enhanced
(was Superior)
Email:
Files:
8 GB
6 GB
15 GB
5 GB
Premier
(new)
Email:
Files:
N/A
N/A
60 GB
30 GB

Business accounts

All Basic, Standard and Professional accounts have increased email and file storage quotas. The complete list of new quotas are:

    Old quota New quota
Basic Email:
Files:
250 MB
2 MB
500 MB
100 MB
Standard Email:
Files:
1.5 GB
100 MB
2 GB
1 GB
Professional Email:
Files:
15 GB
6 GB
25 GB
10 GB
Enterprise Email:
Files:
150 GB
60 GB
150 GB
60 GB

We’ll be emailing these details to all users shortly.

Posted in News. Comments Off
Follow

Get every new post delivered to your Inbox.

Join 4,999 other followers