Payment Issues

Recently there have been some problems with our payment processor.

One of these is that sometimes there is a significant delay in completing a transaction, so the funds are “reserved” but the transaction isn’t actually completed until a few weeks later.

The other problem is that a number of charges were made to some of our customers’ credit cards, in our name. The majority of these are USD 1.00 test charges that are used to verify that a card is valid. Normally we cancel these charges immediately and the charge doesn’t appear on the card statement. However, the problem at our payment processor has resulted in some of these supposedly “cancelled” charges recently being applied to customer cards. A number of other charges which were supposed to be “cancelled” have also been processed. This was done in error and without any instructions from FastMail.

We are currently working with our payment processor to refund these erroneous payments. In some cases, the refunds cannot be processed directly because the card has expired or been cancelled.

We may need to refund these charges via an alternative method, or, at the user’s option, instead credit the amount towards future FastMail renewals. We will be in touch with the affected customers once we have more information on the scope of the problem.

We would like to assure our users that we will make certain that all erroneous charges are corrected. This may take some time, so thanks for your patience.

Our apologies for any inconvenience or confusion. We will post more information as it becomes available.
– the FastMail team


New phishing trick, data: URLs to avoid forgery reporting

This is a technical post about a new and interesting phishing technique seen today. Regular FastMail users can skip this post.

We saw an interesting new phishing attempt today that uses a relatively novel technique to try and hide the source of the attack and avoid it being reported as a web forgery.

Firstly, the email itself looks reasonably well done (apart from the year in the subject being completely wrong); it’s certainly not the poor quality you often see. It looked like this (ANZ is an Australian bank):

[Screenshot of the phishing email impersonating ANZ]

Secondly, the email was sent using a compromised Gmail account with a .edu address. In fact there were two separate emails, each from a different compromised Gmail .edu account. I imagine compromised Gmail .edu accounts aren’t that easy to get, and this significantly reduced the chances of it being caught by any spam filter.

Thirdly, the phishing page itself is interesting in that it:

  1. Uses a standard link shortener for a redirect (http://ow.ly in this case)
  2. Which redirects to the phishing delivery page (a compromised page on http://zerra-performance-center.de)
  3. That page, however, rather than hosting the HTML phishing login page directly, does this:

<script type="text/javascript">
        window.location="data:text/html;base64,... base64 encoded version of HTML phishing login page ...";
</script>

That data: URL is itself the phishing page content, which includes links to real ANZ website logos to make it look as authentic as possible, but has a form submit action to a compromised page on http://lucinaracosta.com.br.
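
If you want to inspect one of these for analysis, the payload is easy to recover: everything after the “base64,” marker is just the encoded page. A quick sketch in Python (the page content here is a made-up stand-in, not the actual phishing payload):

import base64

# build a data: URL the same way the landing page does (sample content only)
page = "<html><body>fake bank login form</body></html>"
data_url = "data:text/html;base64," + base64.b64encode(page.encode()).decode()

# recover the HTML for analysis
recovered = base64.b64decode(data_url.split("base64,", 1)[1]).decode()
assert recovered == page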

This approach is interesting because it makes it impossible to report the page as a forgery using the standard Firefox "Report Web Forgery" action: the page’s address is a data: URL rather than a normal web address. It’s a neat trick that makes the attack harder to report, and so harder to remove or block in the long run.

I’ve reported this issue as a Firefox bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1032564


Errors on classic mailbox screen and POP emails retrieved again for some users

A rollout of some new code today contained some errors that badly affected two separate areas of FastMail.

1. Selecting any emails to action on the classic interface caused a fatal error

An internal misuse of an API meant that selecting any emails on the classic interface Mailbox screen and trying to action those emails would fail with a fatal error. Reading and applying actions to individual emails continued to work fine. Unfortunately, the way this manifested itself wasn’t picked up in our testing before the code was rolled out. It has been fixed now, and we’ll update our tests to catch this case. The error lasted for about 3 hours.

2. POP links for some users re-downloaded all emails again

A long-term bug in the POP retrieve system resulted in a very rare and intermittent problem where some links for some users would forget all existing downloaded messages. This meant that in certain cases, users who had set the “Leave on server” option might see messages that had previously been downloaded being downloaded again, possibly several times. This obviously resulted in duplicate copies of the same message appearing in a folder.

Unfortunately a fix to this bug rolled out to one server for a short time actually made things worse, causing the same problem to occur for more users than the original bug.

A correct fix for the original problem and the subsequent bug has now been rolled out everywhere.

Users affected by this bug can find and remove any duplicate messages using the Advanced -> Folders -> Mass delete/Download/Remove duplicates … screen (the button is at the bottom). Select the folder with the duplicates at the top, and use the “Remove duplicate messages” section to find and remove them.


Increased spam getting through for the last few days

Due to an undetected compatibility issue between some software modules we use for detecting spam, a number of the tests we use to detect spam haven’t been working properly for the last few days. This means that considerably more spam may have been getting through our filters and into users’ Inboxes.

We’ve now fixed this issue and have added additional tests to ensure this doesn’t happen again.

For those interested in the technical details: we upgraded to a newer version of Net::DNS, and the version of SpamAssassin we use relied on some Net::DNS internals that had changed in the new version. This caused all RBL lookups to fail. Failing RBL lookups don’t cause any email delivery to fail; they just cause all RBL scoring to be ignored.
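
For anyone curious how an RBL lookup works: you reverse the octets of the connecting IP address, append the blocklist zone, and do an ordinary DNS A lookup. A minimal sketch in Python (the zone and test address below are the conventional public examples, not our actual configuration):

import socket

def rbl_listed(ip, zone="zen.spamhaus.org"):
    # e.g. 127.0.0.2 -> 2.0.0.127.zen.spamhaus.org
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)   # any A record back means the address is listed
        return True
    except socket.gaierror:
        return False                  # NXDOMAIN (or a failed lookup) means not listed

print(rbl_listed("127.0.0.2"))        # 127.0.0.2 is the standard "always listed" test address

Note that in a naive check like this, a failed lookup and “not listed” look identical, which is exactly why the broken Net::DNS internals quietly disabled RBL scoring rather than breaking mail delivery.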


New IP addresses in Iceland

We have just transitioned our Iceland servers to a new network range (belonging to us rather than to Opera).

This means that the addresses of ns2.messagingengine.com and in2-smtp.messagingengine.com have changed.

Unless you have hard-coded the old addresses somewhere, you shouldn’t see any difference. The new addresses will propagate over the next hour or so.


SSL certificates updated again

A few days ago we updated our SSL certificates. The algorithm used to sign these certificates (SHA256) presented problems with some older clients and operating systems, notably WebOS and Nokia devices. To fix this we got our CA (DigiCert) to re-sign the certificates using the older SHA1 algorithm, which should work pretty much everywhere. These certificates are now live on all of our domains.

Most users should not notice any change. If you are on a device or client where you’ve had to install the DigiCert root certificate in the last few days, you may need to do so again, as these certificates are signed by a different root certificate. If that affects you, the root certificate is available from DigiCert’s root certificate page and is called “DigiCert High Assurance EV Root CA”. If you also need the intermediate certificate, it’s available from the same page under the name “DigiCert High Assurance CA-3”.
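
If you want to check which algorithm a certificate is signed with, one quick way is to fetch it and inspect it. A rough sketch, assuming the third-party pyca/cryptography Python package is installed (this is just a spot check, not how we manage our certificates):

import ssl
from cryptography import x509

pem = ssl.get_server_certificate(("www.fastmail.fm", 443))
cert = x509.load_pem_x509_certificate(pem.encode())
print(cert.signature_hash_algorithm.name)   # e.g. "sha1" or "sha256"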


When two-factor authentication is not enough

TL;DR: This is the story of a failed attempt to steal FastMail’s domains.

We don’t publish all attempts on our security, but this one stands out for how much effort was put into the attack, and how far it went.

We’ve had a handful of minor attack attempts recently: targeted phishing emails to staff trying to steal credentials, and an NTP-based DDoS which was quickly mitigated by NYI, our excellent hosting service.

These sorts of attacks are the “background radiation” of the internet, along with port scans and entries in the web server logs from malware probing to see if we’re vulnerable to old PHP bugs (hint: we’re not). It’s the reality of being on the internet.

This blog post was first drafted before the Heartbleed fiasco. Sometimes, no matter how careful you are, you get a nasty surprise. We responded very quickly, as always. Anyway, on with this story.

About a month ago, our hostmaster@fastmail.fm account suddenly wound up subscribed to hundreds of mailing lists. All these mailing lists failed to use double or confirmed opt-in, so someone was simply able to enter the email address into a form and sign us up, no confirmation required. This really is poor practice, but it’s still pretty common out there. A special shout-out goes to government and emergency response agencies in the USA for their non-confirmation signup on mailing lists. Thanks guys.

The upshot was that the hostmaster address was receiving significant noise. Rob Mueller (one of our directors) wasted (so we thought) a bunch of his time removing us from those lists one by one, being very careful to check that none of the ‘opt-out’ links were actually phishing attempts. This turns out to have been time very well spent.

Internet identities

At FastMail, our security is central to the safety of our users. Given access to an email account, an attacker could reset passwords on other sites, including those which allow spending real money.

We take this responsibility very seriously, and we’re always looking for ways to improve our security.

Two factor authentication (2FA)

The Domain Name System is even lower in the stack, and even more central to identity security, than the email server itself.

Based on recent articles in the tech press, we really wanted to have ALL our domains protected by two factor authentication.

Our domains were historically spread amongst multiple registrars. We chose to consolidate with Gandi because they have a great slogan (“no bullshit”), they support 2FA, and they support all the top level domains we require.

Robert Norris, one of our sysadmins, was in charge of the migration. He set up a corporate account with Gandi to get assistance in transferring the domains, and set up two-factor authentication at the same time.

Gandi uses the popular OATH TOTP (also known as “Google Authenticator”) mechanism. Rob wrote a small TOTP client and placed it with the key on our management servers in the secure storage area where we also keep our SSL certificates. The account password itself was encrypted in our password manager, which is stored separately.
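
For reference, TOTP is simple enough that a client really does fit in a few lines. This is a minimal sketch using only the Python standard library, not Rob’s actual client, and the secret shown is a made-up example:

import base64, hashlib, hmac, struct, time

def totp(secret_b32, digits=6, step=30):
    key = base64.b32decode(secret_b32.upper())          # shared secret, base32 encoded
    counter = struct.pack(">Q", int(time.time()) // step)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                          # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

print(totp("JBSWY3DPEHPK3PXP"))                         # prints the current 6-digit code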

Only a small number of trusted people have access to the credentials for our Gandi account. We are satisfied that this level of security is strong enough.

The attempt

On March 19th, this came to hostmaster@fastmail.fm:

From: Gandi Corporate Team
To: hostmaster@fastmail.fm
Subject: [RN1374-GANDI] Email address update request
Date: Wednesday, March 19, 2014 8:27 PM

Hello,

We received an email update request for the account RN1374-GANDI.

Previous email address: hostmaster@fastmail.fm
New email address: fastmail.fm@qq.com

If you are opposed to this modification, thank you for letting us know
only by replying to this email.

If you can read this message, then you can recover the password of your
account, and thus modify the email address of the handle. In that case,
we won't take care of your request.

Without any reply from your side, we will proceed within 24 hours.

Best regards,

Gandi Corporate

The hostmaster alias actually forwards to three of us, and we were all hyper-alert, so we thankfully noticed this email.

Within twenty-four hours.

One day.

Gandi assure us that their fraud detection systems would have detected this, but for the 2 weeks it took from this email until we had full control over our account again, we were worried.

This request had completely bypassed our two-factor protection.

Forged source addresses

There is a well-known problem in network security: you can’t trust the source address of an IP packet – they are trivially forged.

It’s the reason why we have source port randomisation, sequence number randomisation… all the things designed to stop an attacker from forging both an initial SYN packet and the final ACK in response to the SYN-ACK, and so bringing up a TCP connection.

While an attacker can falsify the source of a request, without full network control they cannot receive the response to their forged packet and continue the handshake.

This is why this email was such a surprise. Like the poor quality mailing lists mentioned above, it didn’t require a confirmed opt-in. We had to reply to say that we didn’t want the contact email address changed.

This means that a forged source address was sufficient. Even though the attacker couldn’t read email to hostmaster@fastmail.fm, they didn’t need to. All they needed was for us to not read it.

To Gandi’s credit, they responded very quickly to our “NO, DON’T CHANGE IT” email, and locked our account to stop any further shenanigans while they investigated and collected more documents from us.

Falsified documents

We discovered that Gandi received a paper email change form (pdf) claiming to be from a “Robert NORRIS” (the name which appears on our whois data), along with pictures of a passport of said “Robert NORRIS” and company registration documents also claiming to be for FastMail Pty Ltd.

At the time of writing, we are still in debate with the Gandi Legal Department about whether they can even show us these documents. They claim that French law forbids them from showing us documents which purport to be from us. This is something to be aware of when choosing a vendor – different companies operate in different jurisdictions. There’s also a certain degree to which the conservatism of legal departments (protect the company as much as possible) conflicts with the corporate motto (“no bullshit”). The first response we got was certainly bullshit – “in order to meet a legal or regulatory obligation”. We challenged them to give an actual legal obligation and were given Article 226-15 of the French Criminal Code, along with a rough English translation as follows:

“The act, committed in bad faith, of opening, deleting, delaying or diverting correspondence, whether or not it arrived at its destination, and addressed to a third party, or to fraudulently gain knowledge thereof, is punishable by one year of imprisonment and a fine of 45,000 EUR.”

We don’t believe that law is relevant – it’s the “no interception” law that exists everywhere, and doesn’t forbid anyone from quoting documents in replies to the purported source of those documents. If the law really was as Gandi Legal seem to be interpreting it, it would be illegal to quote an email in your response unless you were certain that the source address hadn’t been faked.

Was this a “security flaw”?

Security is built in layers, and I would definitely say that the fact that we received that email means one of the layers was weaker than it should have been. Partly it’s a poor choice of wording (Gandi claim that they would not necessarily have changed the email address within 24 hours, depending on other investigations).

It still would have been necessary to either disable or reset the two-factor authentication on our account as well for the attacker to get full control. That’s difficult, but not necessarily impossible. After the fact, there’s no way to know how it would have gone down. We certainly weren’t willing to take the risk of doing nothing and seeing what happened!

What we do know is that the attacker was very determined, and willing to go as far as forging documents while simultaneously generating noise to make us less likely to notice the attack. They must have figured they had a chance.

Improving security for the future

A disadvantage of adding something like two-factor authentication after the fact is that you may miss the interactions with your existing processes. Gandi’s paper “email reset” form makes a lot of sense in the world where most of their customers are individuals or small businesses with one or two domains, and using addresses that they may lose access to. With no other factors, if they lose access to the email address and forget their password, there needs to be a process to regain access.

It’s always great to have a consistent process. Having a consistent process means that attackers can’t just try their luck until they find someone who is more trusting than average. Australia has a fantastic system called the 100 point check for authenticating people. We like process, consistently applied.

The problem we have is that we didn’t expect that the account email address could be changed without any reference to our two factors at all. Maybe nobody at Gandi realised either. That’s a security flaw – even if it doesn’t mean everything is totally broken.

We have had some very frank discussions with Gandi over the past week, and they agreed to make all three of the improvements we proposed as a result of these events:

  • the setting “disable password resets via email” was not on the security settings page of their website. Because of this, we hadn’t discovered and enabled it. They are moving it to the security page.
  • if an account has 2FA enabled, a red flag will automatically be raised against the request, meaning significant extra investigation will be done.
  • if an account has 2FA enabled, then an active confirmation will be required from the owner of the account before changing the email address. This means it will be harder to regain access if you lose all your factors, but that’s a good thing! Turning on 2FA means you want it to be hard for anyone who doesn’t have those two factors to gain access.

These steps will make attacks against Gandi accounts even more difficult in future, and we applaud their efforts to improve security and willingness to listen to our concerns.

There is one other measure that we have suggested which is still under discussion: requiring the TOTP code to be entered on the password reset form, rather than using a secret question. We believe secret questions are bogus security, and we have an appeal to authority to back us up.

Gandi have blogged about this as well, and also given some general advice on keeping your account with them secure.

Conclusion

FastMail came out of this attack unscathed. Our domains are now even more secure, because Gandi has tons of proof on file about who we are and who our company is. Also Gandi’s processes have become more secure as a result of our experiences, so we are confident that we can safely keep our domains with them.

An important lesson learned is that just because a provider has a checkbox labelled “2 factor authentication” in their feature list, the two factors may not be protecting everything – and they may not even realise that fact themselves. Security risks always come on the unexpected paths – the “off label” uses that you didn’t think about, and the subtle interaction of multiple features which are useful and correct in isolation.


Cleaning up from an IMAP server failure

This blog post is highly technical. I cover details about how our email storage system works and how it was impacted by a complex server corruption and failure. I explain why our normal procedures failed in this instance, the ways in which our system helped us to track down and restore almost all the impacted emails, and some improvements we are making for our system to be more resilient in future.

Background

To show how things fit together, I’m going to start with an overview of our infrastructure and internal terminology. Some details have been shared before, but the exact setup changes over time. If you want to get straight on with the story, skip to the heading “The failure” below.

A slot

Each mail server is a standalone 2U or 4U storage server. Storage on each server is split into multiple slots (currently 15, 16 or 40 per machine depending on hardware model).

All current slots are called “teraslots”, consisting of:

  • A 1TB partition for long-term email storage.
  • Space on a shared high-speed SSD for indexes, caches, recently delivered email, etc.
  • Space on a separate partition for long-term search indexes.
  • Space on a RAM filesystem for search indexes of recently delivered email.

All partitions are encrypted, either with LUKS or directly on the hardware if supported.

Every slot runs an entirely separate instance of the Cyrus IMAP server, complete with its own configuration files and internal IP address. Instances can be started and stopped independently. Configuration files are generated by a template system driven from a master configuration file, for consistency and ease of management.

A store

A store is the logical “mail server” on which a user’s email is stored. A store is made up of a number of slots, using Cyrus’ built-in replication to keep them synchronised. At any time, one slot is the master for a store, and it replicates changes to all the other slots.

A normal store currently consists of three slots, two in New York and one in Iceland. To move a slot between machines, it’s easiest to configure a new replica slot, wait until all data is replicated and the new slot is up to date, and then remove the unwanted replica slot. In the case where we’re moving things around, a store may consist of more slots – there’s no real limit to how many replicas you can run at once.

We spread the related slots of different stores such that no one machine has too many “pairs” to another machine. This means that if a single server fails, the load spreads to many machines, rather than doubling the load on one other box.

For each store, one slot is marked in our database as the “master slot”. In the past we used to bind an IP address to the master slot and use magic IP failover, but no more. Instead, all access is either via a perl library (which knows which slot is the master) or via the nginx frontend proxy, which selects a backend using the same perl library during login.

Replication

The cyrus replication system doesn’t record the actual changes to mailboxes: it just writes “something changed in mailbox X” to a log file (or in our multi-replica configuration, to a separate log file per replica).

A separate process, sync_client, runs on the master slot. It takes a working copy of the replication log files that other cyrus processes create. If the sync_client process fails or is interrupted, it always starts from the last log file it was processing and re-runs all the mailboxes. This means that a repeating failure stops all new replication for that replica, which becomes relevant later.

The sync_client process combines duplicate events, then connects to a sync_server on the other end and requests some information about each mailbox named in the log. It compares the state on the replica with the state on the master, and determines which commands need to be run to bring the replica into sync, including renaming mailboxes. Mailboxes have a unique identifier as well as their name, and that identifier persists through renames.
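
In outline, the log handling looks something like this Python sketch (the real Cyrus log format and processing carry more detail, so treat this as illustration only):

def mailboxes_to_sync(log_lines):
    # collapse duplicate events, keeping the order mailboxes were first seen
    seen, ordered = set(), []
    for line in log_lines:
        name = line.split(None, 1)[1].strip()           # e.g. "MAILBOX user.brong.Trash"
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered

print(mailboxes_to_sync(["MAILBOX user.brong",
                         "MAILBOX user.brong",
                         "MAILBOX user.brong.Trash"]))
# ['user.brong', 'user.brong.Trash'] -- each is then compared master vs replica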

We have extensively tested this replication process. Over the years we have fixed many bugs and added several features, including the start of progress towards a true master-master replication system. We still run a task every week which compares the state of all folders on the master and replica, to ensure that replication is working correctly.

Failover

Our failover process is fairly seamless. If everything goes cleanly, the only thing users see is a disconnection of their IMAP/POP connection, followed by a slow reconnection. There’s no connection error and no visible downtime. Ideally it takes under 10 seconds. In practice it’s sometimes up to 30 seconds, because we give long-running commands 10 seconds to complete, and replication can take a moment to catch up as well.

Failover works like this (failover from slot A to slot B):

  1. Check the size of log files in replica-channel directories on slot A. If any are more than a couple of KB in size, abort. We know that applying a log file of a few KB usually only takes a few seconds. If they’re much bigger than that, replication has fallen behind and we should wait until it catches up, or work out what’s wrong. It’s possible to override this with a “FORCE” flag. We also check that there’s a valid target slot to move to, and a bunch of other sanity stuff.
  2. Mark the store as “moving” in the database.
  3. Wait 2 seconds. Every proxy re-reads the store status every second from a local status daemon which gets updates from the database, so within 1 second, all connection attempts to the store from our web UI or from an external IMAP/POP connection will be replaced by a sleep loop. The sleep loop just holds off responding until the moving flag gets cleared, and it can get a connection to the new backend.
  4. Shut down master slot A. At this point, all existing IMAP/POP connections are dropped. It will wait for up to 10 seconds to let them shut down cleanly before force closing the remaining processes.
  5. Inspect all channel directories for log files on slot A again. Run sync_client on each log file to ensure that they sync completely. If there are any failures, bail out by restarting with the master still on slot A.
  6. Shut down the target slot B.
  7. Change the database to label the new slot B as the master for this store.
  8. Start up slot A as a replica.
  9. Start up slot B as the master (this also marks the store as “up”).

Within a second, the proxies know that the store is available again and they continue the login process for waiting connections.

Unclean failover

In the case of a clean failover, all the log files are run, and the replica is 100% up-to-date.

If for some reason we need to do a forced failover (say, a machine has failed completely and we need to get new hardware in), then we can have what’s called a “split brain”, where changes were written to one machine, but have not been seen by the other machine, so it makes its own changes without knowledge of what has already happened.

Split-brain is a common problem in distributed systems. It’s particularly nasty with IMAP because there are many counters with very strict semantics. There’s the UID number, which must increase without gaps (at least in theory) and there’s the MODSEQ number, which similarly must increase without changes ever being made “in the past”, otherwise clients will be confused.

Recovering from split brain without breaking any of the guarantees to clients was a major goal of the rewrite of both the replication protocol and mailbox storage which I did in 2008-2009. These changes eventually led to Cyrus version 2.4.

Anti-corruption measures

We also want to be safe against random corruption. We have found that corruption occurs a couple of times per year across all our servers (the cause is hard to pin down; most likely hard drive or RAID controller issues, from what we’ve seen), and we were bitten hard in the past by a particularly nasty Linux kernel bug which wrote a few zeros into our metadata files during repack.

Since then we have added checksums everywhere. A separate crc32 for each record in a cyrus index, cache or db, and a sha1 of each message file. Cyrus also sends a crc32 of the “sync state” along with every mailbox, allowing it to determine if the set of changes did actually create the same mailbox state during a sync. A sync_crc mismatch triggers a full comparison of all data in that mailbox, allowing sync_client to repair the error and resync the mailbox.
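
In outline, the two kinds of checksum look like this (a Python sketch of the idea, not the actual Cyrus code):

import hashlib, zlib

def message_guid(path):
    # sha1 of the raw message file -- the "guid" that appears in our logs
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def record_crc(record_bytes):
    # crc32 stored alongside each index/cache/db record and verified on read
    return zlib.crc32(record_bytes) & 0xFFFFFFFF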

And now you know enough about our architecture to understand what happened!

The failure

On Thursday 27th February, at 4:30am Melbourne time, Rob N (the on-call admin for the night) was paged by our monitoring system because one of our servers (imap21) was not responding. You can read his initial blog post about the incident as well.

He initially thought it was just a crashed server – there are various reasons why complex systems crash, and it wasn’t obvious which one it was. Our normal procedure is to restart the server and restart all the IMAP slots, replicate any remaining changes off the machine, then fail over to replicas while the search indexes catch up.

Unfortunately, things were worse than that. It started up with significant corruption on the operating system partition, and one of the 15 IMAP partitions failed to mount. At this point he force-failed-over (meaning moving the IMAP server master to its New York replica without doing a full sync first) and went back to bed to get some sleep and deal with things in the morning.

Initial cleanup

I was away during the first couple of days of this, on school camp with my daughter in one of the few places in the world where you can’t even get phone reception, but I have notes from the others and log files to go on.

On Thursday morning, there were a number of support tickets from people who were missing a few days of recent email. A quick look at the logs showed that one of the 11 master slots (of the 15 on this server, two were empty and two were replicas) had not been replicating for a couple of days.

The cause of this was an edge case issue with our replication system. If there is a loop in mailbox renames (i.e. folder ‘A’ was renamed to ‘B’ at the same time that folder ‘B’ was renamed to ‘A’, or a more complex circle of names), then the replication system bails out and tells us it needs to be fixed.

Normally I would notice this, and the others knew about the possibility as well, but over the previous few weeks I had been moving slots around in Iceland as part of our efforts to split entirely from the Opera network while retaining full offsite redundancy. There had been a lot of cyrus noise in the notification channel, and nobody picked up that this noise pointed to a more significant issue.

As mentioned above, this single issue in the working log file means that no new events can be processed. Interestingly, changes to other mailboxes in the same file might get replicated, because it will re-visit them over and over, but other newly-changed mailboxes will not be noticed.

To make things worse, the failed channel was also the “new user server”. Every week a task runs, calculates which store is least loaded, and sets a database field to direct newly created users to that store. So there were brand new users, and some of them had used our import facility to bring large amounts of email in.
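
The selection itself is nothing fancy – conceptually it just picks the minimum, as in this toy Python version with made-up store names and load figures:

user_counts = {"store 23": 8120, "store 41": 6540, "store 57": 9050}
new_user_store = min(user_counts, key=user_counts.get)
print(new_user_store)   # the database field for new signups points at this store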

Re-replication

The usual fix in this sort of case is to re-run the log files, even though the slot is now a replica. Because we keep a sha1 of the content of each message, the replication system can detect different files at the different ends. It can then fix up mailboxes by re-injecting BOTH messages with new UIDs, higher than any yet used, and then expunging the original UID at both ends. This means that no matter which one a client saw first, it now gets a repaired view of the mailbox with both messages in it. I wrote a long justification for this logic, and why we do it, which you can read at https://lists.andrew.cmu.edu/pipermail/cyrus-devel/2010-July/001500.html.
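
As a toy illustration of that “both ends get both messages” rule (real Cyrus tracks far more state than a UID-to-GUID map, so this is only a sketch):

def repair_divergent_uid(master, replica, uid):
    # master/replica: dicts mapping UID -> message GUID (the sha1 of the file)
    if master.get(uid) == replica.get(uid):
        return                                        # same file at both ends: nothing to do
    next_uid = max(list(master) + list(replica)) + 1  # higher than any UID yet used
    for guid in {master.pop(uid, None), replica.pop(uid, None)} - {None}:
        master[next_uid] = replica[next_uid] = guid   # re-inject at BOTH ends
        next_uid += 1

m = {1: "aaa", 2: "bbb"}
r = {1: "aaa", 2: "ccc"}
repair_divergent_uid(m, r, 2)
print(m, r)   # UID 2 is expunged everywhere; "bbb" and "ccc" reappear with new UIDs at both ends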

We also use the same method in reconstruct to repair from a damaged mailbox on disk, but I’m about to re-visit that based on the below.

So almost all messages within existing mailboxes were restored by this, except where things were broken for other reasons.

Bogus reconstructs

Because sync wasn’t working, the team in Melbourne started running reconstructs to repair broken folders. In theory, this was a great idea – it’s what I would have recommended.

In practice though, the data on the machine was badly broken in subtle ways, and this just made things worse. Looking through the log files, I see numerous cases where reconstruct determined that the stored sha1 of the message was not the same as the sha1 of the file on disk, and so injected the message again with a new UID. This should never happen unless the file is actually corrupted. Looking at the files on disk now, their sha1 is actually the old value, not the new one that it calculated during the reconstruct.

So the machine either has a faulty CPU, faulty memory, dodgy RAID card, intermittently faulty disk… whatever it is, it’s pretty horrible! We will be able to test it more once we’re happy that all the salvageable data is copied off.

Due to the faulty hardware, the attempts to make things better actually made things worse!

The worst cases were where reconstruct decided that a folder didn’t exist, and wiped it from the mailboxes database. The replication engine then wiped the copy on the replica. Likewise, where mailboxes didn’t exist at the other end due to not being replicated yet (see above), they got wiped from the original copy. These are the cases where, for 49 users, we can see that some emails were lost entirely.

Thankfully, repeated checksum mismatches caused replication to bail out in many places, saving us from a lot worse pain.

Time to cut our losses

On the morning of Friday 28th, when I got back into phone range, I pulled out my laptop on the bus home and inspected the damage. Sync was still bailing out everywhere.

First thing I said was “turn off all the replication and reconstructs – let’s stop making things worse”.

I had an empty machine (imap14) which I had just finished clearing out in New York (see above about all the noise – lots of reconfiguring to get everything into consistent teraslots). I configured new replica pairs on imap14 for all the slots that used to be paired with imap21. It took about 4 days to get everything fully replicated again (moving that much data around, including sanity checking and building new search databases, takes time), but we’re now fully replicated, with no fewer than 3 slots for each store, one of them offsite.

I left imap21 up, but with nothing talking to it or trusting it any more.

Examining the mess

The first job was to determine the extent of the damage. Thankfully, the log partition on imap21 wasn’t damaged, so we had backup copies of all the log files from everywhere on our log server by the time I got home. These aren’t the ‘sync log’ files from above; they’re the general syslog used to monitor and audit the everyday operations of the system. We keep them around for a few months to help track down the history of problems.

The first step was to identify what had happened, and locate any messages that hadn’t been replicated. We syslog important actions within Cyrus, in a consistent format with the label ‘auditlog:’.

Here’s an example of a recent message appended to my mailbox and then moved to Trash:

2014-03-10T03:11:04.544356-04:00 imap20 sloti20t12/lmtp[23349]: auditlog: append sessionid=<sloti20t12-23349-1394435464-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.374972-04:00 imap20 sloti20t12/imap[22916]: auditlog: append sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.390532-04:00 imap20 sloti20t12/imap[22916]: auditlog: expunge sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>

(The sessionid field can be used to track the exact login time, IP address, etc – and for deliveries it can link through the logs to find the server which sent it to us originally).

It was appended to my INBOX via LMTP, then appended to my Trash folder and expunged from my Inbox. The ‘guid’ field is the sha1 of the underlying message.

Over on my New York replica slot:

2014-03-10T03:11:04.544356-04:00 imap20 sloti20t12/lmtp[23349]: auditlog: append sessionid=<sloti20t12-23349-1394435464-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.374972-04:00 imap20 sloti20t12/imap[22916]: auditlog: append sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.390532-04:00 imap20 sloti20t12/imap[22916]: auditlog: expunge sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>

And of course, the same in Iceland:

2014-03-10T03:11:06.127098-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: append sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:33.575200-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: expunge sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:33.833070-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: append sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>

Anyway, we found something else interesting in the log files. For nearly 24 hours before the hard crash, there had been random transient failures in the replication logs. It would attempt to replicate a change, get a checksum error, try to replicate the exact same change, and the second time it would work! The machine was obviously intermittently failing well before it finally crashed.

We also found that two of the other master slots had encountered replication errors before the hard crash. The errors had caused replication to repeatedly bail out and they were also falling behind – this was enough noise that anyone would have noticed something was wrong – but it started about 2am our time, and everyone was asleep.

Based on user tickets and the state of the machine before the crash, we knew that there was still email missing. It was time to try to identify it. We determined the following algorithm to find messages that needed to be recovered.

  1. If there was an ‘append’ on imap21 via something that was NOT sync_client or syncserver, but no matching append on the replica with the same guid, that’s bad.
  2. If there was an ‘expunge’ on the replica by a sync program that can’t be matched to an expunge by an imap or pop3 process, then that was a message which shouldn’t have been wiped (but see above about the re-UIDing process; if we find another append to the same folder with the same GUID in either case, then the lower-UID one isn’t interesting any more).
  3. If a folder was deleted by a sync tool without a matching intended delete, then we have a problem.
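
Mechanically, that meant parsing the auditlog lines shown earlier and matching appends and expunges by GUID. A rough Python sketch of the parsing step (the regex is simplified and the sample line is adapted from the example above):

import re

AUDIT = re.compile(r"auditlog: (?P<action>\w+) .*?mailbox=<(?P<mailbox>[^>]+)>"
                   r".*? uid=<(?P<uid>\d+)> guid=<(?P<guid>[0-9a-f]{40})>")

line = ("2014-03-10T03:11:04 imap20 sloti20t12/lmtp[23349]: auditlog: append "
        "sessionid=<s> mailbox=<user.brong> uniqueid=<6af857f64475158a> "
        "uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<c>")

m = AUDIT.search(line)
print(m.group("action"), m.group("mailbox"), m.group("uid"), m.group("guid"))
# appends on imap21 with no replica append for the same guid went on the recovery list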

Replication – the bad bits

I discovered, to my horror, that when replicating an expunged event from the master, the sync protocol ignored the sha1 of the message. It should either have checked that it was the same message file that was expunged on the replica as well, or just bailed out. This meant it would expunge a never-seen message from the replica if the same UID had been seen and then deleted on the master. Ouch.

This is also a flaw with folder deletes. When a deleted folder is replicated, it just removes the folder and message files from the disk on the replica. That’s wrong and should never be allowed. To ensure we can always recover accidentally deleted mail, any deleted folders are supposed to be moved to a special DELETED namespace. They’re only permanently deleted a week later by a cleanup process.

The end result of all this: some new folders created after the failover from imap21 were irreversibly deleted when we tried to sync over the missing messages from imap21.

Recovery of messages

In either case – an append that never synced, or an expunge that wasn’t justifiable – the fix is the same: find a file with the same sha1 in any of that user’s mailboxes, or their backup, or on disk on imap21, and that’s your message. We did that for nearly half a million emails in total, over half of those belonging to 3 users who had just imported a large amount of email from external services.

After that, we were left trying to recover a few folders which got wiped. The fix in that case is to recover from backup (hopefully most of the messages) and then search for any appends since the highest UID in the backup, and append those as well (unless a justifiable expunge can be found for them).

The great thing about the log files containing a GUID which is the sha1 of the message is that you can determine with cryptographic certainty that you’ve found the right file. We pulled files from imap21 (we don’t have to worry about corruption there – the sha1 check will catch it), files from the replicas, and even the backup server for the user to look for matching messages.
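
The hunt itself boils down to hashing candidate files until one matches. Something like this sketch (paths and layout simplified):

import hashlib, os

def find_by_guid(guid, roots):
    # roots: replica spools, the user's backup area, the suspect imap21 spool...
    for root in roots:
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    if hashlib.sha1(f.read()).hexdigest() == guid:
                        return path       # a match is cryptographically certain
    return None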

Some statistics

  • 370 affected users
  • 458471 found messages
  • 29837 lost messages (no sha1 file able to be found)
  • 32 lost folders (deleted, nothing in backup)
  • 49 users with SOMETHING lost

We have emailed all affected users with details of exactly what was lost and recovered for their account.

Making sure this does not happen again

We take our uptime and reliability very seriously. Losing anything is a major failure for us, and we are determined to take steps to ensure this situation cannot happen again.

Operational changes

  • We now page the on-duty admin if replication to all replicas falls more than 5 minutes behind.
  • We alert (non-emergency) if ANY replica falls more than 5 minutes behind. This alert is made to look different from the Cyrus notices that can happen during regular maintenance, so everybody will be aware that there is an issue.
  • We will no longer attempt to reconstruct or keep replication going if we suspect a faulty server (at least until replication is super-safe). We’ll be much quicker to just declare the machine faulty and recover messages independently.

Software changes

  • The immediate fix is to do the GUID test even if the message was expunged on the master. This avoids the nasty case of expunging an unseen message during split brain.
  • The larger fix is two-pronged. First make reconstruct a lot less likely to damage existing mailboxes in case of corruption. There’s no point re-injecting a corrupt message file, nobody wants a file full of random blocks off disk. If the sha1 doesn’t match, then abort and get the admin to fix the permission problems or check the disk first.
  • The second prong is changes to the replication system, so that it’s impossible to make the replica delete message files immediately via the replication protocol. The most you will be able to do is make it move them aside for later removal on the scheduled rotation (we run on Saturdays and delete things that have been marked for deletion for at least one week at that point – so between 1 and 2 weeks).
  • Finally, we want to integrate the ability to fetch files from a replica to repair a corrupt local copy. We already have a separate magic perl script that can do this, but it runs outside Cyrus, with its own ugly locking tricks to force Cyrus to accept what it’s doing. It would be much neater to have this integrated into reconstruct, so that it’s replication-aware.

Finally

This has been a long saga (2 weeks from failure to restoring everything we could), and a learning experience for everyone involved. Our detailed logging and checksums meant we were able to recover the vast majority of messages affected by the corruption, but we are obviously unhappy that we lost anything, and we will be taking the steps outlined above to prevent this issue happening again.

We apologise to the 370 users who have been without some email for over a week while we took our time making sure we fully understood what had happened and what needed to be recovered. We’re especially sorry for the small number of people for whom we lost emails irretrievably. We have contacted these people individually.

We’re proud of our reliability track record at FastMail and we respect the trust that people place in us to store their email. We are working hard to restore and maintain that trust, both by being open about exactly what happened with this incident and by updating the system and processes we have in place to improve our resilience in the future.


Diary of an outage

As some of you are no doubt aware, yesterday we had a fairly serious outage. It only affected a small number of users, but for them it meant some 4-6 hours with partial or no access to mail. In this technical post I’ll explain exactly what happened and what we are doing to fix the problem and ensure it doesn’t happen again in the future.

For the non-technical readers, you can skip to the last paragraph.

Early morning page

At around 4:30am Melbourne time (around 17:30 UTC) I was paged by our automated monitoring tools. We have extensive monitoring of most aspects of our infrastructure, and the system has permission to wake people up if it notices a problem. In this case, it noticed that one of our many backend mail servers, imap21, was no longer reachable. Unlike many other service providers we don’t do automatic failovers, because if the software makes the wrong decision it can lead to worse problems than might otherwise occur. We prefer to put a human into the loop to make sure things look sane before taking any action.

A machine failure of some sort is not as uncommon as you might think. Usually it’s a single disk failure, but we’ve variously had disk controllers fail, power supplies fail, and kernel or other bugs cause machines to crash. In the event of a crash, the machine usually reboots. The normal procedure for a night-time failure is to failover all mail stores on the machine to replicas on other machines. That’s anywhere between 15 and 40 stores depending on hardware configuration. We prefer to return service to the original machine if possible so we can maintain the replicas and thus keep our redundancy level, but that’s usually not possible in the case of a hardware fault – the machine is no longer fit for service.

The failover process involves shutting down both the current “master” slot and the target replica slot, reconfiguring them such that the master becomes the replica and the replica the master, and then starting them both up. The system configuration database is also updated so that all other services (web client, IMAP/POP frontends, mail delivery) know where to find the stores now. This is all done by a scripted process that can be initiated with just one command. In the event that the current master slot is unavailable, that part is skipped.

This morning imap21 failed in a fairly serious way. It was not available on the network. The management console showed that it had tried to reboot, found corruption on the operating system disk, tried to repair it, couldn’t, and was waiting for manual intervention. That immediately told me this was a serious problem and the best course of action was not to try to repair it, but instead to fail over its 15 slots to replicas. I did this, checked our monitoring to confirm all systems were operating normally, and went back to bed. I knew that full repairs would potentially be a big job, and I wanted enough sleep to be able to do it without messing things up further. As far as I was able to tell, service had been fully restored to users, which is the single most important thing.

Replication logs

Fast forward to 9am. I arrived at the office after a bit more sleep at home and on the train (you take what you can get!). On arrival I was told that a number of users had been reporting either missing mail or errors on login. The first thing I thought of turned out to be correct: at least one of the replicas I had failed over to was not up to date. But first we had to prove it!

The Cyrus replication model is an “eventually consistent” one, fairly common among database server software (which is what Cyrus is, though optimised for email storage). When some action is performed (delivering a message, creating a folder, moving a message, etc), the operation is performed on the local data store, and a record of the operation is written to a replication log. A separate pair of programs (sync_client and sync_server) use that log to perform the same actions on the replicas. Typically that happens in near realtime, but if replication was in fact behind, then we should see lots of unprocessed lines in the replication log.

Getting to the replication log was something of a problem because, as noted above, imap21 refused to boot. A filesystem check was not pretty – literally thousands of errors. That’s quite worrisome, as this is a root disk which generally shouldn’t have any corruption on it (it almost never receives writes), but figuring out why the crash happened was something of a secondary concern at this point. So I reinstalled the OS into the root partition. That’s something we do all the time, and it’s fast and accurate. Twenty minutes later it was fully reinstalled, the machine was up and running, and we got to inspect the replication logs.

The good news was that 14 of the 15 stores on this machine had fully up-to-date replicas, so there was only one mail store to deal with. The bad news was that the remaining store had some 2-3 days’ worth of unreplicated events in its log. To add more pain to the situation, that store was coincidentally the current “new user” store, where new users are created. That meant that any users who had signed up in the last 2-3 days did not exist at all on the replica, hence the reported login errors. A horrible first experience!

Examining the log, we discovered that right at the top there was a complex series of folder renames within a single replication event. This is not a particularly unusual operation. This time it tripped a known, rare bug in the way renames are replicated that caused the replication process (sync_client) to abort. The Cyrus master daemon starts it up again, but then it hits the same point and dies again, over and over. Replication stops.

Fixing this is on our list of things to do. Because it’s a fairly rare thing to happen and usually gets dealt with quickly, it hasn’t quite made it far enough up our list to be dealt with. Obviously that has now changed, but let’s talk about why it wasn’t dealt with this time.

Monitoring the monitors

As noted, we have a lot of system monitoring running, and we place a huge amount of faith in it. Almost by definition, if no problems are being reported, then there are no problems, at least in overall system health. Of course, individual per-user problems can appear from time to time, which is unfortunate. A support ticket is the usual way to get to the bottom of individual problems like that.

Something we monitor is replication lag, which is almost entirely based on the size of the replication logs. If they grow “too large”, a low-level warning is produced. A warning of this type results in a message being posted to our IRC channel and an email being sent, but it will never generate a page. It’s the kind of thing that we look at and action every now and again, and as noted it’s a fairly rare event under normal operation.
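
In essence the check just looks at log file sizes, something like this sketch (the directory layout and threshold here are illustrative, not our real configuration):

import glob, os

MAX_BYTES = 64 * 1024    # hypothetical "too large" threshold

def lagging_channels(sync_log_dir):
    # one pending log per replica channel; a large file means unreplicated events
    logs = glob.glob(os.path.join(sync_log_dir, "*", "log*"))
    return [(p, os.path.getsize(p)) for p in logs if os.path.getsize(p) > MAX_BYTES]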

There is, however, one time when that warning can occur and yet not indicate a problem: when users are being moved around a lot. User moves are done via the replication system, and when you’re doing a lot of them at once they can generate a lot of replication log traffic, sometimes causing replication to lag significantly for short periods of time. Something we’ve been doing for the last few months is redesigning our disk layout into something much easier to reason about and work with, and that has required a lot of moves. This particular warning has been far more common than it should be recently, leading to the situation where we’d started ignoring it.

Obviously this is a dangerous place to be. I recognise how bad that sounds – “system warnings were ignored” – but this is what happens when you have a warning that doesn’t quite match up with the importance of the situation. Think of it like the fuel light on your car. It needs to be calibrated to come on at just the right time. Too late, and you run out of fuel before you can refill. Too early, and you start to learn that it isn’t really a problem; you won’t run out of fuel for ages. At that point you might as well not have the light at all. It’s not a perfect analogy, but it’s instructive.

So far we haven’t figured out the original cause of the filesystem corruption that led to the crash (we suspect a hardware failure that isn’t visible to our normal tools). But that’s not quite the point. Had appropriate action been taken when the replication lag was first noticed, we would have had fully up-to-date replicas at failover, and this entire situation would have been little more than a ten-minute outage and me being a little tired for the day.

Now what?

Obviously, we have a few things to do out of this!

  • The replication lag monitor needs to be able to send a page when things are getting bad. Our current thinking is that it should page if there isn’t at least one replica less than five minutes behind.
  • It also needs to understand that there are other things that can cause lag and compensate, which means it needs to know what user move operations are in progress. That said, replication falling behind is always a little dangerous, so we might be better to somehow change the way moves happen so that we always have at least one viable replica at all times. We’ll need to consider this in more detail.
  • We need to make the replication system able to cope with hard replication failures. One idea we’re currently considering is to put failed replication events into another logfile, and come back to them once the main log is empty. This needs a bit of thought, particularly around the consequences of operations being applied out-of-order.
  • We need to fix the renaming bug. Having a last-ditch protection against replication problems is great, but even better than handling problems is avoiding them in the first place.

Conclusion

It’s now 13 hours since the problem was fully understood and recovery began. We believe that most if not all missing mail and broken accounts have been repaired. If you were unfortunate enough to be affected and you’re still seeing problems, please contact support.

We are very sorry for this outage. We understand how important your email is and how much it affects you when it’s unavailable. We’re proud of our track record on reliability, but we know we’ve dropped the ball on this one. To our new users, who are the ones most likely to have been affected by this outage, we understand very well how this makes us look. We’re working hard to get the situation resolved and to restore your trust. Thank you all for your patience and understanding.


Content Security Policy now on Beta

At FastMail, we’re always looking to increase security for our users. Cross-site scripting (XSS) attacks are one of the dangers that all websites must take care to mitigate. HTML email is the highest-risk area for all webmail providers: before embedding it into a page, it must be carefully checked and any potentially malicious content removed. In particular, all scripting content must be removed, otherwise an attacker could gain access to your account and email.

Due to the complex nature of HTML parsing and encoding, there are many ways that a malicious email might try to sneak through scripting content. That’s why we fully parse the HTML first on the server and sanitise it against a white-list of known-good tags and attributes. This ensures that any scripting content is stripped, and other ambiguous content is properly escaped and encoded.

We’re very careful, and we have lots of tests to ensure we protect against all known techniques for trying to embed scripts. However, there’s always a possibility of bugs in any software, and Content Security Policy, also known as CSP, is a new HTTP header that provides an extra layer of defence against these types of attacks.

With CSP, we can instruct all modern browsers to only ever load scripts from our own website. Any references to remote scripts or "inline" scripts will be blocked. This means if a malicious email somehow slips through our filters, the browser still stops it from doing anything dangerous.
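
As an illustration (not our exact header), a policy that restricts scripts to our own origin looks something like this, with report-uri pointing at a hypothetical endpoint that collects violation reports:

Content-Security-Policy: default-src 'self'; script-src 'self'; img-src 'self' https: data:; style-src 'self' 'unsafe-inline'; report-uri /csp-report

Anything that falls outside the policy – a remote script, an inline event handler – is simply not executed, and the browser can report the violation back to us.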

We’ve just rolled this out on our beta server (https://beta.fastmail.fm) for testing. We hope to roll it out everywhere soon. If you use our beta server, please let us know of any new issues you notice by emailing betafeedback@fastmail.fm. Some browsers may have issues with extensions: these should be allowed to run according to the spec, but some browsers may prevent them from doing so as a violation of the content security policy. If you have a problem with an extension at FastMail, please first try updating it to the latest version. If the issue still persists, please let us know so we can contact the extension authors.
