New XMPP/Jabber server

This is a technical post. Fastmail users subscribed to receive email updates from the Fastmail blog can ignore this post if they are not interested.

We’ve just replaced the XMPP/Jabber server we use for our chat service. Previously we had been using djabberd. While this worked well for us for the last few years, unfortunately it hasn’t been receiving much development recently. This means many newer XMPP extensions aren’t available.

We looked at a number of alternative servers: Tigase, Prosody, ejabberd and OpenFire. In the end, we settled on ejabberd because of its relative maturity, good administration documentation, widespread use in existing large installations, active development community and support for multiple domains (in the newest version).

Fortunately our existing architecture separated the XMPP/Jabber server from the backend storage details of our system (eg. user lists, user rosters, chat logging, etc) with an HTTP JSON API. Because of this, it was fairly straightforward to completely remove djabberd, write the equivalent interfacing components for ejabberd and slot them into place. It was a perfect two-month piece of work for our summer intern, Samuel Wejeus. Thanks Samuel!

That work has now been done, and yesterday we completely removed djabberd and replaced it with ejabberd. For users of our chat service, there shouldn’t be any noticeable difference at this point. Everything should just continue to work as it did, but with this new base we should be able to add more features in the future.

Posted in News, Technical. Comments Off

Secure Login now only login button

When Fastmail started in 1999, https (secure SSL connections over http) was still a relatively young protocol, originally released in 1994. It was considered fairly computationally expensive, and though technically supported by most popular browsers, there were numerous performance and compatibility problems (eg. early browsers never cached any https resources). So although we always offered https as an option via the "Secure Login" button, the default button was just a regular "Login" over unencrypted http.

Of course since then there have been massive changes in browsers, computing power and the average level of security required on the Internet (eg. Firesheep). Because of this, https connections are now recommended for all logins, and we’ve defaulted to "Secure Login" being the primary login for some time.

As a further step, today we’re making "Secure Login" the only button. There’s really no reason these days not to be using a secure login and having all data encrypted between your computer and our server.

For the very, very few cases where you might not want a secure login (eg. we’ve heard of some people in certain countries or on certain company networks having problems with https connections), you can click the +More link and there’s a link there that will switch you to a non-secure login screen. We highly discourage using this, however it’s an option of last resort if you need it.

Posted in News. Comments Off

Moving to a new credit card provider, may generate a small temporary charge

TL;DR: You may see a $1 charge appear on the credit card you have registered at Fastmail. This is temporary as part of a conversion process and will completely disappear within a few days. There is nothing you need to do.

We’re currently in the process of moving our credit card payment gateway to a new provider, Global Collect. Global Collect are a large and well respected provider of international payment services that will allow us to support more cards and payment methods in the future.

As part of the switching process, we’re securely converting all credit card details from being stored on Fastmail systems, to being stored at Global Collect. By doing this, we’ll be able to remove all credit card details from our system, something we’ve wanted to do for a while.

However there is one issue. To enter the credit card details into their system, we can’t just enter them as is, we have to enter them with a transaction so that the card can be checked and authorised.

We’re doing this by creating a dummy $1 authorisation against each card. With credit cards, it’s possible to create an authorisation on a card, but never actually complete the transaction. After a few days the authorisation times out, and the money is never actually taken from the account.

However the $1 authorisation is still something that banks will run their usual fraud checks against, potentially alerting you to the transaction via email or a phone call, and potentially having the transaction appear (temporarily, until it times out) on your online statement. Because of different gateways, the payment on your statement may appear to come from either "Fastmail" or "Opera Software".

If you do see something like this occur on your credit card, then there’s no need to worry. This is just a consequence of the transfer, and the $1 charge will completely disappear from your credit card in a few days. In some cases, it’s also possible that there may be two separate attempts to charge $1, because Global Collect will route payment attempts through multiple gateways if one appears to fail. Again, there is no need to worry about this.

We’re sorry for any inconvenience this has caused. We didn’t realise up front the full extent of the issues this would cause for some users, especially the surprising contact from their banks. We thought it would be basically an invisible process for all users.

At this point there is nothing you need to do. If your bank contacts you about the charge, you should tell them it’s ok and to process it. The $1 charge will completely disappear after a few days.

Once the conversion is complete, all Fastmail payment services should continue as normal. Account renewals should be automatic (unless you’ve explicitly disabled them), and you should be able to add funds/upgrade/downgrade from the appropriate Options screens.

Long term, we want to add additional payment options as supported by Global Collect. The first of these is likely to be PayPal, in a few months’ time.

Posted in News. Comments Off

Outage report – a cascade of errors

On Friday we had one of the worst outages we’ve had in over 3 years. For at least 3 hours, all accounts were inaccessible (web, IMAP and POP), and for a few users, it was several hours longer than that. For some other users, there were additional mailbox and quota problems after that.

Obviously this is something we never want to happen, and over the years we’ve set up many systems to avoid outages like this occurring.

Summary

A small “trivial” change to a configuration program that was rolled out caused a cascading series of events that resulted in some important files being corrupted. We had to take down all services and rebuild the corrupted files from the last backup and add in any changes since the backup. Once rebuilt, we were able to bring back up all services. A separate corruption issue that affected a few users caused some longer outages and quota issues until we fixed those mailboxes.

Future mitigation

We’ve identified the chain of events that caused the problems. Because it’s actually a chain of events, we’ve identified at least 5 separate issues to fix, so that this problem, or something similar to it won’t happen again, and we’ll be doing those over the next week.

Of course we can never be 100% sure that there aren’t other cascade event paths that will cause outages, but by learning from past mistakes, fixing the known problems, continuously enhancing our test infrastructure, and being more aware of possible consequential errors in the future, we’re always aiming to minimise the chance of them occurring and providing the highest reliability possible.

Technical description

Although it’s draining and frustrating to see a large problem like this occur that so badly affects our users, it’s been fascinating to actually investigate this problem. What we end up seeing is how a set of little mistakes, bad timing, decisions made long ago, and human error (all of which are wonderfully obvious in hindsight) end up causing a much bigger problem than the initial trigger would ever suggest.

The domino effect

In this case, the problem stemmed from a cascading sequence of issues that started with a single misplaced comma.

  1. Cyrus configuration file error. The underlying trigger of the cascade was a single misplaced comma in a configuration file. In this case the error was detected by the developer during testing, who fixed the problem, but unfortunately pushed the fix to a different branch.
  2. Core dump behaviour of fatal(). The effect of the broken configuration file was that immediately after forking a new imapd process, it would try to parse the configuration file, fail to do so, and call the fatal() function. Normally that would just cause the process to exit. However our branch of cyrus has a patch we added that means all calls to the fatal() function dump core instead of just exiting; this is normally very useful for debugging and quality control.
  3. Kernel configured to add pid to core files. We also configure our kernels with the sysctl kernel.core_uses_pid=1 which ensures that each separate process crash/abort() generates a separate core file on disk rather than overwriting the previous one. Again this is very useful for debugging.
  4. Cyrus master process doesn’t rate limit forking of child processes. The cyrus master process that forks child processes doesn’t do enough sanity checking. Specifically, if an imapd exits immediately after forking, the master process will happily immediately fork another imapd, despite there being zero chance that the new imapd will do any better. At the very least this leads to a CPU-chewing loop (as well as a non-functional imapd) as each forked imapd process immediately exits and the master creates a new one.
  5. Core files end up in cyrus meta directory. Cyrus supports the concept of separating meta-data files from email data. This is very useful as it allows us to place the small but "hot" meta data files on fast (but small) drives (eg. 10k/15k RPM drives, or SSD drives in new machines), and place the email data files on slower and much larger disks. The "cores" directory where core dumps end up is located on the same path as the meta data directory.
  6. Cyrus skiplist database format can corrupt in disk full conditions. Cyrus stores some important data in an internal key/value database format called skiplist. The most important data is a list of all mailboxes on the server. This database format works very well for the way cyrus accesses data, it’s been very fast and robust. However it turns out the code doesn’t handle the situation where a disk fills up and writes only partially succeed, causing database corruption.
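
To illustrate item 6, here is a minimal Python sketch of one common way to make an append-only record log tolerate partial writes: each record carries its length and a checksum, and a reader stops at the first record that is truncated or fails the check. This is not the skiplist format and not the Cyrus code; the framing and the names encode_record and read_valid_records are assumptions made purely for illustration.

```python
import struct
import zlib
from typing import List

# Hypothetical framing: 4-byte payload length + 4-byte CRC32, little-endian.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Frame a payload with its length and CRC32 so torn writes are detectable."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_valid_records(data: bytes) -> List[bytes]:
    """Return every complete, checksum-valid record, stopping at the first bad one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # truncated or corrupted tail: ignore everything from here on
        records.append(payload)
        offset = start + length
    return records
```

With framing like this, a write that is cut short by a full disk leaves a tail the reader can recognise and skip, rather than a database that can no longer be parsed.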

Putting all of the above together creates the disaster. A small configuration change was rolled out. Every new incoming IMAP connection would cause a new imapd to be forked and immediately abort and dump core. Each core file would end up with a separate filename. This very quickly caused the cyrus meta partitions to fill up. They reached 100% full before we fully realised what was happening. This caused changes to the mailboxes database to only partially write, causing a lot of them to become corrupted.

When we realised this is what had happened, we quickly stopped all services, undid the change, and tried to recover the corrupted databases. Fortunately the databases are backed up each half hour, and there are replicas as well. Using some libraries, we were able to quickly put together code that pulled any still valid records, records from the backup, and records from the replicas and combined them, and rebuilt the mailboxes databases, and then started everything back up.
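
The recovery code itself isn’t shown here, but the idea of combining several sources can be sketched roughly as follows. This Python sketch assumes each source is a simple mapping from mailbox name to record, and that sources are listed from most to least trusted (for example surviving valid records first, then a replica, then the last backup). That ordering, the data shapes and the name merge_mailbox_records are all assumptions for illustration, not a description of what we actually ran.

```python
from typing import Dict, Iterable, Mapping


def merge_mailbox_records(sources: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Merge record sources listed from most to least trusted.

    The first source that contains a key supplies its value; later
    sources only fill in keys that are still missing.
    """
    merged: Dict[str, str] = {}
    for source in sources:
        for key, value in source.items():
            merged.setdefault(key, value)
    return merged
```

For example, merge_mailbox_records([still_valid, replica, backup]) keeps every record that survived the corruption and only falls back to the replica or the backup for records that were lost.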

Adding insult to injury

Fortunately for most people, the mess stopped there. Unfortunately for a few users, there were some additional problems as well.

As well as the mailboxes database corruptions, it was discovered that the code that maintains the cyrus.header and cyrus.index files also didn’t like the partial writes that disk full conditions generate. This caused a small number of mailboxes to be completely inaccessible (Inbox select failed).

Fortunately cyrus has a simple utility to fix corruptions of this form called "reconstruct", so we ran that to fix up any broken mailboxes. Fixing up a mailbox with reconstruct however doesn’t fix up quota calculations, and that has a separate utility "quota" that you can run with a -f flag to fix quotas. We ran that on users to make sure all quota calculations were correct.

Unfortunately there’s a bug in the quota fix code that in some edge cases can double the apparent quota usage of users. This caused a number of accounts to have an incorrect quota usage set, and in some cases pushed them over their quota, causing new messages to be delayed or bounced.

The little test that cried wolf

However the story wouldn’t be complete without the additional human errors that let this happen as well. Thanks to help from the My Opera developer Cosimo, we have a Jenkins continuous integration (CI) server set up internally. This means that on every code/configuration commit, the following steps run:

  • Roll back a virtual machine instance to a known state
  • Start the virtual machine
  • git pull the latest repository code
  • Re-install all the configuration files
  • Start all services
  • Run a series of code unit tests
  • Run a series of functional tests that test logging into the web interface, sending email, receiving the email, reading the email, and much more. There’s also a series of email delivery tests to check that all the core aspects of spam, virus and backscatter checking work as expected.

This test should have picked up the problem. So what happened? Well a day or so before the problem commit, another change occurred that altered the structure of branches on the primary repository. This caused the git pull the CI server does to fail, and thus the CI tests to fail.

While we were in the process of working out what was wrong and fixing this on the CI server (only pulling the "master" branch turned out to be the easiest fix), the problematic commit went in. So once we fixed up the git pull issue, we then found that the very first test on the CI server was failing with some strange IMAP connection failure error. Rather than believing this was a real problem, we assumed it was due to something else with the CI tests being broken after the branch changes, and resolved to look at it on Monday. Of course the test really was broken due to a bad commit, and as Murphy’s Law would dictate, someone else would do a rollout on Friday Norway time of the broken commit.

Putting it all together

A combination of software, human and process errors all came together to create an outage that affected all Fastmail users for several hours at least, and some users even more.

Obviously we want to ensure this particular problem doesn’t happen again, and more importantly, that processes are in place where possible to avoid other cascade type failures in the future. We already have tickets to fix the particular code and configuration related issues in this case:

  • link cyrus cores and log directories to another partition
  • make skiplist more robust in case of a disk full condition
  • make cyrus.header/index/cache more robust in case of a disk full condition
  • cyrus master respawning should backoff in the case of immediate crashes
  • fix cyrus quota -f to avoid random quota usage doubling
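
As a rough idea of what the respawn backoff ticket could look like, here is a small Python sketch of exponential backoff keyed on how many children in a row have died almost immediately. It is not the Cyrus master code, and the names and default values (a 1 second "fast exit" threshold, a 0.1 second base delay, a 30 second cap) are assumptions chosen only for illustration.

```python
def update_fast_exit_count(count: int, lifetime_seconds: float,
                           threshold: float = 1.0) -> int:
    """Count consecutive children that exited within `threshold` seconds."""
    return count + 1 if lifetime_seconds < threshold else 0


def respawn_delay(consecutive_fast_exits: int, base: float = 0.1,
                  cap: float = 30.0) -> float:
    """Seconds to wait before forking the next child.

    No delay after a healthy exit; after repeated immediate exits the
    delay doubles each time, up to `cap`.
    """
    if consecutive_fast_exits <= 0:
        return 0.0
    # Bound the exponent so very long crash loops cannot overflow a float.
    exponent = min(consecutive_fast_exits - 1, 32)
    return min(cap, base * (2 ** exponent))
```

A master loop using this would sleep for respawn_delay(count) before each fork, which turns a tight fork/crash loop into at most a couple of core dumps per minute.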

We’ve also seen how important it is to keep the CI tests clean and to track down every failure. We’ve immediately added new tests that sanity check database, IMAP and SMTP connections as a very first step before any other functional tests are run. If any of them fail, we tail the appropriate log files and list the contents of the core directories, so the CI failure emails that are sent to all developers will make it very clear that there’s a serious problem that needs immediate investigation.
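
A sanity check of this kind can be sketched in a few lines of Python. The sketch below is an assumption about the general shape of such a test, not our actual CI code: it connects to a port and checks that the server sends the expected greeting (IMAP servers greet with "* OK", SMTP servers with "220"). Checking the greeting rather than just the TCP connection matters in a case like this one, where the listening process may accept the connection even though the child that should handle it dies immediately. The function name service_greets is hypothetical.

```python
import socket


def service_greets(host: str, port: int, expected_prefix: bytes,
                   timeout: float = 3.0) -> bool:
    """Return True if the service sends a greeting starting with expected_prefix.

    Assumes the first bytes of the greeting arrive in a single read,
    which is good enough for a short prefix such as b"* OK" or b"220".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            greeting = conn.recv(512)
    except OSError:
        # Refused connections, resets and timeouts all count as failures.
        return False
    return greeting.startswith(expected_prefix)
```

For example, service_greets("localhost", 143, b"* OK") and service_greets("localhost", 25, b"220") could run before any functional tests, with the CI job failing fast if either returns False.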

Posted in News, Technical. Comments Off

HTML editor upgraded from FCKEditor 2 to CKEditor 3

We’ve just upgraded the HTML editor we use on the Compose screen in rich text mode from FCKEditor 2 to CKEditor 3. This new editor should be faster to load and edit than the old one. Additionally the new editor works with Internet Explorer 9 which was released today as well.

Posted in News. Comments Off

Fastcheck users need to upgrade to latest version – 2.2.0.0

Users of Fastcheck should ensure they upgrade to the latest version, 2.2.0.0, available at http://www.fastcheck.org. Because of a change to the underlying protocol, some features may no longer work with older versions.

Posted in News. Comments Off

Operamail.com has been migrated to Fastmail.FM

Operamail.com is finally back where it belongs: Hosted and run directly by Opera Software.

Over the past few months, we’ve been preparing for the migration of operamail.com to Fastmail’s servers. That migration has finally started today. Over the last hour, we’ve switched the domain operamail.com across to point to Fastmail’s servers, so users going to http://www.operamail.com will see the new login screen. Existing users can log in with their existing operamail.com username and password (see below regarding some accounts showing a different name after logging in).

Any new email will be delivered directly to your new Fastmail based account, however we have to migrate existing email from the old servers to the Fastmail servers, so you may not immediately see all existing folders and emails. We’re doing that in the background, and prioritising users based on their most recent login time, so over the next few hours, you should see all your existing folders and email re-appear in your account.

We’ve also migrated across all other information we could, such as address books, some preferences, etc.

A quick summary of some of the advantages of Fastmail over the existing accounts

Other notes:

  • Login problems – For a short time during the migration some users could not log in. This should be fixed now.
  • Username changed – Some users had usernames on the old server that were not compatible with the new server. In those cases, we’ve renamed the account to a new name, but we’ve created an alias so email sent to your old email address still goes to your account, and you can log in with your old account name. In fact the only difference you’ll notice is that the username in the top right hand corner when you log in is shown differently, and system notification emails will mention your new account name, not the old account name.
Posted in News. Comments Off

Users must use http://old.fastmail.fm to access the old interface

Our "new" web interface was rolled out almost 2 years ago now. In that time, the vast majority of users have switched to using the new interface. It has significant improvements over the old interface, such as keyboard shortcuts, better searching capabilities, more customisation options, and all new features are developed exclusively for the new interface.

At the time of the rollout, we allowed people to choose which interface they preferred to use via a preference option. However maintaining the link that allows switching between the old and the new interface has been tricky and messy.

Because of that, we’re now completely separating the interfaces. Users that wish to use the old interface must now explicitly login at http://old.fastmail.fm. The preference set on the Options -> Account Preferences screen will no longer work.

Business/family users will also have to use http://old.fastmail.fm and use their full login name.

Please note the old interface is deprecated. At the time we launched the new interface we said "We plan to keep [the old interface] for at least 6-12 months." We’ve supported the interface significantly beyond that time, and will continue to support it for some more months, but it will be shut down later this year.

Posted in News. Comments Off

Special pricing for Enhanced signups/upgrades/extensions until end of December

Until the end of December 2010, we’re running a special on all Enhanced signups, upgrades and extensions.

We’re taking $10 off the regular price of each year. So instead of $39.95 for the first year and $29.95 for each subsequent year (if you select a multi-year subscription), it’s now $29.95 for the first year and $19.95 for each subsequent year.

Subscription   Current    Special
1 year         $39.95     $29.95  ($10 off)
2 years        $69.90     $49.90  ($20 off)
3 years        $99.85     $69.85  ($30 off)
4 years        $129.80    $89.80  ($40 off)
5 years        $159.75    $109.75 ($50 off)
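
The totals follow directly from the first-year and subsequent-year prices. As a quick check, here is a small Python sketch of the arithmetic; the function name enhanced_price is just for illustration.

```python
from decimal import Decimal


def enhanced_price(years: int, first_year: Decimal, later_year: Decimal) -> Decimal:
    """Total price: one first-year charge plus (years - 1) subsequent-year charges."""
    if years < 1:
        raise ValueError("years must be at least 1")
    return first_year + later_year * (years - 1)


# Regular and special per-year prices quoted above.
REGULAR = (Decimal("39.95"), Decimal("29.95"))
SPECIAL = (Decimal("29.95"), Decimal("19.95"))
```

For example, enhanced_price(3, *REGULAR) gives the regular 3 year total and enhanced_price(3, *SPECIAL) the special one; the difference is $10 for each year purchased.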

Note: new signups can initially only pay for one year. However if you signup for a new account, you can immediately go to Options -> Upgrade to take advantage of the multi-year upgrade option.

This special is strictly until the end of December 2010 only.

Posted in Marketing, News. Comments Off

New http://m., http://old., http://beta. and http://ssl. site prefixes

We’ve created a set of new domains for users to use to enable/test various features. These new prefixes should be added to the front of whichever domain you use. For instance instead of http://www.fastmail.fm, you can use http://m.fastmail.fm, http://old.fastmail.fm, http://beta.fastmail.fm and http://ssl.fastmail.fm. This also applies to all Fastmail domains, such as http://m.eml.cc, http://old.myfastmail.com, http://beta.sent.com, etc.
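
If you want to build these addresses programmatically, the rule can be sketched as below in Python. The helper name with_prefix is hypothetical, and the handling of a leading "www" (it is replaced rather than kept) is inferred from the examples above.

```python
from urllib.parse import urlsplit, urlunsplit

PREFIXES = ("m", "old", "beta", "ssl")


def with_prefix(url: str, prefix: str) -> str:
    """Return url with its host changed to use one of the site prefixes."""
    if prefix not in PREFIXES:
        raise ValueError("unknown prefix: %s" % prefix)
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    if labels[0] == "www":
        labels = labels[1:]  # http://www.fastmail.fm becomes http://m.fastmail.fm
    host = ".".join([prefix] + labels)
    if parts.port:
        host = "%s:%d" % (host, parts.port)
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

So with_prefix("http://www.fastmail.fm", "beta") yields http://beta.fastmail.fm, and with_prefix("http://eml.cc", "m") yields http://m.eml.cc.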

Force the mobile web interface via http://m.

Fastmail will attempt to detect if you’re using a mobile device, and display a mobile optimised version of the site if it detects that to be the case (eg Opera Mini, Opera Mobile, iPhone, etc). However it’s not possible to detect all devices, or in some cases the detection may be incorrect.

By using the http://m. prefix (eg http://m.fastmail.fm), this will force the mobile version of the site to be displayed.

Note that this is separate to the WAP version of the site, which is a very simple interface optimised for extremely low end phones that only have a WAP browser, which is different to a web browser. We generally don’t recommend the WAP site. For most low end phones we recommend using Opera Mini and the mobile http://m.fastmail.fm site.

Use the old web interface via http://old.

As mentioned the other day, we’re moving the old web interface from http://www.fastmail.fm/old/ to http://old.fastmail.fm.

The old web interface is deprecated. No more development or updates are being made to it. Features will be progressively disabled where they conflict with new changes (eg database changes, IMAP server changes, etc). We highly recommend users of the old web interface switch to the new interface. The improved search and keyboard shortcuts alone are a huge productivity improvement.

Sometime soon we’ll also be removing the user web interface preference, so that to log in to the old web interface you will have to use http://old.fastmail.fm. Using http://www.fastmail.fm will always log in to the new interface. We’ll be letting users of the old web interface know about that change shortly.

Use the beta web interface via http://beta.

The beta interface is where we test new features before rolling them out to production. We try and keep the beta interface stable, but we definitely don’t guarantee it. It may have serious bugs that cause downtime and/or email loss. If you like living on the bleeding edge you can use it, but for general day to day usage we don’t recommend it.

Previously the beta server lived at http://www.fastmail.fm/beta/ but is moving to http://beta.fastmail.fm.

Force redirect to https:// (SSL encrypted) version of the site via http://ssl.

Fastmail supports over 100 different domains for users to signup at, as well as thousands of hosted domains for users, families and businesses. Unfortunately because of the way SSL encryption works, you need a separate SSL certificate for every domain (yes, there are some exceptions to this such as wildcards and SANs, but the general rule applies). It would be prohibitively expensive to buy SSL certificates for every domain we support.

Instead whenever we want to secure a connection, we redirect a user to our https://www.fastmail.fm domain. However this can be a little confusing to users if we immediately do this when they go to our other addresses like http://eml.cc, http://sent.com, etc, so over the years we’ve built a slightly complex set of rules.

When you first enter a domain in your browser (eg http://eml.cc), we don’t redirect. However if you click "Secure Login", we will replace the target of the post request to https://www.fastmail.fm so the content (eg your username and password) is encrypted.

At that point, we also set a cookie at the eml.cc domain, so the next time you go to http://eml.cc, it automatically redirects immediately to https://www.fastmail.fm so everything is immediately encrypted. This was done because the default login button (eg "Secure Login" or "Login") used to be set depending on if you were at an https:// or http:// domain respectively. If you clicked "Secure Login", we assumed you wanted "Secure Login" to be the default next time. This isn’t as relevant now that "Secure Login" is always the default, but it’s still good practice to redirect to the secure site immediately.

Adding to these issues is the way usernames and domains interact. If you have an account bob@eml.cc, then if you go to http://eml.cc, you can log in with just the username "bob". However if you go to http://www.fastmail.fm, you would have to log in with the full name "bob@eml.cc". Note that this works even if a redirect occurs. That is, if you go to http://eml.cc and you’re redirected to https://www.fastmail.fm because of a previous "Secure Login" you did, you can still log in with just "bob", because Fastmail remembers the "original" domain you arrived on.

However this doesn’t help users that use public terminals a lot. There won’t be a redirect cookie to go to the secure site by default. To help users with non @fastmail.fm addresses who are security conscious, we’ve added http://ssl. prefixes to all sites (eg. http://ssl.eml.cc), which cause an immediate redirect to https://www.fastmail.fm while remembering the original domain.
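
The rules described in this section can be summarised in a short Python sketch. This is a simplified model for illustration only, not our server code: the function names, the boolean standing in for the redirect cookie, and the stripping of a leading "www" from the remembered domain are all assumptions.

```python
from typing import Tuple

SECURE_SITE = "https://www.fastmail.fm"


def plan_request(host: str, has_secure_cookie: bool) -> Tuple[bool, str]:
    """Return (redirect_to_secure_site, original_domain) for an http:// request.

    A redirect happens when the ssl. prefix is used or when a previous
    "Secure Login" left a cookie on this domain. The original domain is
    remembered either way so that short usernames keep working.
    """
    labels = host.lower().split(".")
    forced = labels[0] == "ssl"
    if labels[0] in ("ssl", "www"):
        labels = labels[1:]
    return (forced or has_secure_cookie), ".".join(labels)


def full_login_name(username: str, original_domain: str) -> str:
    """Expand a short username such as "bob" using the remembered domain."""
    return username if "@" in username else "%s@%s" % (username, original_domain)
```

For a first visit to http://eml.cc on a public terminal, plan_request("eml.cc", False) does not redirect, whereas plan_request("ssl.eml.cc", False) does, and in both cases full_login_name("bob", "eml.cc") resolves to bob@eml.cc.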

This is a small tweak, but useful for some people that are security conscious.

Posted in News, Technical. Comments Off