HTTP keep-alive connection timeouts

This is a technical post. Regular Fastmail users subscribed to receive email updates from the Fastmail blog can just ignore this post.

The average user of the Fastmail website is probably a bit different from the average user of most websites. Webmail tends to be a "productivity application" that people use for an extended period of time. So for the number of web requests we get, we probably have fewer individual users than other similarly sized sites, but the users we do have tend to stay for a while and perform lots of actions/page views.

Because of that, we like to have a long HTTP keep-alive timeout on our connections. This makes interactive response nicer for users: moving to the next message after spending 30 seconds reading a message is quick because we don’t have to set up a new TCP connection or SSL session. We just send the request and get the response over the existing keep-alive connection. Currently we set the keep-alive timeout on our frontend nginx servers to 5 minutes.

I did some testing recently, and found that most clients didn’t actually keep the connection open for 5 minutes. Here are the figures I measured based on Wireshark dumps.

  • Opera 11.11 – 120 seconds
  • Chrome 13 – at least 300 seconds (server closed after 300 second timeout)
  • IE 9 – 60 seconds (changeable in the registry, appears to apply to IE 8/9 as well though the page only mentions IE 5/6/7)
  • Firefox 4 – 115 seconds (changeable in about:config with network.http.keep-alive.timeout preference)

I wondered why most clients used <= 2 minutes, but Chrome was happy with much higher.

Interestingly one of the other things I noticed while doing this test with Wireshark is that after 45 seconds, Chrome would send a TCP keep-alive packet, and would keep doing that every 45 seconds until the 5 minute timeout. No other browser would do this.

After a bunch of searching, I think I found out what’s going on.

It seems there are some users behind NAT gateways/stateful firewalls that have a 2 minute state timeout. So if you leave an HTTP connection idle for > 2 minutes, the NAT/firewall starts dropping any new packets on the connection and doesn’t even RST the connection, so TCP goes into a long retry mode before finally reporting to the application that the connection timed out.

To the user, the visible result is that after doing something with a site, if they wait > 2 minutes and then click on another link/button, the action will take ages before eventually timing out. There’s a Chrome bug about this here:

http://code.google.com/p/chromium/issues/detail?id=27400

So the Chrome solution was to enable SO_KEEPALIVE on sockets. On Windows 7 at least, this seems to cause TCP keep-alive pings to be sent after 45 seconds and every subsequent 45 seconds, which avoids the NAT/firewall timeout. On Linux/Mac I presume this is different, because there the timings are kernel tunables that default to much higher values. (Update: I didn’t realise you can set the idle time and interval for keep-alive pings at the application level in Linux and Windows.)

This allows Chrome to keep truly long lived HTTP keep-alive connections. Other browsers seem to have worked around this problem by just closing connections after <= 2 minutes instead.

I’ve mentioned this to the Opera browser network team, so they can look at doing this in the future as well, to allow longer lived keep-alive connections.

I think it’s going to be a particularly real problem with Server-Sent Event type connections that can be extremely long lived. We’re either going to have to send application level server -> client pings over the channel every 45 seconds to make sure the connection is kept alive, or enable a very low keep-alive time on the server and enable SO_KEEPALIVE on each event source connected socket.
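As a minimal sketch of the second option (in Python, and not taken from any production code), the following enables SO_KEEPALIVE on a socket and, where the platform exposes them, sets the idle time and probe interval per socket. The function name enable_keepalive, the probe count of 4 and the use of 45 seconds for both timers are assumptions for illustration. TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are the Linux option names and may be missing on other platforms, which is why each one is looked up before use.

```python
import socket


def enable_keepalive(sock, idle=45, interval=45, count=4):
    """Turn on TCP keep-alive probes and tune their timing where possible."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning options; a platform that lacks one simply keeps
    # its system-wide default for that setting.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between subsequent probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
```

On a server, the same call would be applied to each accepted event source connection rather than to a freshly created socket.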

robm@fastmail.fm

Posted in Technical. Comments Off

Download non-English filenames

This is a technical post. Regular Fastmail users subscribed to receive email updates from the Fastmail blog can just ignore this post.

When you want to send a download file to a user based on a web request, it’s well known you can just set the Content-Disposition header to attachment to get the browser to download the content and save it locally on the user’s machine. Additionally you can add a filename parameter to control the filename displayed to users.

Content-Disposition: attachment; filename="foo.txt"

The problem comes when the filename contains non-English (really, non-ASCII) characters. RFC2231 defines a way of adding character set encodings to MIME parameters. Unfortunately support for this RFC is scattered, and browsers have implemented various internal hacks/workarounds (eg. % URL encoded octets). The situation is sufficiently complicated that someone came up with a comprehensive set of tests and there’s a good Stack Overflow answer.

However, looking over the test case examples, I realised that there appeared to be a solution that would work quite well on all browsers except Safari. The attwithfn2231utf8 test shows that all modern browsers except IE and Safari support the RFC2231 encoding. The attfnboth test shows that if you have a traditional filename parameter followed by an RFC2231 filename* parameter, IE and Safari pick the traditional parameter. The attwithfnrawpctenclong test shows that if you use % URL encoded octets in a traditional filename parameter, IE attempts to decode them as UTF-8 octets.

Putting that together, if you want to send a file called foo-ä.html, then setting a header of:

Content-Disposition: attachment; filename="foo-%c3%a4.html"; filename*=UTF-8''foo-%c3%a4.html

will cause IE8+, Opera, Chrome, FF4+ (but not Safari) to correctly save a file named foo-ä.html. This should be easy to do with a URL escaping library that encodes UTF-8 octets not in the unreserved character set.
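As an illustration, here is a small Python sketch that builds such a header value with the standard library's urllib.parse.quote. The function name content_disposition is made up for this example. Note that quote emits uppercase hex digits (%C3%A4 rather than %c3%a4); percent-encoding is case-insensitive, so the two forms are equivalent.

```python
from urllib.parse import quote


def content_disposition(filename):
    """Build an attachment header value carrying both filename forms."""
    # With safe="", quote() percent-encodes the UTF-8 octets of every
    # character outside the unreserved set (letters, digits, "-", ".", "_", "~").
    encoded = quote(filename, safe="")
    return 'attachment; filename="{0}"; filename*=UTF-8\'\'{0}'.format(encoded)


print("Content-Disposition: " + content_disposition("foo-ä.html"))
```

Because the same encoded string is used for both parameters, plain ASCII filenames pass through unchanged apart from characters such as spaces, which become %20.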

robm@fastmail.fm

Posted in Technical. Comments Off

World IPv6 day

This is a technical blog post about Fastmail’s support of an internet standard called ipv6. Most users shouldn’t notice any difference at all, and can ignore this post. For those interested, we’ve included some description about what we’ve done to support ipv6.

We didn’t actually get organised enough to register ourselves for world ipv6 day, but we got ipv6 up and running in time, so we’ve enabled it anyway.  You can read more about ipv6 day here: http://www.worldipv6day.org/

Our prefix for NYI is 2610:1c0:0:1::, and for convenience we’re mapping all the IP addresses from our ipv4 space (66.111.4.0/24) as direct offsets into that ipv6 space.  All the public service IPs are now bound in ipv4 and ipv6.  There was some magic required to support our failover system, because Linux doesn’t offer an option to bind non-local ipv6 addresses. So we do a little dance where the address is bound either to the loopback interface on one machine as a /128 (host only) address, or to the external interface as a /64 (fully networked) address, depending on where the service is located.  It seems to work OK, which is the main thing!
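To make the offset mapping concrete, here is a hedged Python sketch using the standard ipaddress module. It assumes "direct offset" means plain numeric addition: the address's position within 66.111.4.0/24 is added to the ipv6 prefix. The post does not say whether the offset is written as its hexadecimal value or by reusing the decimal digits, so treat the exact output format as an assumption. The names V4_NET, V6_BASE and v4_to_v6 are invented for this example.

```python
import ipaddress

V4_NET = ipaddress.ip_network("66.111.4.0/24")
V6_BASE = ipaddress.ip_address("2610:1c0:0:1::")


def v4_to_v6(addr):
    """Map an address in V4_NET to V6_BASE plus the same numeric offset."""
    v4 = ipaddress.ip_address(addr)
    if v4 not in V4_NET:
        raise ValueError("{} is outside {}".format(addr, V4_NET))
    # Offset of the address from the start of the ipv4 network.
    offset = int(v4) - int(V4_NET.network_address)
    return V6_BASE + offset


print(v4_to_v6("66.111.4.25"))
```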

Due to RBL issues and the lack of working reverse DNS, we have not enabled ipv6 for inbound or outbound SMTP, and our DNS server doesn’t support ipv6 connectivity, so all your DNS queries will still be over ipv4.

The domains with ipv6 support (AAAA records) are:

  • mail.messagingengine.com
  • web.messagingengine.com
  • dav.messagingengine.com
  • www.fastmail.fm
  • mail.opera.com

This will pick up the majority of web, imap, pop and smtp authenticated traffic. If you have ipv6 connectivity, you should be transparently using ipv6 now.  We are getting random bits of ipv6 traffic in the logs, so it’s clearly working.

Our FTP server doesn’t support ipv6 EPRT and EPSV commands, so I haven’t added a record for ftp.messagingengine.com.

You can also try ipv6.messagingengine.com if you want to guarantee you’re using ipv6 only. That host doesn’t have an A record for ipv4.

Unless there are reports of significant problems with this experiment, we will remain dual stack into the future :)

Posted in Technical. Comments Off

Outage report – a cascade of errors

On Friday we had one of the worst outages we’ve had in over 3 years. For at least 3 hours, all accounts were inaccessible (web, IMAP and POP), and for a few users, it was several hours longer than that. For some other users, there were additional mailbox and quota problems after that.

Obviously this is something we never want to happen, and over the years we’ve set up many systems to avoid outages like this occurring.

Summary

A small “trivial” change to a configuration program that was rolled out caused a cascading series of events that resulted in some important files being corrupted. We had to take down all services and rebuild the corrupted files from the last backup and add in any changes since the backup. Once rebuilt, we were able to bring back up all services. A separate corruption issue that affected a few users caused some longer outages and quota issues until we fixed those mailboxes.

Future mitigation

We’ve identified the chain of events that caused the problems. Because it’s actually a chain of events, we’ve identified at least 5 separate issues to fix, so that this problem, or something similar to it won’t happen again, and we’ll be doing those over the next week.

Of course we can never be 100% sure that there aren’t other cascade event paths that will cause outages. But by learning from past mistakes, fixing the known problems, continuously enhancing our test infrastructure, and being more aware of possible consequential errors in the future, we’re always aiming to minimise the chance of them occurring and to provide the highest reliability possible.

Technical description

Although it’s draining and frustrating to see a large problem like this occur that so badly affects our users, it’s been fascinating to actually investigate this problem. What we end up seeing is how a set of little mistakes, bad timing, decisions made long ago, and human error (all of which are wonderfully obvious in hindsight) end up causing a much bigger problem than the initial trigger would ever suggest.

The domino effect

In this case, the problem stemmed from a cascading sequence of issues that started with a single misplaced comma.

  1. Cyrus configuration file error. The underlying trigger of the cascade was a single misplaced comma in a configuration file. In this case the error was detected by the developer during testing, who fixed the problem, but unfortunately pushed the fix to a different branch.
  2. Core dump behaviour of fatal(). The effect of the broken configuration file was that immediately after forking a new imapd process, it would try to parse the configuration file, fail to do so, and call the fatal() function. Normally that would just cause the process to exit. However our branch of cyrus has a patch we added that means all calls to the fatal() function dump core instead of just exiting; this is normally very useful for debugging and quality control.
  3. Kernel configured to add pid to core files. We also configure our kernels with the sysctl kernel.core_uses_pid=1 which ensures that each separate process crash/abort() generates a separate core file on disk rather than overwriting the previous one. Again this is very useful for debugging.
  4. Cyrus master process doesn’t rate limit forking of child processes. The cyrus master process that forks child processes doesn’t do enough sanity checking. Specifically, if an imapd exits immediately after forking, the master process will happily immediately fork another imapd, despite there being zero chance that the new imapd will do any better. At the very least this leads to a CPU-chewing loop (as well as a non-functional imapd) as each forked imapd process immediately exits and the master creates a new one.
  5. Core files end up in cyrus meta directory. Cyrus supports the concept of separating meta-data files from email data. This is very useful as it allows us to place the small but "hot" meta data files on fast (but small) drives (eg. 10k/15k RPM drives, or SSD drives in new machines), and place the email data files on slower and much larger disks. The "cores" directory where core dumps end up is located on the same path as the meta data directory.
  6. Cyrus skiplist database format can corrupt in disk full conditions. Cyrus stores some important data in an internal key/value database format called skiplist. The most important data is a list of all mailboxes on the server. This database format works very well for the way cyrus accesses data, and it’s been very fast and robust. However it turns out the code doesn’t handle the situation where a disk fills up and writes only partially succeed, causing database corruption.

Putting all of the above together creates the disaster. A small configuration change was rolled out. Every new incoming IMAP connection would cause a new imapd to be forked and immediately abort and dump core. Each core file would end up with a separate filename. This very quickly caused the cyrus meta partitions to fill up. They reached 100% full before we fully realised what was happening. This caused changes to the mailboxes database to only partially write, causing a lot of them to become corrupted.
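The partial write failure mode can be illustrated with a generic defensive pattern: remember the file length before appending, check every write, and truncate back to the old length if anything fails. The Python sketch below is only an illustration of that idea under simplified assumptions (a single writer appending records to one file). It is not the skiplist code, which is written in C and has its own format, and the function name append_record is invented here.

```python
import os


def append_record(path, record):
    """Append record to path; on any failure, truncate back to the old length."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        start = os.fstat(fd).st_size
        try:
            written = 0
            while written < len(record):
                # os.write may write fewer bytes than requested, and raises
                # OSError (for example ENOSPC) when the disk is full.
                written += os.write(fd, record[written:])
            os.fsync(fd)
        except OSError:
            # Drop the partial record so the file never ends mid-record.
            os.ftruncate(fd, start)
            raise
    finally:
        os.close(fd)
```

The caller still sees the error, but the file on disk contains only whole records, so a later reader does not have to cope with a torn entry.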

When we realised this is what had happened, we quickly stopped all services, undid the change, and tried to recover the corrupted databases. Fortunately the databases are backed up each half hour, and there are replicas as well. Using some libraries, we were able to quickly put together code that pulled any still valid records, records from the backup, and records from the replicas, combined them, and rebuilt the mailboxes databases. We then started everything back up.

Adding insult to injury

Fortunately for most people, the mess stopped there. Unfortunately for a few users, there were some additional problems as well.

As well as the mailboxes database corruptions, it was discovered that the code that maintains the cyrus.header and cyrus.index files also didn’t like the partial writes that disk full conditions generate. This caused a small number of mailboxes to be completely inaccessible (Inbox select failed).

Fortunately cyrus has a simple utility to fix corruptions of this form called "reconstruct", so we ran that to fix up any broken mailboxes. Fixing up a mailbox with reconstruct however doesn’t fix up quota calculations, and that has a separate utility "quota" that you can run with a -f flag to fix quotas. We ran that on users to make sure all quota calculations were correct.

Unfortunately there’s a bug in the quota fix code that in some edge cases can double the apparent quota usage of users. This caused a number of accounts to have an incorrect quota usage set, and in some cases caused them to go over their quota, causing new messages to be delayed or bounced.

The little test that cried wolf

However the story wouldn’t be complete without the additional human errors that let this happen as well. Thanks to help from the My Opera developer Cosimo, we internally have a Jenkins continuous integration (CI) server set up. This means that on every code/configuration commit, the following tests occur:

  • Roll back a virtual machine instance to a known state
  • Start the virtual machine
  • git pull the latest repository code
  • Re-install all the configuration files
  • Start all services
  • Run a series of code unit tests
  • Run a series of functional tests that test logging into the web interface, sending email, receiving the email, reading the email, and much more. There’s also a series of email delivery tests to check that all the core aspects of spam, virus and backscatter checking work as expected.

This test should have picked up the problem. So what happened? Well a day or so before the problem commit, another change occurred that altered the structure of branches on the primary repository. This caused the git pull the CI server does to fail, and thus the CI tests to fail.

While we were in the process of working out what was wrong and fixing this on the CI server (only pulling the "master" branch turned out to be the easiest fix), the problematic commit went in. So once we fixed up the git pull issue, we then found that the very first test on the CI server was failing with some strange IMAP connection failure error. Rather than believing this was a real problem, we assumed it was due to something else with the CI tests being broken after the branch changes, and resolved to look at it on Monday. Of course the test really was broken due to a bad commit, and as Murphy’s Law would dictate, someone else would do a rollout on Friday Norway time of the broken commit.

Putting it all together

A combination of software, human and process errors all came together to create an outage that affected all Fastmail users for several hours at least, and some users even more.

Obviously we want to ensure this particular problem doesn’t happen again, and more importantly, that processes are in place where possible to avoid other cascade type failures in the future. We already have tickets to fix the particular code and configuration related issues in this case:

  • link cyrus cores and log directories to another partition
  • make skiplist more robust in case of a disk full condition
  • make cyrus.header/index/cache more robust in case of a disk full condition
  • cyrus master respawning should backoff in the case of immediate crashes
  • fix cyrus quota -f to avoid random quota usage doubling
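The respawn backoff item can be sketched as a small delay calculation. This is a hypothetical Python illustration of exponential backoff, not the cyrus master code. The function name next_delay and the thresholds (a child living less than 1 second counts as an immediate crash, delays start at 0.1 seconds and are capped at 30 seconds) are all assumptions.

```python
def next_delay(prev_delay, lifetime, min_lifetime=1.0, base=0.1, cap=30.0):
    """Return how long to wait before respawning a child that just exited.

    prev_delay is the delay used before the previous spawn (0.0 if none);
    lifetime is how long the child ran, in seconds.
    """
    if lifetime >= min_lifetime:
        return 0.0  # the child ran normally: respawn straight away
    if prev_delay == 0.0:
        return base  # first immediate crash: short pause
    return min(prev_delay * 2, cap)  # repeated crashes: double, up to the cap


# Five children in a row that die immediately.
delay = 0.0
for _ in range(5):
    delay = next_delay(delay, lifetime=0.01)
    print(delay)
```

With a scheme like this, a child that crashes on startup is retried at a steadily falling rate instead of in a tight loop, which also limits how quickly core files can accumulate.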

We’ve also witnessed how important it is to keep the CI tests clean and to track down all failures. We’ve immediately added new tests to sanity check database, imap and smtp connections as a very first step before any other functional tests are run. If any of them fail, we tail the appropriate log files and list the contents of the core directories, so the CI failure emails that are sent to all developers will make it very clear that there’s a serious problem that needs immediate investigation.
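A connection sanity check of this kind can be very small. The Python sketch below is one possible shape for it and is not the actual test code: it connects to a service and checks that the greeting starts with the expected bytes. IMAP servers greet with "* OK" and SMTP servers with "220"; the function name banner_ok and the 5 second timeout are assumptions.

```python
import socket


def banner_ok(host, port, expected, timeout=5.0):
    """Return True if the service at host:port greets with the expected prefix."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(512)
    except OSError:
        # Connection refused, timed out, or reset: the service is not healthy.
        return False
    return banner.startswith(expected)
```

A test runner could call banner_ok(host, 143, b"* OK") and banner_ok(host, 25, b"220") before anything else, and stop with a loud failure if either returns False.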

Posted in News, Technical. Comments Off

New http://m., http://old., http://beta. and http://ssl. site prefixes

We’ve created a set of new domains for users to use to enable/test various features. These new prefixes should be added to the front of whichever domain you use. For instance instead of http://www.fastmail.fm, you can use http://m.fastmail.fm, http://old.fastmail.fm, http://beta.fastmail.fm and http://ssl.fastmail.fm. This also applies to all Fastmail domains, such as http://m.eml.cc, http://old.myfastmail.com, http://beta.sent.com, etc.

Force the mobile web interface via http://m.

Fastmail will attempt to detect if you’re using a mobile device, and display a mobile optimised version of the site if it detects that to be the case (eg Opera Mini, Opera Mobile, iPhone, etc). However it’s not possible to detect all devices, or in some cases the detection may be incorrect.

Using the http://m. prefix (eg http://m.fastmail.fm) forces the mobile version of the site to be displayed.

Note that this is separate to the WAP version of the site, which is a very simple interface optimised for extremely low end phones that only have a WAP browser, which is different to a web browser. We generally don’t recommend the WAP site. For most low end phones we recommend using Opera Mini and the mobile http://m.fastmail.fm site.

Use the old web interface via http://old.

As mentioned the other day, we’re moving the old web interface from http://www.fastmail.fm/old/ to http://old.fastmail.fm.

The old web interface is deprecated. No more development or updates are being made to it. Features will be progressively disabled where they conflict with new changes (eg database changes, IMAP server changes, etc). We highly recommend users of the old web interface switch to the new interface. The improved search and keyboard shortcuts alone are a huge productivity improvement.

Sometime soon we’ll also be removing the user web interface preference, so that to log in to the old web interface you will have to use http://old.fastmail.fm; using http://www.fastmail.fm will always log in to the new interface. We’ll be letting users of the old web interface know about that change shortly.

Use the beta web interface via http://beta.

The beta interface is where we test new features before rolling them out to production. We try and keep the beta interface stable, but we definitely don’t guarantee it. It may have serious bugs that cause downtime and/or email loss. If you like living on the bleeding edge you can use it, but for general day to day usage we don’t recommend it.

Previously the beta server lived at http://www.fastmail.fm/beta/ but is moving to http://beta.fastmail.fm.

Force redirect to https:// (SSL encrypted) version of the site via http://ssl.

Fastmail supports over 100 different domains for users to signup at, as well as thousands of hosted domains for users, families and businesses. Unfortunately because of the way SSL encryption works, you need a separate SSL certificate for every domain (yes, there are some exceptions to this such as wildcards and SANs, but the general rule applies). It would be prohibitively expensive to buy SSL certificates for every domain we support.

Instead, whenever we want to secure a connection, we redirect a user to our https://www.fastmail.fm domain. However this can be a little confusing for users if we immediately do this when they go to our other addresses like http://eml.cc, http://sent.com, etc, so over the years we’ve built a slightly complex set of rules.

When you first enter a domain in your browser (eg http://eml.cc), we don’t redirect. However if you click "Secure Login", we will replace the target of the post request to https://www.fastmail.fm so the content (eg your username and password) is encrypted.

At that point, we also set a cookie at the eml.cc domain, so the next time you go to http://eml.cc, it automatically redirects immediately to https://www.fastmail.fm so everything is immediately encrypted. This was done because the default login button (eg "Secure Login" or "Login") used to be set depending on if you were at an https:// or http:// domain respectively. If you clicked "Secure Login", we assumed you wanted "Secure Login" to be the default next time. This isn’t as relevant now that "Secure Login" is always the default, but it’s still good practice to redirect to the secure site immediately.

To add to these issues is the way usernames and domains interact. If you have an account bob@eml.cc, then if you go to http://eml.cc, you can log in with just the username "bob". However if you go to http://www.fastmail.fm, you would have to log in with the full name "bob@eml.cc". Note that this works even if a redirect occurs. That is, if you go to http://eml.cc and you’re redirected to https://www.fastmail.fm because of a previous "Secure Login" you did, you can still log in with just "bob", because Fastmail remembers the "original" domain you arrived on.

However this doesn’t help with users that use public terminals a lot. There won’t be a redirect cookie to go to the secure site by default. To help users with non @fastmail.fm addresses who are security conscious, we’ve added http://ssl. prefixes to all sites (eg. http://ssl.eml.cc) which cause an immediate redirect to https://www.fastmail.fm while remembering the original domain.
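The interaction between the arrival domain and the username can be sketched as follows. This is a hypothetical Python illustration, not the site's actual login code: it assumes the original domain is recovered by stripping a leading ssl. or www. label from the host the user arrived on, and that a bare username is completed with that domain. The names PREFIXES, original_domain and full_username are invented for the example.

```python
# Host prefixes assumed to be decorations rather than part of the mail domain.
PREFIXES = ("ssl.", "www.")


def original_domain(host):
    """Recover the mail domain from the host the user first arrived on."""
    host = host.lower()
    for prefix in PREFIXES:
        if host.startswith(prefix):
            return host[len(prefix):]
    return host


def full_username(username, arrived_on):
    """Qualify a bare username with the remembered arrival domain."""
    if "@" in username:
        return username  # already a full address, use it as given
    return "{}@{}".format(username, original_domain(arrived_on))


print(full_username("bob", "ssl.eml.cc"))
```

The arrival host would have to be carried across the redirect to https://www.fastmail.fm in some way; the post does not say how that is done, so that part is left out here.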

This is a small tweak, but useful for some people that are security conscious.

Posted in News, Technical. Comments Off

Old web interface moved to http://old.fastmail.fm

To make some internal changes easier, we’ve now moved the old web interface to http://old.fastmail.fm. For the moment, http://www.fastmail.fm/old/ will continue to work, but this will be phased out shortly so you should use http://old.fastmail.fm from now on.

The old web interface is deprecated. No more development or updates are being made to it. Features will be progressively disabled where they conflict with new changes (eg database changes, IMAP server changes, etc). We highly recommend users of the old web interface switch to the new interface. The improved search and keyboard shortcuts alone are a huge productivity improvement.

Sometime soon we’ll also be removing the user web interface preference, so that to log in to the old web interface you will have to use http://old.fastmail.fm; using http://www.fastmail.fm will always log in to the new interface. We’ll be letting users of the old web interface know about that change shortly.

Posted in News, Technical. Comments Off

HTML editor on beta server upgraded

We use a third party editor to allow editing of HTML emails. For a while, we’ve used FCKEditor, but development on it has stopped in favour of its replacement, CKEditor. Because of compatibility problems reported between FCKEditor and the soon to be released IE9, we’re going to upgrade to CKEditor.

I’ve now done that on our beta server, so if anyone has any problems, please let me know at robm@fastmail.fm so we can fix them before we release the change to production.

Posted in Technical. Comments Off

SSL certificate for www.fastmail.fm changed to *.fastmail.fm

We’ve made a small change to the configuration on our servers that replaces the SSL certificate we were using, which was just for www.fastmail.fm, with a wildcard *.fastmail.fm certificate. For users, there should be no visible change, and everything should just continue to work as normal. However the last time we made a change like this, some very old email clients/browsers had problems. So if you suddenly receive a new warning message, nothing is wrong; it’s just that your email client/browser is very old and should be updated.

Posted in News, Technical. Comments Off

Speeding up WebDAV on Windows 7

Windows 7 has a built in WebDAV client that allows you to access your Fastmail file storage area as just another drive on your computer.

  • Open Windows Explorer
  • Click on “Computer” in the sidebar list at the left
  • Click “Map network drive” on the button list at the top of the window
  • Select a Drive letter to map to (Z: is the default)
  • Enter https://dav.messagingengine.com/ as the Folder to connect to
  • You’ll be prompted for your username (use your full username including the @domain part) and password, and whether you want to automatically remember those login credentials in the future

Unfortunately by default, you’ll probably find that accessing files and directories is very slow, with a multi-second pause between any action. You can speed this up by doing the following:

  • Open Internet Explorer
  • Go to Tools –> Internet Options –> Connections (tab) –> LAN Settings (Button) and then uncheck the “Automatically detect settings” checkbox.
  • Click Ok (button) and Ok (button) again to close the dialogs, then quit Internet Explorer

You should find WebDAV access performance is considerably improved.

Posted in News, Technical. Comments Off

cyrus 2.4 released

Over the last 6-12 months, one of our employees Bron Gondwana has put a lot of effort into improving our IMAP/POP server, which is the open source cyrus IMAP server. Most of the improvements were rolled out to Fastmail customers a few months ago and documented in this blog post.

Today however marks the official release of cyrus 2.4, which includes all of Bron’s changes and numerous other improvements from other contributors. Congratulations to everyone involved with the cyrus project!

Posted in News, Technical. Comments Off