iOS 5 and mail application access patterns

This post contains some observations about how the mail application in iOS 5 appears to interact with IMAP servers. We're posting it mostly as a reference for anyone who's interested.

In the iOS settings, you can choose a "fetch interval", which can be one of:

  • Manually (never fetches automatically)
  • Every 15 minutes
  • Every 30 minutes
  • Every hour
  • Push (only shown on servers supporting it, which I believe is currently only Exchange servers or Yahoo Mail)

If you choose "Manually", then there is no persistent connection once you exit the mail app.

If you choose any other interval, then a background daemon holds a persistent connection to the mail server. We don't know exactly why it holds the connection open, or whether it leaves the connection in the IMAP IDLE state so the server can push updates to it. The main advantage of holding the connection open is probably skipping the overhead of re-authenticating and handshaking, but there's also no good reason to explicitly close the connection after every fetch, given that IMAP is designed to be a long-lived protocol.
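For reference, IMAP IDLE (RFC 2177) is the mechanism that lets a server push "something changed" notifications over an otherwise idle connection. A rough sketch of what that looks like from a client's point of view, with a hypothetical host and credentials (Python's imaplib has no built-in IDLE support, so this drives the protocol by hand):

import imaplib

# Hypothetical host and credentials, for illustration only.
conn = imaplib.IMAP4_SSL("mail.example.com")
conn.login("user@example.com", "password")
conn.select("INBOX")

# Enter IDLE: the server will now push untagged responses
# (e.g. "* 42 EXISTS") whenever the selected mailbox changes.
conn.send(b"A001 IDLE\r\n")
print(conn.readline())        # expect something like b'+ idling\r\n'

# ... block on conn.readline() here to receive pushed updates ...

# Leave IDLE again and log out cleanly.
conn.send(b"DONE\r\n")
print(conn.readline())        # expect b'A001 OK ...'
conn.logout()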

If you have a fetch interval set and your network connection breaks, iOS will attempt to reconnect the next time it wants to fetch your mail.

Note that the intervals listed appear to be only approximate. iOS seems to be smart about batching requests together, so it gets as much work done as it can while the phone is awake or the network connection is "up". Also, opening the mail app, or opening a folder within it, will often trigger a refresh.

Posted in Technical.

"View" link removed from attachments on message read screen in "Public Terminal" mode

When you enable the "Public Terminal" option on the login screen, Fastmail sets the "no-cache" and "no-store" Cache-Control headers on every page. This means that browsers should not store a copy of the pages you visit (e.g. emails you read) to their local disk. Even after you log out of your session and leave the computer, if someone comes along and tries to view a page from the browser history, the browser should re-check with the server first, and the server will of course respond with "this user is now logged out" and show the login page instead.
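At the HTTP level the mechanism is just a response header. A minimal sketch of the idea (illustrative only, not our actual server code):

# Illustration only: in "Public Terminal" mode, every page response carries
# a Cache-Control header telling the browser not to keep a copy on disk.
def add_privacy_headers(headers: dict, public_terminal: bool) -> dict:
    if public_terminal:
        headers["Cache-Control"] = "no-cache, no-store"
    return headers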

However, there is a problem with this setup related to attachments. When an email has an attachment, the content of the attachment might be in a format the browser doesn't understand (e.g. a Microsoft Word document). In that case, the browser has to save a copy of the attachment to the local disk, and then launch Microsoft Word to open the file.

Now in the case of the "View" link, the saving to disk would normally be done automatically into a temporary file storage area. However in IE, if you try to download a document over SSL with the no-cache or no-store attributes set, IE will explicitly not save the file to disk, and then when it tries to launch Microsoft Word to read the file, you'll get a "file does not exist" error or the like.

http://support.microsoft.com/kb/812935
http://support.microsoft.com/kb/815313

Other browsers appear to avoid this problem by actually saving a copy to disk in the temporary storage area, but deleting the file when you close the browser (at least that's what Firefox did when I tested). That still potentially leaves the file on disk for some time.

To ensure the best privacy possible, while still allowing people to view attached documents in "Public Terminal" mode, we’ve decided to do the following:

  • When you log in with the "Public Terminal" option, we've removed the "View" link next to attachments. This solves two problems: the unexpected "file not found" error in IE, and the privacy concern of attachments being stored to disk in the temporary file area of other browsers.
  • We've left the "View" link next to image attachments, because the web browser can display images itself without launching a separate program, so it can obey the "no-cache"/"no-store" directives.
  • With the "Download" link (which automatically brings up a "Save as…" dialog box), we've removed the "no-cache" and "no-store" settings, which means that IE will let you download the file and save it somewhere, so you can then open it to view the document.

We like this solution because it makes things clearer to the user. In "Public Terminal" mode, if you want to view an attachment, you have to download it first, explicitly save it somewhere, and then view it. The alternative approach of letting the browser handle it either fails (IE) or auto-saves the file to a temporary area, leaving it cached on the machine when the user doesn't expect it.

Posted in News, Technical.

TCP keepalive, iOS 5 and NAT routers

This post contains some very technical information. For users just interested in the summary:

If over the next week you experience an increase in frozen, non-responding or broken IMAP connections, please contact our support team (use the "Support" link at the top of the http://www.fastmail.fm homepage) with details. Please make sure your report includes your operating system, email software, how you connect to the internet, and what modem/router/network connection you use.

The long story: the IMAP protocol is designed around long-lived connections. That is, your email client connects from your computer to the server, and stays connected for as long as possible.

In many cases, the connection remains open but completely idle for extended periods of time while your email client is running and you are doing other things.

In general, while a connection is idle, no data at all is sent between the server and the client, but both ends know the connection still exists, so as soon as data is available on one side, it can be sent to the other just fine.

There is a problem in some cases though. If you have a home modem and wireless network, then you are usually using a system called NAT (Network Address Translation) that allows multiple devices on your network to connect to the internet through one connection. For NAT to work, your modem/router must keep a mapping for every connection from any device inside your network to any server on the internet.

The problem is that some modems/routers have very poor NAT implementations that "forget" the NAT mapping for any connection that's been idle for 5 minutes or more (for some devices it appears to be 10 minutes or more). What this means is that if an IMAP connection remains idle with no communication for 5 minutes, the connection is broken.

In itself this wouldn't be so bad, but the way the connection breaks is nasty: rather than either end being told "this connection has been closed", packets from the client or server simply disappear, which causes some unpleasant user-visible behaviour.

The effect is this: if you leave your email client idle for 5 minutes and the NAT mapping is lost, then the next time you try to do something with the client (e.g. read or move an email), the client sends the appropriate command to the server. The TCP packets that contain the command never arrive at the server, but no RST packets come back to tell the client that there's a problem with the connection either; the packets just disappear. So the local machine retransmits after a timeout period, and again a few more times, until usually about 30 seconds later it finally gives up, marks the connection as dead, and passes that information up to the email client, which shows some "connection was dropped by the server" type message.

From the user's perspective, it's a really annoying failure mode that looks like a problem with our server, even though it's actually caused by a poor NAT implementation in their modem/router.

However, there is a workaround for this. At the TCP level, there's a feature called keepalive that allows the operating system to send regular "is this connection still open?" type packets back and forth between the server and the client. By default keepalive isn't turned on for connections, but it is possible to turn it on via a socket option. nginx, our frontend IMAP proxy, allows you to turn this on via a so_keepalive configuration option.

However, even after you've enabled this option, the default idle time before keepalive "ping" packets are sent is 2 hours. Fortunately, there's a Linux kernel tuneable, net.ipv4.tcp_keepalive_time, that lets you control this value.

Lowering this value to 4 minutes causes TCP keepalive packets to be sent from the server to the client every 4 minutes over open but idle IMAP connections. The packets themselves don't contain any data, but they do cause the existing NAT mapping to be marked as "alive" on the user's modem/router. So NAT mappings on poor routers that would normally time out after 5 minutes of inactivity are kept alive, the user doesn't see the nasty broken-connection problem described above, and there's no visible downside for the user either.
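For reference, here's roughly what this looks like at the socket level. This is a Python sketch of the same idea using Linux-specific socket options; in production the effect comes from nginx's so_keepalive option combined with the kernel tuneable, not from code like this:

import socket

def enable_keepalive(sock: socket.socket, idle_seconds: int = 240) -> None:
    """Enable TCP keepalive on a connected socket (Linux-specific options)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Idle time before the first keepalive probe is sent -- the per-socket
    # equivalent of lowering net.ipv4.tcp_keepalive_time to 240 seconds.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    # How often to re-probe, and how many failed probes before the
    # connection is declared dead.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)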

So this is how things have been for the last 4-5 years, which has worked great.

Unfortunately, there’s a new and recent problem that has now appeared.

iOS 5 now uses long-lived persistent IMAP connections (apparently previous versions only used short-lived connections). The problem is that our ping packets every 4 minutes mean that the device (iPhone/iPad/iPod) is "woken up" every 4 minutes as well. This means the device never goes into a deeper sleep mode, which causes significantly more battery drain on iOS 5 devices connected to the Fastmail IMAP server.

Given the rapid increase in use of mobile devices like iPhones, and the big difference in battery life it can apparently cause, this is a significant issue.

So we've decided to revisit the need for enabling so_keepalive in the first place. The original reason was poor NAT routers with short NAT table timeouts; that was definitely an observed problem a few years back, but we're not sure how much of a problem it is now. It's possible that the vast majority of modems/routers sold in the last few years have much better NAT implementations. Unfortunately there's no easy way to test this, short of actually disabling keepalive and waiting for users to report issues.

So we've now done that on mail.messagingengine.com, and we'll see over the next week what sort of reports we get. Depending on the number, we have a few options:

  1. If there are lots of problem reports, we'd re-enable keepalive by default, but set up an alternate server name like mail-mobile.messagingengine.com that has keepalive disabled, and tell mobile users to use that server name instead. The problem with this is that many devices now have auto-configuration systems enabled, so users don't even have to enter a server name, and we'd have to work out how to get that auto-configuration to use a different server name.
  2. If there aren't many problem reports, we'd leave keepalive off by default, but set up an alternative server name like mail-keepalive.messagingengine.com that has keepalive enabled, and for users who report connection "freezing" problems, we'd tell them to switch to that server name instead.
  3. Ideally, we'd detect what sort of client was connecting and turn keepalive on or off as needed. This might be possible using software like p0f, but integrating that with nginx would require a bit of work, and it still leaves the problem of an iPhone user who is in their office/home all day on a wireless network with a poor NAT router: would they prefer the longer battery life, or the better connectivity experience?

I’ll update this post in a week or two when we have some more data.

Posted in News, Technical.

DKIM signing outgoing email with From address domain

DKIM is an email authentication standard that allows senders to sign an email with a particular domain, and receivers to confirm that the email was signed by that domain and hasn't been altered. There's some more information about how DKIM is useful in this previous blog post. We've been DKIM signing all email sent via FastMail for the last 2 years.

In the original design of DKIM, the domain that signed the email had no particular relationship to the domain in the From address of the email. This was particularly useful for large email providers like us: we host tens of thousands of domains, but would sign all email with just our "generic" messagingengine.com domain.

However, this state of affairs is beginning to change. Standards like Author Domain Signing Practices explicitly link the domain of the email address in the From header to the DKIM signing domain. Gmail has also recently changed its web interface so that email sent with a From domain that's different from the DKIM signing domain may be shown with an extra "via messagingengine.com" notice next to the sender name.

So we've now rolled out new code that changes how all emails sent through FastMail are DKIM signed. We still sign with messagingengine.com (as we always have), but where possible we now also sign with a separate key for the domain used in the From address header (see below for more details).

For most users, there should be no noticeable difference. For users who use virtual domains at FastMail, or have their own domain in a family/business account, Gmail should no longer show "via messagingengine.com" on messages you send via FastMail (provided your DNS is correctly set up; see below for more details).

For users that host their DNS with FastMail (eg. nameservers for your domain are ns1.messagingengine.com and ns2.messagingengine.com), this will "just work". We’ve generated DKIM public/private keys for all domains in our database, and automatically do so when new domains are added. We also publish the public keys for all domains via ns1.messagingengine.com/ns2.messagingengine.com.

In general if you can, we highly recommend hosting your DNS with us. For most cases the default settings we provide "just work", and if you need to customise your DNS, our control panel allows you to add any records of any type, without the arbitrary limitations many other DNS providers have.

However, for users who host DNS for their domains externally and want to continue to do so, you'll have to explicitly add the DKIM public key using your domain hoster's DNS management interface. Unfortunately there are hundreds of different DNS providers out there, so we can't give specific directions for each one.

The general steps are:

  1. Log in to your FastMail account and go to Options -> Virtual Domains (or Manage -> Domains for a family/business account).
  2. Scroll to the bottom, and you'll see a new "DKIM signing keys" section. For each domain you have, you'll see a DKIM public key.
  3. Log in to your DNS provider, create a new TXT record for each domain listed, and use the value in the "Public Key" column as the TXT record data to publish.

Important: You have to add the TXT record for the domain name shown in the DKIM signing keys section, which will be mesmtp._domainkey.yourdomain.com. Do not add it for the base domain name yourdomain.com; that won't work.

That should be it.

Note that initially each domain is marked as DKIM disabled (Enabled column = [ ]). While a domain is DKIM disabled, we won't sign any sent emails. This is to avoid DKIM verification failures when the receiving side tries to look up the public key and fails to find it. We regularly check each domain to see if the correct public key TXT record is being published. If it is, we mark the domain in our database as "DKIM enabled" (Enabled column = [*]), and then begin signing sent emails.

So after you set up the records at your DNS provider, you should wait a few hours, then check this table again to see that the domain is now correctly DKIM enabled.
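You can also check from your own machine that the key is visible. A quick sketch, assuming the third-party dnspython package (2.x API) and a placeholder domain:

import dns.resolver   # third-party "dnspython" package (2.x API)

name = "mesmtp._domainkey.example.com"   # substitute your own domain here

try:
    answers = dns.resolver.resolve(name, "TXT")
except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
    print(f"{name}: no TXT record published yet")
else:
    for rdata in answers:
        txt = b"".join(rdata.strings).decode()
        print(f"{name}: {txt}")   # should include the "p=..." public key data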

Some other technical notes:

There's currently no way to change the public/private key pair used to sign emails or to upload your own. We always generate our own key pair for each domain and use the DKIM selector "mesmtp" to sign emails. This shouldn't be a problem. If you're transitioning from another provider to FastMail, you can use our custom DNS to publish the DKIM record of the previous provider with its selector, as well as our own, during the transition; vice-versa for transitioning away from FastMail. The only other reason to change the selector would be if the private key was compromised, which should never happen as it's stored securely in FastMail's systems.

Posted in News, Technical.

New XMPP/Jabber server

This is a technical post. Fastmail users subscribed to receive email updates from the Fastmail blog can ignore this post if they are not interested.

We’ve just replaced the XMPP/Jabber server we use for our chat service. Previously we had been using djabberd. While this worked well for us for the last few years, unfortunately it hasn’t been receiving much development recently. This means many newer XMPP extensions aren’t available.

We looked at a number of alternative server options: Tigase, Prosody, ejabberd, OpenFire. In the end, we settled on ejabberd because of its relative maturity, good administration documentation, widespread use in existing large installations, active development community, and its support for multiple domains (in the newest version).

Fortunately, our existing architecture separated the XMPP/Jabber server from the backend storage details of our system (eg. user lists, user rosters, chat logging, etc) via an HTTP JSON API. Because of this, it was fairly straightforward to completely remove djabberd, write the equivalent interfacing components for ejabberd, and slot that into place. A perfect two-month piece of work for our summer intern Samuel Wejeus. Thanks Samuel!

That work has now been done, and yesterday we completely removed djabberd and replaced it with ejabberd. For users of our chat service, there shouldn't actually be any noticeable difference at this point; everything should just continue to work as it did, but with this new base we should be able to add more features in the future.

Posted in News, Technical.

HTTP keep-alive connection timeouts

This is a technical post. Regular Fastmail users subscribed to receive email updates from the Fastmail blog can just ignore this post.

The average user of the Fastmail website is probably a bit different from the average user of most websites. Webmail tends to be a "productivity application" that people use for an extended period of time. So for the number of web requests we get, we probably have fewer individual users than other similarly sized sites, but the users we do have tend to stay for a while and do lots of actions/page views.

Because of that, we like to have a long HTTP keep-alive timeout on our connections. This makes interactive response nicer for users: moving to the next message after spending 30 seconds reading the current one is quick, because we don't have to set up a new TCP connection or SSL session; we just send the request and get the response over the existing keep-alive connection. Currently we set the keepalive timeout on our frontend nginx servers to 5 minutes.

I did some testing recently, and found that most clients didn't actually keep the connection open for 5 minutes. Here are the figures I measured based on Wireshark dumps.

  • Opera 11.11 – 120 seconds
  • Chrome 13 – at least 300 seconds (server closed after 300 second timeout)
  • IE 9 – 60 seconds (changeable in the registry, appears to apply to IE 8/9 as well though the page only mentions IE 5/6/7)
  • Firefox 4 – 115 seconds (changeable in about:config with network.http.keep-alive.timeout preference)

I wondered why most clients used <= 2 minutes, but Chrome was happy with much higher.

Interestingly, one of the other things I noticed while doing this test with Wireshark is that after 45 seconds, Chrome would send a TCP keep-alive packet, and would keep doing that every 45 seconds until the 5-minute timeout. No other browser did this.

After a bunch of searching, I think I found out what’s going on.

It seems there are some users behind NAT gateways/stateful firewalls that have a 2 minute state timeout. So if you leave an HTTP connection idle for > 2 minutes, the NAT/firewall starts dropping any new packets on the connection, and doesn't even RST the connection, so TCP goes into a long retry mode before finally reporting to the application that the connection timed out.

To the user, the visible result is that after doing something with a site, if they wait > 2 minutes and then click on another link/button, the action will just take ages to eventually time out. There's a Chrome bug about this here:

http://code.google.com/p/chromium/issues/detail?id=27400

So the Chrome solution was to enable SO_KEEPALIVE on sockets. On Windows 7 at least, this seems to cause TCP keep-alive pings to be sent after 45 seconds and every subsequent 45 seconds, which avoids the NAT/firewall timeout. On Linux/Mac I presume this is different, because the equivalent kernel tuneables default to much higher values. (Update: I didn't realise you can set the idle time and interval for keep-alive pings at the application level on Linux and Windows.)

This allows Chrome to keep truly long-lived HTTP keep-alive connections. Other browsers seem to have worked around the problem by just closing connections after <= 2 minutes instead.

I've mentioned this to the Opera browser network team, so they can look at doing this in the future as well, to allow longer-lived keep-alive connections.

I think this is going to be a particularly acute problem with Server-Sent Events type connections, which can be extremely long-lived. We're either going to have to send application-level server -> client pings over the channel every 45 seconds to make sure the connection is kept alive, or enable a very low keep-alive time on the server and enable SO_KEEPALIVE on each connected event source socket.
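As a rough illustration of the first option (a minimal sketch, not our production code; the 45-second figure simply mirrors the Chrome behaviour described above), an application-level ping over a Server-Sent Events stream is just a periodic comment line:

import time

def sse_body(events):
    """Generate a Server-Sent Events response body with periodic pings.

    `events` is any iterator that yields an event string when one is ready,
    or None when there is currently nothing to send.
    """
    last_sent = time.monotonic()
    for event in events:
        now = time.monotonic()
        if event is not None:
            yield f"data: {event}\n\n"
            last_sent = now
        elif now - last_sent >= 45:
            # SSE comment line: ignored by EventSource clients, but it keeps
            # NAT gateways and stateful firewalls seeing traffic.
            yield ": ping\n\n"
            last_sent = now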

robm@fastmail.fm

Posted in Technical.

Download non-english filenames

This is a technical post. Regular Fastmail users subscribed to receive email updates from the Fastmail blog can just ignore this post.

When you want to send a file for download in response to a web request, it's well known that you can just set the Content-Disposition header to attachment to get the browser to download the content and save it locally on the user's machine. Additionally, you can add a filename parameter to control the filename displayed to users.

Content-Disposition: attachment; filename="foo.txt"

The problem comes when the filename contains non-English (really, non-ASCII) characters. RFC2231 defines a way of adding character set encodings to MIME parameters. Unfortunately support for this RFC is patchy, and browsers have implemented various internal hacks/workarounds (eg. %-encoded URL octets). The situation is sufficiently complicated that someone came up with a comprehensive set of tests, and there's a good stackoverflow answer.

However, looking over the test case examples, I realised that there appeared to be a solution that would work quite well on all browsers except Safari. The attwithfn2231utf8 test shows that all modern browsers except IE and Safari support the RFC2231 encoding. The attfnboth test shows that if you have a traditional filename parameter followed by an RFC2231 filename* parameter, IE and Safari pick the traditional parameter. The attwithfnrawpctenclong test shows that if you use %-encoded URL octets in a traditional filename parameter, IE attempts to decode them as UTF-8 octets.

Putting that together, if you want to send a file called foo-ä.html, then setting a header of:

Content-Disposition: attachment; filename="foo-%c3%a4.html"; filename*=UTF-8''foo-%c3%a4.html

This will cause IE8+, Opera, Chrome and FF4+ (but not Safari) to correctly save a file named foo-ä.html. It should be easy to generate with any URL escaping library that percent-encodes UTF-8 octets outside the unreserved character set.
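For example, a small Python sketch (standard library only) that builds such a header value:

from urllib.parse import quote

def attachment_disposition(filename: str) -> str:
    # Percent-encode the UTF-8 octets of everything outside the unreserved set.
    encoded = quote(filename.encode("utf-8"), safe="")
    # Plain `filename` for IE/Safari, RFC 2231 style `filename*` for the rest.
    return f"attachment; filename=\"{encoded}\"; filename*=UTF-8''{encoded}"

print("Content-Disposition: " + attachment_disposition("foo-ä.html"))
# Content-Disposition: attachment; filename="foo-%C3%A4.html"; filename*=UTF-8''foo-%C3%A4.html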

robm@fastmail.fm

Posted in Technical.

World IPv6 day

This is a technical blog post about Fastmail's support of an internet standard called IPv6. Most users shouldn't notice any difference at all, and can ignore this post. For those interested, we've included a description of what we've done to support IPv6.

We didn't actually get organised enough to register ourselves for World IPv6 Day, but we got IPv6 up and running in time, so we've enabled it anyway. You can read more about World IPv6 Day here: http://www.worldipv6day.org/

Our prefix for NYI is 2610:1c0:0:1::, and for convenience we're mapping all the IP addresses from our IPv4 space (66.111.4.0/24) as direct offsets into that IPv6 space. All the public service IPs are now bound on both IPv4 and IPv6. There was some magic required to support our failover system, because Linux doesn't offer an option to bind non-local IPv6 addresses: we do a little dance where the address is bound either to the loopback interface on one machine as a /128 (host-only) address, or to the external interface as a /64 (fully networked) address, depending on where the service is located. It seems to work OK, which is the main thing!
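To illustrate what an offset mapping like this can look like (a hypothetical sketch of one such scheme, not necessarily the exact one we use), each IPv4 host's offset within the /24 is simply added to the IPv6 prefix:

import ipaddress

V4_NET = ipaddress.ip_network("66.111.4.0/24")
V6_BASE = ipaddress.ip_address("2610:1c0:0:1::")

def v4_to_v6(addr: str) -> ipaddress.IPv6Address:
    # Hypothetical illustration: the host's offset within 66.111.4.0/24
    # is added to the base of the IPv6 prefix.
    offset = int(ipaddress.ip_address(addr)) - int(V4_NET.network_address)
    return ipaddress.IPv6Address(int(V6_BASE) + offset)

print(v4_to_v6("66.111.4.57"))   # 2610:1c0:0:1::39 (0x39 == 57)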

Due to RBL issues and the lack of working reverse DNS, we have not enabled IPv6 for inbound or outbound SMTP. Our DNS server also doesn't support IPv6 connectivity, so all your DNS queries will still be over IPv4.

The domains with IPv6 support (AAAA records) are:

  • mail.messagingengine.com
  • web.messagingengine.com
  • dav.messagingengine.com
  • www.fastmail.fm
  • mail.opera.com

This will pick up the majority of web, IMAP, POP and authenticated SMTP traffic. If you have IPv6 connectivity, you should be transparently using IPv6 now. We are seeing random bits of IPv6 traffic in the logs, so it's clearly working.
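If you want to check from your own machine whether you'd get an IPv6 address for one of these hosts, a quick sketch (Python standard library only):

import socket

host = "mail.messagingengine.com"
try:
    # Ask the resolver for IPv6 (AAAA) results only.
    results = socket.getaddrinfo(host, 993, socket.AF_INET6)
except socket.gaierror:
    print(f"{host}: no AAAA records (or this machine has no IPv6 resolver support)")
else:
    for *_, sockaddr in results:
        print(f"{host} has IPv6 address {sockaddr[0]}")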

Our FTP server doesn't support the IPv6 EPRT and EPSV commands, so I haven't added a record for ftp.messagingengine.com.

You can also try ipv6.messagingengine.com if you want to guarantee you're using IPv6 only. That host doesn't have an A record for IPv4.

Unless there are reports of significant problems with this experiment, we will remain dual-stack into the future :)

Posted in Technical.

Outage report – a cascade of errors

On Friday we had one of the worst outages we’ve had in over 3 years. For at least 3 hours, all accounts were inaccessible (web, IMAP and POP), and for a few users, it was several hours longer than that. For some other users, there were additional mailbox and quota problems after that.

Obviously this is something we never want to happen, and over the years we've set up many systems to avoid outages like this.

Summary

A small “trivial” configuration change that was rolled out caused a cascading series of events that resulted in some important files being corrupted. We had to take down all services, rebuild the corrupted files from the last backup, and add back in any changes made since that backup. Once the files were rebuilt, we were able to bring all services back up. A separate corruption issue that affected a few users caused some longer outages and quota problems until we fixed those mailboxes.

Future mitigation

We've identified the chain of events that caused the problems. Because it was a chain, there are at least 5 separate issues to fix so that this problem, or something similar to it, won't happen again; we'll be fixing those over the next week.

Of course we can never be 100% sure that there aren't other cascade paths that could cause outages, but by learning from past mistakes, fixing the known problems, continuously enhancing our test infrastructure, and being more aware of possible consequential errors in the future, we aim to minimise the chance of them occurring and to provide the highest reliability possible.

Technical description

Although it's draining and frustrating to see a large problem like this occur and affect our users so badly, it's been fascinating to investigate. What we end up seeing is how a set of little mistakes, bad timing, decisions made long ago, and human error (all of which are wonderfully obvious in hindsight) combine to cause a much bigger problem than the initial trigger would ever suggest.

The domino effect

In this case, the problem stemmed from a cascading sequence of issues that started with a single misplaced comma.

  1. Cyrus configuration file error. The underlying trigger of the cascade was a single misplaced comma in a configuration file. The error was actually detected by the developer during testing; they fixed the problem, but unfortunately pushed the fix to a different branch.
  2. Core dump behaviour of fatal(). The effect of the broken configuration file was that immediately after forking a new imapd process, it would try to parse the configuration file, fail to do so, and call the fatal() function. Normally that would just cause the process to exit. However our branch of cyrus has a patch we added that means all calls to the fatal() function dump core instead of just exiting; this is normally very useful for debugging and quality control.
  3. Kernel configured to add pid to core files. We also configure our kernels with the sysctl kernel.core_uses_pid=1 which ensures that each separate process crash/abort() generates a separate core file on disk rather than overwriting the previous one. Again this is very useful for debugging.
  4. Cyrus master process doesn’t rate limit forking of child processes. The cyrus master process that forks child processes doesn’t do enough sanity checking. Specifically, if an imapd exits immediately after forking, the master process will happily immediately fork another imapd, despite there being zero chance that the new imapd will do any better. At the very least this leads to a CPU-chewing loop (as well as a non-functional imapd) as each forked imapd process immediately exits and the master creates a new one.
  5. Core files end up in cyrus meta directory. Cyrus supports the concept of separating meta-data files from email data. This is very useful as it allows us to place the small but "hot" meta data files on fast (but small) drives (eg. 10k/15k RPM drives, or SSD drives in new machines), and place the email data files on slower and much larger disks. The "cores" directory where core dumps end up is located on the same path as the meta data directory.
  6. Cyrus skiplist database format can corrupt in disk full conditions. Cyrus stores some important data in an internal key/value database format called skiplist. The most important data is a list of all mailboxes on the server. This database format works very well for the way cyrus accesses data; it's been very fast and robust. However, it turns out the code doesn't handle the situation where a disk fills up and writes only partially succeed, causing database corruption.

Putting all of the above together creates the disaster. A small configuration change was rolled out. Every new incoming IMAP connection would cause a new imapd to be forked, which would immediately abort and dump core. Each core file would end up with a separate filename. This very quickly caused the cyrus meta partitions to fill up; they reached 100% full before we fully realised what was happening. That caused changes to the mailboxes databases to be only partially written, corrupting a lot of them.

When we realised this was what had happened, we quickly stopped all services, undid the change, and tried to recover the corrupted databases. Fortunately the databases are backed up every half hour, and there are replicas as well. Using some libraries, we were able to quickly put together code that pulled any still-valid records, plus records from the backup and from the replicas, combined them to rebuild the mailboxes databases, and then started everything back up.

Adding insult to injury

Fortunately for most people, the mess stopped there. Unfortunately for a few users, there were some additional problems as well.

As well as the mailboxes database corruptions, it was discovered that the code that maintains the cyrus.header and cyrus.index files also didn’t like the partial writes that disk full conditions generate. This caused a small number of mailboxes to be completely inaccessible (Inbox select failed).

Fortunately cyrus has a simple utility called "reconstruct" to fix corruption of this form, so we ran that to fix up any broken mailboxes. Fixing up a mailbox with reconstruct doesn't fix up quota calculations, however; there's a separate utility, "quota", that you can run with a -f flag to fix quotas. We ran that on users to make sure all quota calculations were correct.

Unfortunately there's a bug in the quota fix code that in some edge cases can double the apparent quota usage. This caused a number of accounts to have an incorrect quota usage set, and in some cases pushed them over their quota, causing new messages to be delayed or bounced.

The little test that cried wolf

However, the story wouldn't be complete without the additional human errors that let this happen. Thanks to help from the My Opera developer Cosimo, we have a Jenkins continuous integration (CI) server set up internally. This means that on every code/configuration commit, the following tests occur:

  • Roll back a virtual machine instance to a known state
  • Start the virtual machine
  • git pull the latest repository code
  • Re-install all the configuration files
  • Start all services
  • Run a series of code unit tests
  • Run a series of functional tests that test logging into the web interface, sending email, receiving the email, reading the email, and much more. There’s also a series of email delivery tests to check that all the core aspects of spam, virus and backscatter checking work as expected.

These tests should have picked up the problem. So what happened? Well, a day or so before the problem commit, another change occurred that altered the structure of branches on the primary repository. This caused the git pull the CI server does to fail, and thus the CI tests to fail.

While we were in the process of working out what was wrong and fixing this on the CI server (only pulling the "master" branch turned out to be the easiest fix), the problematic commit went in. So once we fixed up the git pull issue, we found that the very first test on the CI server was failing with a strange IMAP connection failure error. Rather than believing this was a real problem, we assumed it was due to something else in the CI tests being broken after the branch changes, and resolved to look at it on Monday. Of course the test really was failing due to a bad commit, and, as Murphy's Law would dictate, someone else then rolled the broken commit out on Friday, Norway time.

Putting it all together

A combination of software, human and process errors came together to create an outage that affected all Fastmail users for several hours at least, and some users for even longer.

Obviously we want to ensure this particular problem doesn't happen again and, more importantly, that processes are in place where possible to avoid other cascade-type failures in the future. We already have tickets to fix the particular code- and configuration-related issues in this case:

  • link cyrus cores and log directories to another partition
  • make skiplist more robust in case of a disk full condition
  • make cyrus.header/index/cache more robust in case of a disk full condition
  • cyrus master respawning should back off in the case of immediate crashes (a rough sketch of the idea follows this list)
  • fix cyrus quota -f to avoid random quota usage doubling
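For that respawn item, the idea is a standard one: track how quickly children are dying and delay the next fork when they die immediately. A generic Python sketch of the logic (illustrative only, not cyrus code):

import subprocess
import time

MIN_HEALTHY_LIFETIME = 5   # seconds a child must survive to reset the backoff
MAX_BACKOFF = 60           # cap on the delay between respawn attempts

def supervise(command):
    """Respawn `command` forever, backing off when it crashes immediately."""
    backoff = 1
    while True:
        started = time.monotonic()
        subprocess.run(command)             # blocks until the child exits
        lifetime = time.monotonic() - started
        if lifetime < MIN_HEALTHY_LIFETIME:
            # The child died straight away (e.g. a broken config file), so
            # wait before retrying instead of forking in a tight loop.
            time.sleep(backoff)
            backoff = min(backoff * 2, MAX_BACKOFF)
        else:
            backoff = 1                     # the child ran for a while; reset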

We've also seen how important it is to keep the CI tests clean and to track down all failures. We've immediately added new tests that sanity-check database, IMAP and SMTP connections as a very first step, before any other functional tests are run. If any of them fail, we tail the appropriate log files and list the contents of the core directories, so the CI failure emails sent to all developers make it very clear that there's a serious problem needing immediate investigation.

Posted in News, Technical.

New m., old., beta. and ssl. site prefixes

We've created a set of new domains for users to use to enable/test various features. These new prefixes can be added to the front of whichever domain you use. For instance, instead of http://www.fastmail.fm, you can use http://m.fastmail.fm, http://old.fastmail.fm, http://beta.fastmail.fm or http://ssl.fastmail.fm. This also applies to all Fastmail domains, such as http://m.eml.cc, http://old.myfastmail.com, http://beta.sent.com, etc.

Force the mobile web interface via m.

Fastmail will attempt to detect if you're using a mobile device (eg Opera Mini, Opera Mobile, iPhone, etc), and display a mobile-optimised version of the site if it detects that to be the case. However it's not possible to detect all devices, and in some cases the detection may be incorrect.

Using the m. prefix (eg http://m.fastmail.fm) will force the mobile version of the site to be displayed.

Note that this is separate from the WAP version of the site, which is a very simple interface optimised for extremely low-end phones that only have a WAP browser (which is different from a web browser). We generally don't recommend the WAP site; for most low-end phones we recommend using Opera Mini with the mobile http://m.fastmail.fm site.

Use the old web interface via old.

As mentioned the other day, we’re moving the old web interface from http://www.fastmail.fm/old/ to http://old.fastmail.fm.

The old web interface is deprecated. No more development or updates are being made to it. Features will be progressively disabled where they conflict with new changes (eg database changes, IMAP server changes, etc). We highly recommend users of the old web interface switch to the new interface. The improved search and keyboard shortcuts alone are a huge productivity improvement.

Sometime soon we'll also be removing the user web interface preference, so that to log in to the old web interface you will have to use http://old.fastmail.fm; using http://www.fastmail.fm will always log you in to the new interface. We'll be letting users of the old web interface know about that change shortly.

Use the beta web interface via beta.

The beta interface is where we test new features before rolling them out to production. We try to keep the beta interface stable, but we definitely don't guarantee it: it may have serious bugs that cause downtime and/or email loss. If you like living on the bleeding edge you can use it, but for general day-to-day usage we don't recommend it.

Previously the beta server lived at http://www.fastmail.fm/beta/ but is moving to http://beta.fastmail.fm.

Force redirect to the https:// (SSL encrypted) version of the site via ssl.

Fastmail supports over 100 different domains for users to sign up at, as well as thousands of hosted domains for users, families and businesses. Unfortunately, because of the way SSL encryption works, you need a separate SSL certificate for every domain (yes, there are some exceptions to this such as wildcards and SANs, but the general rule applies). It would be prohibitively expensive to buy SSL certificates for every domain we support.

Instead, whenever we want to secure a connection, we redirect the user to our https://www.fastmail.fm domain. However, this can be a little confusing for users if we do it immediately when they go to one of our other addresses like http://eml.cc, http://sent.com, etc, so over the years we've built up a slightly complex set of rules.

When you first enter a domain in your browser (eg http://eml.cc), we don't redirect. However, if you click "Secure Login", we point the target of the POST request at https://www.fastmail.fm so the content (eg your username and password) is encrypted.

At that point, we also set a cookie on the eml.cc domain, so the next time you go to http://eml.cc it immediately redirects to https://www.fastmail.fm and everything is encrypted from the start. This was done because the default login button ("Secure Login" or "Login") used to depend on whether you were at an https:// or http:// address respectively; if you clicked "Secure Login", we assumed you wanted "Secure Login" to be the default next time. This isn't as relevant now that "Secure Login" is always the default, but it's still good practice to redirect to the secure site immediately.

Adding to these issues is the way usernames and domains interact. If you have an account bob@eml.cc and you go to http://eml.cc, you can log in with just the username "bob". However, if you go to http://www.fastmail.fm, you have to log in with the full name "bob@eml.cc". Note that the short form works even if a redirect occurs. That is, if you go to http://eml.cc and you're redirected to https://www.fastmail.fm because of a previous "Secure Login" you did, you can still log in with just "bob"; Fastmail remembers the "original" domain you arrived on.

However, this doesn't help users who use public terminals a lot, since there won't be a redirect cookie to send them to the secure site by default. To help security-conscious users with non-@fastmail.fm addresses, we've added ssl. prefixes to all sites (eg. http://ssl.eml.cc), which cause an immediate redirect to https://www.fastmail.fm while remembering the original domain.

This is a small tweak, but a useful one for security-conscious people.

Posted in News, Technical.