A story of leaping seconds

I’ve been promising to blog more about the technical side of the FastMail/Opera infrastructure, and the recent leap second fiasco is a good place to point out not only some failures, but also the great things about our system which helped limit the damage.

Being on duty

First of all, it was somewhat random that I was on duty. We’ve recently switched to a more explicit “one person on primary duty” roster from the previous timezone and time-of-day based rules – partly so other people know who to contact, and partly to bring some predictability to when you have to be responsible.

Friday 29th June was my last working day for nearly 3 weeks. My wife is on the other side of the world, and the kids are off school. I was supposed to be handing duty off to one of the Robs in Australia (we have two Robs on the roster – one Rob M and the other Rob N. This doesn’t cause any confusion because we’re really good at spelling). I had made some last minute changes (as you always do before going away) and wanted to keep an eye on them, so I decided to stay on duty.

Still, it was the goodbye party for a colleague that night, so I took the kids along to Friday Beer (one of the awesome things about working for Opera) and had quite a few drinks. Not really drunk, but not 100% sober either. Luckily, our systems take this into account… in theory.

The first crash

Fast forward to 3:30am, and I was sound asleep when two SMSes came in quick succession. Bleary eyed, I stumbled out to the laptop to discover that one of our compute servers had frozen. Nothing on the console, no network connectivity, no response to keystrokes via the KVM. These machines are in New York, and I’m in Oslo, so kicking it in person is out of the question – and techs onsite aren’t going to get anything I don’t.

These are Dell M610 blades. Pretty chunky boxes. We have 6, split between two bladecentres. We can easily handle running on 3, mail delivery just slows by a few milliseconds – so we have each one paired with one in another bladecentre and two IP addresses shared between them. In normal mode, there’s one IP address per compute server.  (amongst the benefits of this setup – we can take down an entire bladecentre for maintenance in about 20 minutes – with no user visible outage)

Email delivery is hashed between those 6 machines so that the email for the same user goes to the same machine, giving much better caching of bayes databases and other per-user state. Everything gets pushed back to central servers on change as well, so we can afford to lose a compute server, no big deal.

But – the tool for moving those IP addresses, “utils/FailOver.pl”, didn’t support the ‘-K servername’ parameter, which means “assume that server is dead, and start up the services as if they had been moved off there”.  It still tries to do a clean shutdown, but on failure it just keeps going. The idea is to have a tool which can be used while asleep, or drunk, or both…

The older “utils/FailOverCyrus.pl”, which is still used just for the backend mail servers, does support -K. It’s been needed before. The eventual goal is to merge these two tools into one, which supports everything. But other things keep being higher priority.

So – no tool. I read the code enough to remind myself how it actually works, and hand wrote SQL statements to perform the necessary magic on the database so that the IP could be started on the correct host.

I could have just bound the IP by hand as well. But I wanted to not confuse anything. Meanwhile I had rebooted the host, and everything looked fine, so I switched back to normal mode.

Then (still at 3-something am) I added -K support to FailOver.pl. It took a couple of attempts and some testing to get right, but it was working well before I wrote up the IncidentReport and went back to bed some time after 4.

The IncidentReport is very important – it tells all the OTHER admins what went wrong and what you did to fix it, plus anything interesting you saw along the way. It’s super useful if it’s not you dealing with this issue next time. It included the usage instructions for the new ‘-K’ option.

Incident 2 – another compute server and an imap server

But it was me again – 5:30am, the next blade over failed. Same problem, same no way to figure out what was wrong. At least this time I had a nice tool.

Being 5:30am, I just rebooted that blade and the other one with the same role in the same datacentre, figuring that the thing they had in common was bladecentre2 compute servers.

But while this was happening, I also lost an imap server. That’s a lot more serious. It causes user visible downtime – so it gets blogged.

From the imap server crash, I also got a screen capture, because blanking wasn’t enabled for the VGA attached KVM unit. Old server. This capture potentially implicated ntpd, but at the time I figured it was still well in advance of the leap second, so probably just a coincidence. I wasn’t even sure it was related, since the imap server was an IBM x3550, and the other incidents had been blades. Plus it was still the middle of the night and I was dopey.

Incident 3

Fast forward to 8:30am in the middle of breakfast with my very forgiving kids who had let me sleep in. Another blade, this time in the other blade centre. I took a bit longer over this one – decided the lack of kernel dumps was stupid, and I would get kdump set up. I already had all the bits in place from earlier experiments, so I configured up kdump and rolled it onto all our compute servers. Rob M (see, told you I could spell) came online during this, and we chatted about it and agreed that kdump was a good next step.

I also made sure every machine was running the latest 3.2.21 kernel build.

Then we sat, and waited, and waited. Watched pots never boil.  I also told our discussion forum what was going on, because some of our more technical users are interested.

Incident 4

One of the other IMAP servers, this time in Iceland!

We have a complete live-spare datacentre in Iceland. Eventually it will be a fully operational centre in its own right, but for now it’s running almost 100% in replica mode. There are a handful of internal users with email hosted there to test things (indeed, back at the start, my “urgent” change – the one which meant I was still on call – was changing how we replicate filestorage backends so that adding a second copy in Iceland as production rather than backup didn’t lead to vast amounts of data being copied through the VPN to New York and back while I was away).

So this was a Dell R510 – one of our latest “IMAP toasters”. 12 x 2Tb SATA drives for data (in 5 x RAID1 and 2 hot spares) for 20 x 500Gb cyrus data stores, 2 x SSD in RAID1 for super-hot Cyrus metadata, 2 x quadcore processor, 48 Gb RAM. For about US $12k, these babies can run hundreds of thousands of users comfortably. They’re our current sweet spot for price/performance.

No console of course, and no kdump either. I hadn’t set up the Iceland servers with kdump support yet. Doh.

One thing I did do was create an alias. Most of our really commonly used commands have super short aliases. utils/FailOver.pl is now “fo”. There are plenty of good reasons for this.

Incident 5

One reason is phones. This was Saturday June 30th, and I’m taking the kids to a friend’s cabin for a few days during the holidays. They needed new bathers, and I had promised to take them shopping. Knowing it could be hours until the next crash, I set up the short alias, checked I could use it from my phone, and off we went. I use Port Knocker + Connect Bot from my Android phone to get a secure connection into our production systems while I’m on the move.

So incident 5 was another blade failing – about 5:30pm Oslo time, while I was in the middle of a game shop with the kids. Great I thought, kdump will get it. I ran the “fo” command from my phone, 20 seconds of hunt and peck on a touch keyboard, and waited for the reboot. No reboot. Couldn’t ping it.

Came home and checked – it was frozen. Kdump hadn’t triggered. Great.

As I cooked dinner, I chatted with the sysadmins from other Opera departments on our internal IRC channel. They confirmed similar failures all around the world in our other datacentres. Everything from unmodified Squeeze kernels through to our 3.2.21 kernel. I would have been running something more recent, but the bnx2 driver has been broken with incorrect firmware since then – and guess what the Dell blades contain. Other brands of server crashed too, so it wasn’t just Dell.

Finding a solution

The first thing was to disable console blanking on all our machines. I’ve been wanting to do it for a while, but I down-prioritised it after getting kdump working (so I thought). By now we were really suspicious of ntp and the leap second, but didn’t have “proof”. A screen capture of another crash with ntp listed would be enough for that. I did that everywhere – and then didn’t get another crash!

Another thing I did was post a question to superuser.com. Wrong site – thankfully someone moved it to serverfault.com where it belonged. My fault – I have been aware of these sites for a while, but not really participated. The discussion took a while to start up, but by the time the kids were asleep, it had exploded. Lots of people confirming the issue, and looking for workarounds.  I updated it multiple times as I had more information from other sysadmins and from my own investigations.

Our own Marco had blogged about solutions to the leap second, including smearing, similar to what Google have done.  I admit I didn’t give his warnings the level of attention I should have – not that any of us expected what happened, but he was at least aware of the risk more than some of us cowboys.

Marco was also on IRC, and talked me through using the Perl code from his blog to clear the leap second flag from the kernel using adjtimex. I did that, and also prepared the script a bit for more general use and uploaded it to my filestorage space to link into the serverfault question, where it could be useful to others.

By now my question was the top trending piece of tech news, and I was scoring badges and reputation points at a crazy rate. I had killed NTP everywhere on our servers and cleared the flag, so they were marching on oblivious to the leap second – no worries there. So I joined various IRC channels and forums and talked about the issue.

I did stay up until 2am Oslo time to watch the leap second in.  In the end the only thing to die was my laptop’s VPN connection, because of course I hadn’t actually fixed the local end (also running Linux).  There was a moment of excitement before I reconnected and confirmed that our servers were all fine.  10 minutes later, I restarted NTP and went to bed.

The aftermath: corruption

One of the compute server crashes had corrupted a spamassassin data file, enough that spam scanning was broken.  It took user reports for us to become aware of it.  We have now added a run of ‘spamassassin --lint’ to the startup scripts of our compute servers, so we can’t operate in this broken state again.
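A minimal sketch of the kind of startup check we added (the exact script wording here is illustrative, not our actual init script):

    spamassassin --lint || {
        echo "spamassassin configuration failed lint check; refusing to start" >&2
        exit 1
    }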

We also reinstalled the server.  We reinstall at almost any excuse.  The whole process is totally automated.  The entire set of commands was

# fo -a    (remember the alias from earlier)
# srv all stop
# utils/ReinstallGrub2.pl -r     (this one has no alias)
# reboot

and about 30 minutes later, when I remembered to go check

# c4 (alias for ssh compute4)
# srv all start
# c1 (the pair machine for compute4 is compute1)
# fo -m (-a is “all”, -m is “mastered elsewhere”)

And we were fully back in production with that server.  The ability to fail services off quickly, and reinstall back to production state from bare metal, is a cornerstone of good administration in my opinion.  It’s why we’ve managed to run such a successful service with less than one person’s time dedicated to sysadmin.  Until 2 months ago, the primary sysadmin had been me – and I’ve also rewritten large parts of Cyrus during that time, and worked on other parts of our site too.

The other issue was imap3 – the imap server which crashed right back during incident 2.  After it was up, I failed all Cyrus instances off, so it’s only running replicas right now.  But the backup script goes to the primary location by default.

I saw two backup error messages go by today (while eating my breakfast – so much for being on holiday – errors get sent to our IRC channel and I was still logged in).  They were missing file errors, which never happen.  So I investigated.

Again, we have done a LOT of work on Cyrus over the years (mostly, I have), and one thing has been adding full checksum support to all the file formats in Cyrus.  With the final transition to the twoskip internal database format I wrote earlier this year, the only remaining file that doesn’t have any checksum is the quota file – and we can regenerate that from our billing database (for limit) and the other index files on disk (for usage).

So it was just a matter of running scripts/audit_slots.pl with the right options, and waiting most of the day.  The output contains things like this:

# user.censored1.Junk Mail cyrus.index missing file 87040
# sucessfully fetched missing file from sloti20d3p2

# user.censored2 cyrus.index missing file 18630
# sucessfully fetched missing file from slots3a2p2

 The script has enough smarts to detect that files are missing, or even corrupted.  The cyrus.index file contains a sha1 of each message file as well as its size (and a crc32 of the record containing THAT data as well), so we can confirm that the file content is undamaged.  It can connect to one of the replicas using data from our “production.dat” cluster configuration file, and find the matching copy of the file – check that the sha1 on THAT end is correct, and copy it into place.

The backup system knows the internals of Cyrus well.  The new Cyrus replication protocol was built after making our backup tool, and it similarly checks the checksums of data it receives over the wire.  At every level, our systems are designed to detect corruption and complain loudly rather than blindly replicating the corruption to other locations.  We know that RAID1 is no protection, not with the amount of data we have.  It’s very rare, but with enough disks, very rare means a few times a month.  So paranoia is good.

Summary

All these layers of protection, redundancy, and tooling mean that with very little work, even while not totally awake, the entire impact on our users was an ~15 minute outage for the few users who were primary on imap3 (less than 5% of our userbase) – plus delays of 5-10 minutes on incoming email to 16.7% of users each time a compute server crashed.  While we didn’t do quite as well as Google, we did OK!  The spam scanning issue lasted a lot longer, but at least it’s fully fixed for the future.

We had corruption, but we have the tools to detect and recover, and replicas of all the data to repair from.


Building the new AJAX mail UI part 2: Better than templates, building highly dynamic web pages

This is part 2 of a series of technical posts documenting some of the interesting work and technologies we’ve used to power the new interface (see also part 1, Instant notifications of new emails via eventsource/server-sent events). Regular users can skip these posts, but we hope technical users find them interesting.

As dynamic websites constructed entirely on the client side become de rigueur, there are a number of templating languages battling it out to become the One True Way™ of rendering your page. All follow essentially the same style: introduce extra control tags to intersperse with HTML. But if we go back to basics, HTML is simply a way of serialising a tree structure into a text format that is relatively easy for humans to edit. Once the browser receives this, it then has to parse it to generate an internal DOM tree representation before it can draw the page.

In an AJAX style application, we don’t transmit HTML directly to the browser. Instead, we generate the HTML on the client side, and often update the HTML in different parts of the page over time as the user interacts with the application. As string manipulation for building HTML from data objects is hard to write and error-prone, we normally use a template language and a library that compiles these snippets into code; this executes with a data context, producing a string of HTML that may be set as an element’s innerHTML property. The browser then builds a DOM tree, which we can query to update nodes and add event listeners.

There is, however, another alternative for building the DOM tree: directly in JavaScript. Modern browsers are very fast at parsing and executing JavaScript. What if, with the help of a liberal sprinkling of syntactic sugar, we were to build the DOM tree in code instead? Start by considering a simple function el to declare an element.

el( 'div' )

OK, so far we’ve just renamed the document.createElement method. What next? Well, we’re going to want to add class names and ids to elements a lot. Let’s use the CSS syntax which everyone knows and loves.

el( 'div#id.class1.class2' );

Hmm, that’s quite clean and readable compared to:

<div id="id" class="class1 class2"></div>

What else? Well, there may be other attributes. Let’s pass them as a standard hash:

el( 'div#id', { tabindex: -1, title: 'My div' })

That’s pretty neat. Let’s have a quick look at the html for comparison:

<div id="id" tabindex="-1" title="My div"></div>

A node’s not much use on its own. Let’s define a tree:

var items = [ 1, 2, 3, 4 ];
el( 'div#message', [
    el( 'a.biglink', { href: 'http://www.google.com' }, [
        'A link to Google'
    ]),
    el( 'ul', [
        items.map( function( item ) {
            return el( 'li.item', [ item + '. Item' ] );
        })
    ]),
    items.length > 1 ? 'There are lots of items'.localise() + '. ' : null,
    'This is just plain text. <script>I have no effect</script>'
])

So what have we achieved? We’ve got a different way of writing a document tree, which is essentially very similar to HTML but changes the punctuation slightly to make it valid JavaScript syntax instead. So what? Well, the point is this readable declaration is directly executable code; we just need to define the el function: https://gist.github.com/1532562. As it’s pure JS, we can replace static strings with variables. We can easily add conditional nodes, as shown in the example above. We can call other functions to generate a portion of the DOM tree or use array iterators to cleanly write loops. Wrap it all in a function and we can pass different data into the function each time to render our DOM nodes… we have ourselves a template.
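The gist linked above is the real definition; purely as an illustration of the approach (not the production code), an el helper along these lines is enough for the examples in this post:

// Minimal sketch of an el( tag, props, children ) helper.
function el( tag, props, children ) {
    // Allow el( 'ul', [ ... ] ) – the props hash is optional.
    if ( props instanceof Array ) {
        children = props;
        props = null;
    }
    // Parse 'div#id.class1.class2' into tag name, id and class names.
    var parts = tag.split( /([#.])/ ),
        node = document.createElement( parts[0] ),
        classNames = [],
        i, l, key;
    for ( i = 1, l = parts.length; i + 1 < l; i += 2 ) {
        if ( parts[i] === '#' ) {
            node.id = parts[ i + 1 ];
        } else {
            classNames.push( parts[ i + 1 ] );
        }
    }
    if ( classNames.length ) {
        node.className = classNames.join( ' ' );
    }
    // Any other attributes come from the optional hash.
    if ( props ) {
        for ( key in props ) {
            node.setAttribute( key, props[ key ] );
        }
    }
    // Children: strings and numbers become text nodes (never parsed as HTML),
    // nulls are skipped so conditional expressions work, and nested arrays
    // (e.g. from items.map) are flattened.
    ( children || [] ).forEach( function append( child ) {
        if ( child == null ) { return; }
        if ( child instanceof Array ) {
            child.forEach( append );
        } else if ( typeof child === 'string' || typeof child === 'number' ) {
            node.appendChild( document.createTextNode( child ) );
        } else {
            node.appendChild( child );
        }
    });
    return node;
}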

Performance

While innerHTML used to be much faster than JS DOM methods, this no longer holds for modern browsers. Let’s have a look at a benchmark: http://jsperf.com/innerhtml-or-dom/4

Here we have four different methods of rendering the same bit of HTML. This is a real-world snippet, taken from a core part of our new webmail application (https://beta.fastmail.fm), with just a few class names changed. Let’s first look at the hand-optimised innerHTML method and hand-optimised DOM method. In Chrome the DOM version is over 50% faster than using innerHTML and in Safari it’s 45% faster. Firefox is just as fast with either, while Opera is marginally faster using innerHTML. IE is still twice as fast using innerHTML rather than DOM methods. Perhaps most interesting though is to look at mobile browser performance. On desktop, computers are fast enough these days that the performance differences are less of an issue. On mobile it’s crucial, and here we find that the DOM method is anywhere from 45% to 100% faster in mobile WebKit browsers, such as Safari on the iPhone and the default Android browser, and level with innerHTML on Opera Mobile.

A few things to note before we look at the real-world tests. Firstly, for maximum speed, the innerHTML method is assuming all text is already escaped; a very dangerous assumption. The DOM method on the other hand needs to make no such assumptions, as text is added to the DOM tree by creating text nodes. Since the text is never parsed as HTML, there is zero chance of accidentally injecting a malicious script tag. Secondly, if you need a reference to any of the DOM nodes you’re creating (for example to save for updating later or to add event listeners), with the innerHTML method you must query the DOM after you’ve constructed it. With direct DOM construction, you already have the node reference; you just save it as you create it.

These hand-optimised functions are fast, but unmaintainable and a pain to write. Let’s move on to something we would use on a real website.

Handlebars is a popular JS templating language, and claims to be one of the fastest around. It produces a string for use with innerHTML to construct the DOM elements. Let’s compare that to the JS declarative approach I outlined above (which I’m going to call Sugared DOM). Compared to the raw methods, the Sugared DOM was more-or-less equal in performance to the hand-optimised innerHTML in Chrome and Safari, even on the iPhone. It’s equal to or faster than Handlebars templates (sometimes by a significant margin) in all browsers other than IE, and crucially on mobile browsers it’s anywhere from 50% to 100% faster. Note too that the initial compilation time for Handlebars templates is not included in these benchmarks.

Conclusion

On almost all modern browsers the Sugared DOM method is faster than normal templates, even when ignoring the compile-time cost the latter have. There are other benefits as well:

  • Easy to debug (the template declaration is the code).
  • The sugar code is much smaller than any decent templating library.
  • No need to query the DOM, as you can just save references to nodes you’ll need later as you create them. This is faster and may remove the need for a whole JS library you currently use (like Sizzle).
  • No escaping worries; zero chance of XSS bugs. When you include a string in the declaration it is explicitly set as a text node, so is never parsed as HTML. <script> tags are harmless!
  • No extraneous white-space text nodes. White space between block-level nodes in HTML does not affect the rendering, but it does add extra nodes to the DOM. These can be a pain when you’re manipulating the tree later (the firstChild property may not return what you expect) and they increase the memory usage of the page.
  • As it’s pure JS, the templates can be easily included inline as part of view classes that also handle the behaviour of the view, or kept in separate files.
  • JSHint will validate your syntax; much easier than tracking down syntax errors from a template’s compiler.
  • Flexibility to use the full power of JS; easily call other functions to generate parts of your DOM tree, localise a string, or do whatever else you like.

What are the downsides? Well, it’s slightly slower in Internet Explorer (although still plenty fast enough in real world use) and the difference in syntax to HTML may take a little time to become accustomed to, especially if templates are written by designers rather than coders (then again every template introduces its own syntax, so I’m not sure there’s much difference here). And, err, I think that’s about it.

It’s time to ditch HTML based templates. Embrace the DOM, and enjoy your powerful, fast and readable new way to render pages.

Written by Neil Jenkins


New File Storage backend

You may notice that the Files screen loads quite a lot faster now, particularly if you have many folders.

The File Storage backend has been in need of an overhaul for a long time.  In order to have reliable cache expiry, it was quite single-threaded.  We have been throwing around various ideas about how to make those changes for a while, and today I finally set a full day aside, ignoring all other issues, and tested a few things.

One major thing has changed since I first wrote the VFS code 6 years ago.  We’re not constrained by memory for metadata any more.  The VFS had small limits set to avoid blowing out the memory on an individual web process.  Well.  Our smallest web server has 24Gb RAM now.  The newer ones have 48.  The smallest DB server has 64Gb RAM.  There’s no point in caching hot data to disk, because it will be in memory anyway.

So the eventual change was to throw away all the caching layers except one very temporary in-memory one.  There were one disk (tmpfs) layer and two in-memory layers of caching before, so it probably actually saves memory anyway.

The code was also very general, which is fine – but a couple of carefully thought out queries later, I could make one DB fetch to get the full directory tree, plus metadata, and pre-populate the in-memory cache with the fields it was about to ask for.  This, again, is much more efficient than pulling the data from a local cache and checking it for staleness.

The end result – faster response, simpler code, and a few bugs (particularly with long-lived FTP connections) cleared up.

I also backported all the changes to the oldweb interface, so attachments on the compose screen still work, and the Files screen there still works.

The take-home lesson from all this: keep it simple, stupid.  The caching complexity isn’t needed any more, if it ever was, and the simpler architecture will help.  I didn’t even have to make any DB schema changes (except dropping a couple of no-longer-used cache management tables).

There should be no user-visible changes from any of this.  The APIs are all identical for our application layers: webdav, ftp, websites and the Files screen are all the same.


Building the new AJAX mail UI part 1: Instant notifications of new emails via eventsource/server-sent events

With the release of the new AJAX user interface into testing on the Fastmail beta server, we decided that it might be interesting to talk about the technology that has gone into making the new interface work. This post is the first of a series of technical posts we plan to do over the next few months, documenting some of the interesting work and technologies we’ve used to power the new interface. Regular users can skip these posts, but we hope technical users find them interesting.

We’re starting the series by looking at how we push instant notifications of new email from the server to the web application running in your browser. The communication mechanism we are using is the native eventsource/server-sent events object. Our reasons for choosing this were threefold:

  1. It has slightly broader browser support than websockets (eventsource vs websockets)
  2. We already had a well defined JSON RPC API, using XmlHttpRequest objects to request data from the server, so the only requirement we had was for notifications about new data, which is exactly what eventsource was designed for
  3. For browsers that don’t support a native eventsource object, we could fall back to emulating it closely enough without too much extra code (more below), so we need only maintain one solution.

We’re using native eventsource support in Opera 11+, Chrome 6+, Safari 5+ and Firefox 6+. For older Firefox versions, the native object is simulated using an XmlHttpRequest object; Firefox allows you to read data as it is streaming. Internet Explorer unfortunately doesn’t, and whilst there are ways of doing push using script tags in a continually loading iframe, they felt hacky and less robust, so we just went with a long polling solution there for now. It uses the same code as the older-Firefox eventsource simulation object; the only difference is that the server has to close the connection after each event is pushed, and the client then re-establishes a new connection immediately. The effect is the same, it’s just a little less efficient.
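As a rough sketch of the emulation idea (not our actual simulation class, and the parsing function here is hypothetical), the Firefox fallback boils down to watching responseText grow as the response streams in:

var xhr = new XMLHttpRequest(),
    seen = 0;
xhr.open( 'GET', '/events/', true );
xhr.onreadystatechange = function () {
    // readyState 3 fires repeatedly in Firefox as data arrives;
    // readyState 4 means the server has closed the connection.
    if ( xhr.readyState >= 3 && xhr.responseText.length > seen ) {
        var chunk = xhr.responseText.slice( seen );
        seen = xhr.responseText.length;
        parseEventSourceChunk( chunk ); // hypothetical: split out "data:" lines
    }
};
xhr.send();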

Once you have an eventsource object, be it native or simulated, using it for push notifications in the browser is easy; just point it at the right URL, then wait for events to be fired on the object as data is pushed. In the case of mail, we just send a ‘something has changed’ notification. Whenever a new notification arrives, we invalidate the cache and refresh the currently displayed view, fetching the new email.
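Client side, the basic pattern looks roughly like this (the URL and the cache/view objects are illustrative, not our actual code):

var source = new EventSource( '/events/' );
source.onmessage = function ( event ) {
    // The server only sends a "something has changed" hint, so any message
    // simply invalidates the local cache and refreshes the current view.
    mailCache.invalidate();   // hypothetical cache object
    mailboxView.refresh();    // hypothetical view object
};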

On the server side, the event push implementation had a few requirements and a few quirks to work with our existing infrastructure.

Because eventsource connections are long lived, we need to use a system that can scale to a potentially very large number of simultaneous open connections. We already use nginx on our front end servers for http, imap and pop proxying. nginx uses a small process pool with a non-blocking event model and epoll on Linux, so it can scale to a very large number of simultaneous connections. We regularly see over 30,000 simultaneous http, imap and pop connections to a frontend machine (mostly SSL connections), with less than 1/10th of total CPU being used.

However, with a large number of client connections to nginx, we’d still have to proxy them to some backend process that could handle the large number of simultaneous connections. Fortunately, there is an alternative event based approach.

After a little bit of searching, we found a third party push stream module for nginx that was nearly compatible with the W3C eventsource specification. We contacted the author, and thankfully he was willing to make the changes required to make it fully compatible with the eventsource spec and incorporate those changes back into the master version. Thanks Wandenberg Peixoto!

Rather than proxying a connection, the module accepts a connection, holds it open, and connects it to an internal subscriber "channel". You can then use POST requests to the matching publisher URL channel to send messages to the subscriber, and the messages will be sent to the client over the open connection.

This means you don’t have to hold lots of internal network proxy connections open and deal with that scaling, instead you just have to send POST requests to nginx when an "event" occurs. This is done via a backend process that listens for events from cyrus (our IMAP server), such as when new emails are delivered to a mailbox, and (longer term) when any change is made to a mailbox.

Two other small issues also need to be dealt with. The first is that only logged-in users should be able to connect to an eventsource channel. The second is that we have two separate frontend servers, and clients connect randomly to one or the other because each hostname resolves to two IP addresses, so the backend needs to send its POST requests to the particular frontend nginx server the user is connected to.

We do the first by accepting the client connection, proxying to a backend mod_perl server which does the standard session and cookie authentication, and then use nginx’s internal X-Accel-Redirect mechanism to do an internal redirect that hooks the connection to the correct subscriber channel. For the second, we add a "X-Frontend" header to each proxied request, so that the mod_perl backend knows which server the client is connected to.

The stripped down version of the nginx configuration looks like this:

    # clients connect to this URL to receive events
    location ^~ /events/ {
      # proxy to backend, it'll do authentication and X-Accel-Redirect
      # to /subchannel/ if user is authenticated, or return error otherwise
      proxy_set_header   X-Frontend   frontend1;
      proxy_pass         http://backend/events/;
    }
    location ^~ /subchannel/ {
      internal;
      push_stream_subscriber;
      push_stream_eventsource_support on;
      push_stream_content_type "text/event-stream; charset=utf-8";
    }
    # location we POST to from backend to push events to subscribers
    location ^~ /pubchannel/ {
      push_stream_publisher;
      # prevent anybody but us from publishing
      allow   10.0.0.0/8;
      deny    all;
    }

Putting the whole process together, the steps are as follows:

  1. Client connects to https://example.com/events/
  2. Request is proxied to a mod_perl server
  3. The mod_perl server does the usual session and user authentication
  4. If not successful, an error is returned, otherwise we continue
  5. The mod_perl server generates a channel number based on the user and session key
  6. It then sends a POST to the nginx process (picking the right one based on the X-Frontend header) to create a new channel
  7. It then returns an X-Accel-Redirect response to nginx which tells nginx to internally redirect and connect the client to the subscriber channel
  8. It then contacts an event pusher daemon on the users backend IMAP server to let it know that the user is now waiting for events. It tells the daemon the user, the channel id, and the frontend server. After doing that, the mod_perl request is complete and the process is free to service other requests
  9. On the backend IMAP server, the pusher daemon now waits for events from cyrus, and filters out events for that user
  10. When an event is received, it sends a POST request to the frontend server to push the event over the eventsource connection to the client
  11. One of the things the nginx module returns in response to the POST request is a "number of active subscribers" value. This should be 1, but if it drops to 0, we know that the client has dropped its connection, so at that point we stop monitoring events for that channel and clean up internally so we don’t push any more events for that user and channel. The nginx push stream module automatically does the same cleanup on the frontend.
  12. If a client drops a connection and re-connects (in the same login session), it’ll get the same channel id. This avoids potentially creating lots of channels

In the future, we will be pushing events when any mailbox change is made, not just on new email delivery (e.g. a change made in an IMAP client, a mobile client, or another web login session). We don’t currently do this because we need to filter out notifications due to actions made by the same client; since it already knows about these, invalidating the cache would be very inefficient.

In general this all works as expected in all supported browsers and is really very easy to use. We have however come across a few issues to do with re-establishing lost connections. For example, when the computer goes to sleep then wakes up, the connection will have probably been lost. Opera has a bug in that it doesn’t realise this and keeps showing that the connection is OPEN (in readyState 1).

We’ve also found a potential related issue with the spec itself: "Any other HTTP response code not listed here, and any network error that prevents the HTTP connection from being established in the first place (e.g. DNS errors), must cause the user agent to fail the connection". This means that if you lose internet connection (for example pass through a tunnel on the train), the eventsource will try to reconnect, find there’s no network and fail permanently. It will not make any further attempts to connect to the server once a network connection is found again. This same problem can cause a race condition when waking a computer from sleep as it often takes a few seconds to re-establish the internet connection. If the browser tries to re-establish the eventsource connection before the network is up, it will therefore permanently fail.

This spec problem can be worked around by observing the error event. If the readyState property is now CLOSED (in readyState 2), we set a 30 second timeout. When this fires, we create a new eventsource object to replace the old one (you can’t reuse them) which will then try connecting again; essentially this is manually recreating the reconnect behaviour.
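In code, the workaround looks something like this (illustrative names, not our production code; handleEvent is assumed to exist elsewhere):

var source = null;
function connect() {
    source = new EventSource( '/events/' );
    source.onmessage = handleEvent;
    source.onerror = function () {
        if ( source.readyState === EventSource.CLOSED ) {
            // The browser won't retry by itself after a network error, so
            // wait 30 seconds and build a replacement object ourselves
            // (EventSource objects can't be reused).
            setTimeout( connect, 30000 );
        }
    };
}
connect();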

The Opera bug in not detecting it’s lost a connection after waking from sleep can be fixed by detecting when the computer has been asleep and manually re-establishing the connection, even if it’s apparently open. To do this, we set a timeout for say 60s, then when it fires we compare the timestamp with when the timeout was set. If the difference is greater than (say) 65s, it’s probable the computer has been asleep (thus delaying the timeout’s firing), and so we again create a new eventsource object to replace the old one.
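Continuing the sketch above, the sleep detection is just a timer that checks how late it fired:

var lastTick = Date.now();
setInterval( function () {
    var now = Date.now();
    if ( now - lastTick > 65000 ) {
        // The timer fired much later than scheduled, so the machine was
        // probably asleep; Opera may still report the connection as OPEN,
        // so rebuild it regardless.
        source.close();
        connect();
    }
    lastTick = now;
}, 60000 );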

Lastly, it was reasonably straightforward to implement a fully compatible eventsource implementation in Firefox using just a normal XmlHttpRequest object, thereby making this feature work in FF3.5+ (we haven’t tested further back, but it may work in earlier versions too). The only difference is that the browser can’t release from memory any of the data received over the eventsource connection until the connection is closed (and these connections can be really long lived), as you can always access it all through the XHR responseText property. However, we don’t know whether the other browsers make this optimisation with their native eventsource implementations, and given the data pushed through the eventsource connection is normally quite small, this certainly isn’t an issue in practice.

This means we support Opera/Firefox/Chrome/Safari with the same server implementation. To add Internet Explorer to the mix we use a long polling approach. To make the server support long polling all we do is make IE set a header on an XmlHttpRequest connection (we use X-Long-Poll: Yes), and if the server sees that header it closes the connection after every event is pushed; other than that it’s exactly the same. This also means IE can share FF’s eventsource emulation class with minimal changes.
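A sketch of the long-poll fallback (again with illustrative names): the only IE-specific difference is the extra request header, plus reconnecting whenever the server closes the connection:

function longPoll( url, onData ) {
    var xhr = new XMLHttpRequest();
    xhr.open( 'GET', url, true );
    // Tell the server to close the connection after each event it pushes.
    xhr.setRequestHeader( 'X-Long-Poll', 'Yes' );
    xhr.onreadystatechange = function () {
        if ( xhr.readyState === 4 ) {
            if ( xhr.status === 200 && xhr.responseText ) {
                onData( xhr.responseText );
            }
            // The server closed the connection after the event; poll again.
            longPoll( url, onData );
        }
    };
    xhr.send();
}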

The instant notification of new emails is one of the core features of the new interface that allows the blurring of boundaries between traditional email clients and webmail clients. Making this feature work, and work in a way that we knew was scalable going forward, was an important requirement for the new interface. We’ve achieved this with a straightforward client solution, and in a way that elegantly integrates with our existing backend infrastructure.


Change of default MX records for domains

This post contains some technical information mostly useful for people that host email for their own domain at FastMail.

TL;DR: If you host email for your domain at FastMail, but host the DNS for your domain at an external DNS provider, we recommend you login to your DNS provider and change the two MX records for your domain from in[12].smtp.messagingengine.com to in[12]-smtp.messagingengine.com. i.e. replace the first dot (‘.’) with a dash (‘-’)
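In zone-file terms, the new records look something like this (the priority values here are just an example – what matters is the hostnames):

    example.com.    IN    MX    10    in1-smtp.messagingengine.com.
    example.com.    IN    MX    20    in2-smtp.messagingengine.com.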

If you host email for your domain at FastMail, and you host the DNS for your domain at FastMail, no change is required, it’s all automatically been done.

More details: For many years, the default MX records for domains hosted at FastMail have been in1.smtp.messagingengine.com and in2.smtp.messagingengine.com.

However it turns out there’s a small problem with this. The hostnames in[12].smtp.messagingengine.com don’t match the wildcard *.messagingengine.com SSL certificate we have (similar to this previous issue). So if a remote system uses opportunistic TLS encryption to send email to us, the connection will be encrypted, but it may be reported as "Untrusted" because the certificate doesn’t match.

This isn’t disastrous, but it is annoying and exposes a potential man-in-the-middle attack.

So we’ve gone and changed the DNS MX records for all domains hosted at FastMail to default to in1-smtp.messagingengine.com and in2-smtp.messagingengine.com.

For users that use us to host DNS for their domains, no change is required on your behalf, all of this has been automatically updated.

For users that use an external DNS provider, we recommend you update the MX records for your domains at your DNS hosting provider. We’ll continue to support the old in[12].smtp values for some time and alert users if/when we discontinue it, but the sooner you make the change, the better it is for the secure transmission of email to your domain.

We’ve updated our documentation to reflect these new values.


iOS 5 and mail application access patterns

This post contains some observations about how the mail application in iOS 5 appears to interact with IMAP servers. We’re posting this mostly as a reference for people interested.

In iOS settings, you can choose a "fetch interval", which is:

  • Manually (never fetches automatically)
  • Every 15 minutes
  • Every 30 minutes
  • Every hour
  • Push (only shown on servers supporting it, which I believe is currently only Exchange servers or Yahoo Mail)

If you choose "Manually", then there is no persistent connection once you exit the mail app.

If you choose any other interval, then a background daemon holds a persistent connection to the mail server. We don’t know exactly why they hold the connection open, and we’re not sure if it leaves the connection in IDLE state to get updates pushed to it. The main advantage of holding it open is probably skipping the overhead of re-authenticating/handshaking, but there’s also no good reason to explicitly close the connection after every fetch given that IMAP is supposed to be long-lived.

If you have these fetch intervals set, and then break your network connection, then iOS will attempt to reconnect the next time it wants to fetch your mail again.

Note that the intervals listed appear to be only approximate. iOS appears to be smart about batching requests together, so it gets as much work done as it can while the phone is awake or the network connection is "up". Also, opening the mail app, or opening a folder in the mail app, will often trigger a refresh too.


"View" link removed from attachments on message read screen in "Public Terminal" mode

When you enable the "Public Terminal" option on the login screen, Fastmail sets the "no-cache" and "no-store" cache control headers on every page. This means that browsers should not store a copy of the pages you visit (e.g. emails you read) to their local disk. Even after you logout of your session and leave the computer, if someone comes along and tries to view a page from the browser history, the browser should re-check with the server first, which of course will return "this user is now logged out, show the login page instead".
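Concretely, the header in question is the standard HTTP cache-control directive, along the lines of:

    Cache-Control: no-cache, no-store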

However, there is a problem with this whole setup related to attachments. When an email has an attachment, the content of the attachment might be in a form the browser doesn’t understand (e.g. a Microsoft Word document). In that case, the browser has to save a copy of the attachment to the local disk, and then launch Microsoft Word to open the file.

Now in the case of the "View" link, the saving to disk would be done automatically into a temporary file storage area. However in IE, if you try and download an SSL document with the no-cache or no-store attributes set, IE will explicitly not save the file to disk, and then when it tries to launch Microsoft Word to read the file, you’ll get a "file does not exist" error or the like.

http://support.microsoft.com/kb/812935
http://support.microsoft.com/kb/815313

For other browsers, it appears they work around this problem by actually saving a copy to disk in the temporary storage area, but they delete the file when you close the browser (at least that’s what Firefox did when I tested). That still potentially does leave the file on disk for some time.

To ensure the best privacy possible, while still allowing people to view attached documents in "Public Terminal" mode, we’ve decided to do the following:

  • When you login with the "Public Terminal" option, we’ve removed the "View" link next to attachments. This solves two problems: the unexpected "file not found" error in IE, and the privacy concern of storing attachments to disk in the temporary file area of other browsers
  • We’ve left the "View" link next to image attachments, because the web browser can display images itself, without launching a separate program, so it can obey the "no-cache"/"no-store" directives
  • With the "Download" link (which automatically brings up a "Save as…" dialog box), we’ve removed the "no-cache" and "no-store" settings, which means that IE will let you download it and save it somewhere so you can open it to view the document.

We like this solution because it makes things clearer to the user. In "Public Terminal" mode, if you want to view an attachment, you have to download it first, explicitly save it somewhere and then view it. The alternative approach of letting the browser do it either fails (IE), or causes an auto-save of the file to a temporary area which leaves it temporarily cached on the machine when the user doesn’t expect it.


TCP keepalive, iOS 5 and NAT routers

This post contains some very technical information. For users just interested in the summary:

If over the next week you experience an increase in frozen, non-responding or broken IMAP connections, please contact our support team (use the "Support" link at the top of the http://www.fastmail.fm homepage) with details. Please make sure you include your operating system, email software, how you connect to the internet, and what modem/router/network connection you use in your report.

The long story: The IMAP protocol is designed as a long lived connection protocol. That is, your email client connects from your computer to the server, and stays connected for as long as possible.

In many cases, the connection remains open, but completely idle for extended periods of time while your email client is running but you are doing other things.

In general while a connection is idle, no data at all is sent between the server and the client, but they both know the connection still exists, so as soon as data is available on one side, it can send it to the other just fine.

There is a problem in some cases though. If you have a home modem and wireless network, then you are usually using a system called NAT that allows multiple devices on your wireless network to connect to the internet through one connection. For NAT to work, your modem/router must keep a mapping for every connection from any device inside your network to any server on the internet.

The problem is some modems/routers have very poor NAT implementations that "forget" the NAT mapping for any connection that’s been idle for 5 minutes or more (some appear to be 10 minutes or more). What this means is that if an IMAP connection remains idle with no communication for 5 minutes, then the connection is broken.

In itself this wouldn’t be so bad, but the way the connection is broken is that rather than telling the client "this connection has been closed", packets from the client or server just disappear which causes some nasty user visible behaviour.

The effect is that if you leave your email client idle for 5 minutes and the NAT mapping is lost, then the next time you try to do something with the client (e.g. read or move an email), the client tries to send the appropriate command to the server. The TCP packets that contain the command never arrive at the server, but no RST packets are sent back to tell the client that there’s any problem with the connection either – the packets just disappear. So the local computer tries to send again after a timeout period, and again a few more times, until usually about 30 seconds later it finally gives up, marks the connection as dead, and passes that information back up to the email client, which shows some "connection was dropped by the server" type message.

From a user perspective, it’s a really annoying failure mode that looks like a problem with our server, even though it’s really because of a poor implementation of NAT in their modem.

However, there is a workaround for this. At the TCP connection level, there’s a feature called keepalive that allows the operating system to send regular "is this connection still open?" type packets back and forth between the server and the client. By default keepalive isn’t turned on for connections, but it is possible to turn it on via a socket option. nginx, our frontend IMAP proxy, allows you to turn this on via a so_keepalive configuration option.

However even after you’ve enabled this option, the default time between keepalive "ping" packets is 2 hours. Fortunately again, there’s a Linux kernel tuneable net.ipv4.tcp_keepalive_time that lets you control this value.

Lowering this value to 4 minutes causes TCP keepalive packets to be sent over open but idle IMAP connections from the server to the client every 4 minutes. The packets themselves don’t contain any data, but what they do do is cause any existing NAT mapping to be marked as "alive" on the user’s modem/router. So poor routers with NAT mappings that would normally time out after 5 minutes of inactivity are kept alive, the user doesn’t see the nasty broken connection problem described above, and there’s no visible downside for the user either.
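For reference, the two knobs involved look roughly like this (values as described above; the exact config layout is illustrative):

    # nginx mail proxy: enable SO_KEEPALIVE on client IMAP connections
    mail {
        so_keepalive on;
        # ... rest of the mail proxy configuration ...
    }

    # /etc/sysctl.conf: start sending keepalive probes after 4 idle minutes
    net.ipv4.tcp_keepalive_time = 240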

So this is how things have been for the last 4-5 years, which has worked great.

Unfortunately, there’s a new and recent problem that has now appeared.

iOS 5 now uses long lived persistent IMAP connections (apparently previous versions only used short lived connections). The problem is that our ping packets every 4 minutes mean that the device (iPhone/iPad/iPod) is "woken up" every 4 minutes as well. This means the device never goes into a deeper sleep mode, which causes significantly more battery drain when you set up a connection to the Fastmail IMAP server on iOS 5 devices.

Given the rapid increase in use of mobile devices like iPhones, and the big difference in battery life it can apparently cause, this is a significant issue.

So we’ve decided to re-visit the need for enabling so_keepalive in the first place. Given the original reason was due to poor NAT routers with short NAT table timeouts, that was definitely an observed problem a few years back, but we’re not sure how much of a problem it is now. It’s possible that the vast majority of modems/routers available in the last few years have much better NAT implementations. Unfortunately there’s no way to easily test this, short of actually disabling keepalive, and waiting for users to report issues.

So we’ve done that now on mail.messagingengine.com, and we’ll see over the next week what sort of reports we get. Depending on the number, there’s a few options we have:

  1. If there’s lots of problem reports, we’d re-enable keepalive by default, but setup an alternate server name like mail-mobile.messagingengine.com that has keepalive disabled, and tell mobile users to use that server name instead. The problem with this is many devices now have auto configuration systems enabled, so users don’t even have to enter a server name, so we’d have to work out how to get that auto configuration to use a different server name
  2. If there’s not many problem reports, we’d leave keepalive off by default, but setup an alternative server name like mail-keepalive.messagingengine.com that has keepalive enabled, and for users that report connection "freezing" problems, we’d tell them to switch to using that server name instead
  3. Ideally, we’d detect what sort of client was connecting, and turn on/off keepalive as needed. This might be possible using software like p0f, but integrating that with nginx would require a bit of work, and it still leaves you with the problem of an iPhone user who is in their office/home all day on a wireless network with a poor NAT router: would they prefer the longer battery life, or the better connectivity experience?

I’ll update this post in a week or two when we have some more data.


DKIM signing outgoing email with From address domain

DKIM is an email authentication standard that allows senders of email to sign an email with a particular domain, and for receivers of the email to confirm that the email was signed by that domain and hasn’t been altered. There’s some more information about how DKIM is useful in this previous blog post. We’ve been DKIM signing all email sent via FastMail for the last 2 years.

In the original design of DKIM, the domain that signed the email had no particular relationship to the domain in the From address of the email. This was particularly useful for large email providers like us. We host tens of thousands of domains, but would sign all email with just our "generic" messagingengine.com domain.

However this state of affairs is beginning to change. Standards like Author Domain Signing Practices explicitly link the domain of the email address in the From header of the email to the DKIM signing domain. Also recently Gmail has changed their web interface so that email sent with a From domain that’s different to the DKIM signing domain may be shown with an extra "via messagingengine.com" notice next to the sender name.

So we’ve now rolled out new code that changes how all emails sent through FastMail are DKIM signed. We always DKIM sign with messagingengine.com (as we always have), but we also now sign with a separate key for the domain used in the From address header where possible (see below for more details).

For most users, there should be no noticeable difference. For users that use virtual domains at FastMail, or have their own domain in a family/business, then when you send via FastMail, Gmail should no longer show "via messagingengine.com" on the received message (if your DNS is correctly setup, see below for more details).

For users that host their DNS with FastMail (eg. nameservers for your domain are ns1.messagingengine.com and ns2.messagingengine.com), this will "just work". We’ve generated DKIM public/private keys for all domains in our database, and automatically do so when new domains are added. We also publish the public keys for all domains via ns1.messagingengine.com/ns2.messagingengine.com.

In general if you can, we highly recommend hosting your DNS with us. For most cases the default settings we provide "just work", and if you need to customise your DNS, our control panel allows you to add any records of any type, without the arbitrary limitations many other DNS providers have.

However for users that host DNS for their domains externally and want to continue to do so, you’ll have to explicitly add the DKIM public key using your domain hoster’s DNS management interface. Unfortunately there are hundreds of different DNS providers out there, so we can’t give specific directions for each one.

The general steps are:

  1. Login to your FastMail account and go to Options –> Virtual Domains (or Manage –> Domains for a family/business account).
  2. Scroll to the bottom, and you’ll see a new "DKIM signing keys" section. For each domain you have, you’ll see a DKIM public key.
  3. Login to your DNS provider, and create a new TXT record for each domain listed and use the value in the "Public Key" column as the TXT record data to publish.

Important: Note that you have to add the TXT record for the domain name shown in the DKIM signing keys section, which will be mesmtp._domainkey.yourdomain.com. Do not add it for the base domain name yourdomain.com – that won’t work.
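For reference, the published record ends up looking something like this (the key below is a truncated placeholder – use the exact value shown in the DKIM signing keys section):

    mesmtp._domainkey.yourdomain.com.    IN    TXT    "v=DKIM1; k=rsa; p=MIGfMA0GCSqGSIb3...IDAQAB"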

That should be it.

Note that initially each domain is marked as DKIM disabled (Enabled column = [ ]). While a domain is DKIM disabled, we won’t sign any sent emails. This is to avoid DKIM signing failures when the receiving side tries to lookup the public signature and fails to find it. We regularly check each domain to see if the correct public key TXT record is being published. If it is, we mark the domain in our database as "DKIM enabled" (Enabled column = [*]), and then begin signing sent emails.

So after you setup the records at your DNS provider, you should wait a few hours, then check this table again to see that the domain is now correctly DKIM enabled.

Some other technical notes:

There’s currently no way to change the public/private key used to sign emails or upload new ones. We always generate our own key pair for each domain and use the DKIM selector "mesmtp" to sign emails. This shouldn’t be a problem. If you’re transitioning from another provider to FastMail, you can use our custom DNS to publish the DKIM record of the previous provider with its selector as well as our own during the transition. Vice-versa for transitioning away from FastMail. The only other reason to change the selector would be if the private key was compromised, which should never happen as it’s stored securely in FastMail’s systems.


New XMPP/Jabber server

This is a technical post. Fastmail users subscribed to receive email updates from the Fastmail blog can ignore this post if they are not interested.

We’ve just replaced the XMPP/Jabber server we use for our chat service. Previously we had been using djabberd. While this worked well for us for the last few years, unfortunately it hasn’t been receiving much development recently. This means many newer XMPP extensions aren’t available.

We looked at a number of alternate server options: Tigase, Prosody, ejabberd, OpenFire. In the end, we settled on ejabberd because of its relative maturity, good administration documentation, its widespread use in existing large installations, the active development community and its support for multiple domains (in the newest version).

Fortunately our existing architecture separated the XMPP/Jabber server from the backend storage details of our system (eg. user lists, user rosters, chat logging, etc) with an HTTP JSON API. Because of this, it was fairly straightforward to completely remove djabberd, write the equivalent interfacing components for ejabberd and slot that into place. A perfect two month piece of work for our summer intern student Samuel Wejeus. Thanks Samuel!

That work has now been done, and yesterday we completely removed djabberd and replaced it with ejabberd. For users that use our chat service, there shouldn’t be any noticeable difference at this point – everything should just continue to work as it did – but with this new base we should be able to add more features in the future.
