Goodbye old.fastmail.fm

Summary

In early 2009 we rolled out an updated web interface to all users. This is the interface you currently see when you log in at http://www.fastmail.fm, as most users do.

To give users time to transition, we continued to let people log in to the old pre-2009 interface if they wanted to by going to http://old.fastmail.fm. We’ve supported this for the last 3 years, but as only a few users are still using the old interface, we decided to shut it down. For the last 3 months a prominent message noted this each time you logged in to http://old.fastmail.fm, and we’ve now fully shut it down.

Important point: This only affects users who were explicitly going to http://old.fastmail.fm to log in. Users who use the regular interface at http://www.fastmail.fm (the vast majority) are completely unaffected.

The description below is a detailed history of the old interface and includes technical details about how much things have changed since 2009 and why maintaining http://old.fastmail.fm is no longer feasible.

Goodbye old.fastmail.fm

It’s been a long road, but the old FastMail web interface has finally reached the end of its life.

You can always access your email at https://www.fastmail.fm/ or try our beta site at https://beta.fastmail.fm/.

If you want to stop reading here, the two things you need to know are: there were security concerns, and it was about to break anyway. Those are two good reasons why now is the right time to shut the old interface down.

If you want to know some of the technical background and the technologies that we have moved through over the years, read on!

A new infrastructure

Looking back through our version control history, my very first commit was on 2004-09-20! The original web interface commits are from early 2000, though it was started before then.

We switched version control systems at some point during 2005 from CVS to Subversion, which made branching much easier – but imported all our history, so we can still look back at those early changes.

One of our major branches was a huge infrastructure switch from Red Hat 7.3 to Debian 3.1 (sarge), which we worked on throughout the second half of 2005. This was all merged back into the main branch, and we converted everything over in early 2006.

http://blog.fastmail.fm/2006/01/02/one-web-server-now-running-new-infrastructure-code/

We upgraded to Debian 4.0 (etch) during May 2007, soon after it came out.

A new interface

In 2008, Neil Jenkins (who is so awesome Opera hired him even before they had decided if they were going to buy FastMail) worked as a contractor over the summer to design a more modern web interface which would take advantage of the new features in web browsers.

We branched the code, and it diverged quite considerably. Features like cross-folder searching required major internal data structure changes, and the new interface had hooks all through the code. Our plan was always to retire the old code eventually.

We released the new interface to beta at the end of 2008, and rolled it out to everyone in 2009.

http://blog.fastmail.fm/2008/11/27/help-beta-test-new-web-interface/

http://blog.fastmail.fm/2009/02/17/new-interface-being-rolled-out/

An incompatible upgrade

Then in 2009, Debian 5.0 (lenny) came out. Lenny shipped with apache2 and mod_perl2, and no longer supported apache 1.3 or mod_perl version 1. We put quite a lot of work into porting our codebase forwards to apache2. Since "old" was going away soon, we didn’t duplicate the work there.

So we installed the new web servers on lenny, and kept a couple of servers called "oldweb" still running etch. It’s amusing now to remember all the hoops I jumped through to allow automatic installation of either system.

About this time we also had machines with enough memory that 32 bit address spaces were wasteful, particularly on the IMAP servers. We moved to running 64 bit kernels with 32 bit userland.

New hardware

In 2010, the Opera sale happened. One of the early steps was to replace some of our aging hardware with equipment that was better understood and supported by the Opera sysadmin department. This meant a new bladecentre for the non-storage systems (including web).

For a little while I had two blades (redundancy!) running "oldweb" code. That’s a huge amount of very under-utilised resource.

And, to be honest, managing new blades with an ancient OS was a pain. Things didn’t work well. The configuration tools we built for the new hardware didn’t run on etch.

When we moved to Debian 6.0 (squeeze) and at the same time went fully 64 bit, it was time to do something about "old".

We also moved version control systems AGAIN in late 2010 – from subversion to git. The old web servers were left on subversion, because they weren’t getting much in the way of changes any more. One more "split" in how things were done.

Fully virtual

Rather than having to support "real hardware", I built an etch virtual machine. Everything else was running squeeze 64 bit, but we still had a full 32 bit etch install path just to support oldweb.

While all this was happening, there were occasional changes required to support changing database schemas, configuration mechanisms, and interaction with other parts of our system. At some point I just took a snapshot of the current tree and started a new git repository so we could archive the subversion server entirely.

Maintaining the virtual machines was a real pain though. They were run in the background on some of the web servers to free up the hardware for more demanding tasks. This meant changing the network interfaces to be bonding drivers, custom configuration, lots of pain. There were occasionally long outages as we changed things and then had to patch oldweb to catch up.

Worst of all, we were maintaining the ENTIRE stack – support daemons, log rotation, pop fetching…

“Old” lost features over time – we just couldn’t keep them working, so we ripped the code out. That particularly affected some of the more advanced configuration screens – and everything related to billing.

Single component

In the end the virtual machines were too much work. Our authentication system in particular had many changes under the hood, and it just wasn’t going to keep working. We had a couple of really bad problems with file storage, where we were sure that something “couldn’t happen”, but then it turned out “old” was still doing things differently: talking to the wrong databases, running the wrong queries. We seriously considered dropping old at that point, but I wanted to give it a bit longer.

So I built a chroot installation of etch on our web servers, and bind mounted the daemon sockets into the chroot. This allowed us to run just the web interface code itself on the old branch, while running everything else in the modern, managed, outside world. I built a custom init script which could set up all the necessary mountpoints (/proc, /dev, /var/run, even the tmpfs with mmapped caches was shared) – and forward ported more of the code.

This was built with debootstrap originally, but in the end it was getting unreliable even fetching etch packages, so I built a .tar.gz file with the filesystem for the chroot, and a fresh install just unpacked that. As we changed internal config systems, I kept “oldweb” up to date. A couple of commits every month.

So that brings us to today. An init script (apache-oldweb), a chroot environment with a snapshot of a Debian etch machine with apache 1.3 and mod_perl version 1 – running perl 5.8. Everything else is perl 5.10 or newer, so I even have to backport some idioms as I bring back the bits which it just can’t live without.

I have done basically all the "keeping old alive" for the past couple of years – for a smaller and smaller set of users who still log in there. Backporting everyone else’s changes as they impacted old.

And etch doesn’t have security support. Hasn’t for ages. Sure it’s in a chroot, but it still has access to everything.

The final straw

But there’s one thing which oldweb can’t survive. We are redesigning how our session management works. There are some great benefits – bookmarkable URLs, remote logout of stale sessions, reduction of password typing on annoying little smartphone keyboards.

Everything will change, and “old” would have just stopped working. It’s not worth the changes to make it work, particularly as the gap between the two systems grows over time.

Also, and even worse, the old interface is exposed to the wider internet – and it has full read/write access to the database and all emails. If there are security problems, all our users are at risk – not only those who use it directly.

It’s no longer safe, and it was going to break beyond easy repair in a few days anyway. It is time.

Goodbye oldweb


One step forward, two steps back

It’s been a really bad week for me: backing out two significant pieces of work. One was only released recently, but the other has been causing problems for an entire year, and I’m really sorry to those who have been putting up with them while we didn’t have the effort available to find the underlying cause.

PowerDNS

First the recent one. We switched our DNS servers from tinydns to PowerDNS last week. There were very good reasons for the switch: tinydns as-is doesn’t support IPv6, or DNSSEC, or zone transfers, or…

And the data file is built as a single giant database and synchronised to all our servers once per hour, so updates take some time to be made.

On the flip side – it’s rock solid! It’s served us well for years. So we put a lot of work into testing PowerDNS for the change. Unfortunately, it wasn’t enough. First SOA records were broken for subdomains, then DNS delegation didn’t work, and now that I’ve switched back, a problem with chat server aggregation has gone away, so it was probably doing the wrong thing there too!

Anyway – PowerDNS got backed out. The “pipe” backend that we were using just isn’t expressive enough, so we either need to find another way to do it, or find a different path forward. The good thing about PowerDNS is that it’s actively maintained, so we should be able to get somewhere here.

EJabberd

It’s much sadder to give up Ejabberd. Erlang is an interesting language, and the integration work was done by an intern last year – he did really good work. The hard bit was that we needed support for the many thousands of domains that our customers host with us. Ejabberd 2.x (the stable branch) just didn’t support it. Ejabberd 3.x was going to, but was still in alpha at the time. Looking at the development pace, I made the call to integrate with the Ejabberd 3.x alpha anyway. We did that.

But it’s been plagued with problems.  The chat logging service has been flaky, there have been “Malformed XML” disconnections which I suspect to be related to incorrect SSL renegotiations, but I haven’t been able to prove it.  I’ve spent far too much time looking at packet logs and trying to figure it out.

I’ve had long standing tickets about it, and kept saying “it’s getting better” – but seriously, upstream hasn’t made a single commit to ejabberd mainline since February this year.  They’re putting all their effort into the 2.x branch.

So I’m in the process of backporting our chat service to the DJabberd engine we used to use.  It’s not perfect either – it doesn’t have anywhere near the feature set that ejabberd has, and it’s not getting any more support.  The code is of OK quality, but it’s quite convoluted and written in many different styles which makes reading it tricky.  I’ve had to make two patches to get interoperability up to scratch with modern servers and support the multiple SSL certificates we now use.

It’s always sad to give up features, and to sideline hard work that you or others have done – but in the end we have been hurting customers by providing a sub-standard experience with chat.  So I’m hoping to put a line under that by the end of this week and be able to move on with good new things again.  At least a couple of us have some more Erlang experience now, and you never know when that might be useful.  It’s good just to understand different ways of thinking about code.


*.fastmail.fm certificate updated

The SSL certificate for *.fastmail.fm (that is, http://www.fastmail.fm and all other subdomains) was due to expire in a couple of months, so this morning we updated it to a new one.

Because it’s not something we do often, we don’t have a nice little script automating the task. As a result I forgot to properly add the CA chain to the certificate, so there was a period of about 30 minutes where some users with old browsers might have seen “invalid certificate” errors. That was fixed pretty quickly once we noticed, and everything should be fine now.

The next certificate to expire has a few months left on it. I’ll make sure this is nicely automated before we have to update that one!


Chat server updated

We’ve had a couple of persistent problems with our XMPP chat server since upgrading to Ejabberd (3.0 pre-release with patches) last year.

  1. “Malformed XML” errors.  We suspect these were due to bugs with handling SSL.
  2. Missing IP address and SSL information in the LoginLog.
  3. ChatLog backend died, and chat logging stopped.

We have our awesome intern Samuel back again this year, and though he’s working with a different department this time, I managed to grab some of his time to work with me on fixing these long-standing issues!

We think we’ve fixed number 1 – at least it hasn’t been seen since updating the upstream libraries.

We’re positive we’ve fixed number 2, by making sure the correct information is passed directly to the authentication functions rather than looking for it in the session object (which isn’t populated in time to be useful).

And as for number 3 – a restart fixed it.  I’m now monitoring the log files to look for what might cause it to die.  Unfortunately, by the time we looked at it the logs had already rotated away, so there was no evidence left!  I’ll make sure we catch the problem next time, if it happens again.

TL;DR – the chat server is updated and should be more stable and give better information than before.

 


A story of leaping seconds

I’ve been promising to blog more about some technical things about the FastMail/Opera infrastructure, and the recent leap second fiasco is a good place to point out not only some failures, but also the great things about our system which helped limit the damage.

Being on duty

First of all, it was somewhat random that I was on duty. We’ve recently switched to a more explicit “one person on primary duty” roster from the previous timezone and time-of-day based rules – partly so other people knew who to contact, and partly to bring some predictability to when you had to be responsible.

Friday 29th June was my last working day for nearly 3 weeks. My wife is on the other side of the world, and the kids are off school. I was supposed to be handing duty off to one of the Robs in Australia (we have two Robs on the roster – one Rob M and the other Rob N. This doesn’t cause any confusion because we’re really good at spelling). I had made some last minute changes (as you always do before going away) and wanted to keep an eye on them, so I decided to stay on duty.

Still, it was the goodbye party for a colleague that night, so I took the kids along to Friday Beer (one of the awesome things about working for Opera) and had quite a few drinks. Not really drunk, but not 100% sober either. Luckily, our systems take this into account… in theory.

The first crash

Fast forward to 3:30am, and I was sound asleep when two SMSes came in quick succession. Bleary-eyed, I stumbled out to the laptop to discover that one of our compute servers had frozen. Nothing on the console, no network connectivity, no response to keystrokes via the KVM. These machines are in New York, and I’m in Oslo, so kicking it in person is out of the question – and techs onsite aren’t going to see anything I can’t.

These are Dell M610 blades. Pretty chunky boxes. We have 6, split between two bladecentres. We can easily handle running on 3 – mail delivery just slows by a few milliseconds – so we have each one paired with one in the other bladecentre and two IP addresses shared between them. In normal mode, there’s one IP address per compute server. (Amongst the benefits of this setup: we can take down an entire bladecentre for maintenance in about 20 minutes, with no user-visible outage.)

Email delivery is hashed between those 6 machines so that the email for the same user goes to the same machine, giving much better caching of bayes databases and other per-user state. Everything gets pushed back to central servers on change as well, so we can afford to lose a compute server, no big deal.

But – the tool for moving those IP addresses, “utils/FailOver.pl”, didn’t support the ‘-K servername’ parameter, which means “assume that server is dead, and start up the services as if they had been moved off there”. It still tries to do a clean shutdown, but on failure it just keeps going. The idea is to have a tool which can be used while asleep, or drunk, or both…

The older “utils/FailOverCyrus.pl”, which is still used just for the backend mail servers, does support -K. It’s been needed before. The eventual goal is to merge these two tools into one, which supports everything. But other things keep being higher priority.

So – no tool. I read the code enough to remind myself how it actually works, and hand wrote SQL statements to perform the necessary magic on the database so that the IP could be started on the correct host.

I could have just bound the IP by hand as well. But I wanted to not confuse anything. Meanwhile I had rebooted the host, and everything looked fine, so I switched back to normal mode.

Then (still at 3:something am) I wrote -K into FailOver.pl. Took a couple of attempts and some testing to get it right, but it was working well before I wrote up the IncidentReport and went back to bed some time after 4.

The IncidentReport is very important – it tells all the OTHER admins what went wrong and what you did to fix it, plus anything interesting you saw along the way. It’s super useful if it’s not you dealing with this issue next time. It included the usage instructions for the new ‘-K’ option.

Incident 2 – another compute server and an imap server

But it was me again – 5:30am, the next blade over failed. Same problem, same no way to figure out what was wrong. At least this time I had a nice tool.

Being 5:30am, I just rebooted that blade and the other one with the same role in the same datacentre, figuring that the thing they had in common was bladecentre2 compute servers.

But while this was happening, I also lost an imap server. That’s a lot more serious. It causes user visible downtime – so it gets blogged.

From the imap server crash, I also got a screen capture, because blanking wasn’t enabled for the VGA attached KVM unit. Old server. This capture potentially implicated ntpd, but at the time I figured it was still well in advance of the leap second, so probably just a coincidence. I wasn’t even sure it was related, since the imap server was an IBM x3550, and the other incidents had been blades. Plus it was still the middle of the night and I was dopey.

Incident 3

Fast forward to 8:30am in the middle of breakfast with my very forgiving kids who had let me sleep in. Another blade, this time in the other blade centre. I took a bit longer over this one – decided the lack of kernel dumps was stupid, and I would get kdump set up. I already had all the bits in place from earlier experiments, so I configured up kdump and rolled it onto all our compute servers. Rob M (see, told you I could spell) came online during this, and we chatted about it and agreed that kdump was a good next step.

I also made sure every machine was running the latest 3.2.21 kernel build.

Then we sat, and waited, and waited. Watched pots never boil.  I also told our discussion forum what was going on, because some of our more technical users are interested.

Incident 4

One of the other IMAP servers, this time in Iceland!

We have a complete live-spare datacentre in Iceland. Eventually it will be a fully operational centre in its own right, but for now it’s running almost 100% in replica mode. There are a handful of internal users with email hosted there to test things. (Indeed, back at the start, the “urgent” change which meant I was still on call was to how we replicate file storage backends, so that adding a second copy in Iceland as production rather than backup didn’t lead to vast amounts of data being copied through the VPN to New York and back while I was away.)

So this was a Dell R510 – one of our latest “IMAP toasters”: 12 x 2TB SATA drives for data (in 5 x RAID1 plus 2 hot spares) holding 20 x 500GB Cyrus data stores, 2 x SSD in RAID1 for super-hot Cyrus metadata, 2 x quad-core processors, and 48GB RAM. For about US $12k, these babies can run hundreds of thousands of users comfortably. They’re our current sweet spot for price/performance.

No console of course, and no kdump either. I hadn’t set up the Iceland servers with kdump support yet. Doh.

One thing I did do was create an alias. Most of our really commonly used commands have super short aliases. utils/FailOver.pl is now “fo”. There are plenty of good reasons for this.

Incident 5

One reason is phones. This was Saturday June 30th, and I’m taking the kids to a friend’s cabin for a few days during the holidays. They needed new bathers, and I had promised to take them shopping. Knowing it could be hours until the next crash, I set up the short alias, checked I could use it from my phone, and off we went. I use Port Knocker + Connect Bot from my Android phone to get a secure connection into our production systems while I’m on the move.

So incident 5 was another blade failing – about 5:30pm Oslo time, while I was in the middle of a game shop with the kids. Great I thought, kdump will get it. I ran the “fo” command from my phone, 20 seconds of hunt and peck on a touch keyboard, and waited for the reboot. No reboot. Couldn’t ping it.

Came home and checked – it was frozen. Kdump hadn’t triggered. Great.

As I cooked dinner, I chatted with the sysadmins from other Opera departments on our internal IRC channel. They confirmed similar failures all around the world in our other datacentres. Everything from unmodified Squeeze kernels through to our 3.2.21 kernel. I would have been running something more recent, but the bnx2 driver has been broken with incorrect firmware since then – and guess what the Dell blades contain. Other brands of server crashed too, so it wasn’t just Dell.

Finding a solution

The first thing was to disable console blanking on all our machines. I’d been wanting to do that for a while, but had down-prioritised it after getting kdump working (so I thought). By now we were really suspicious of ntp and the leap second, but didn’t have “proof” – a screen capture of another crash with ntp listed would be enough for that. I disabled blanking everywhere – and then didn’t get another crash!

Another thing I did was post a question to superuser.com. Wrong site – thankfully someone moved it to serverfault.com where it belonged. My fault, I have been aware of these sites for a while, but not really participated. The discussion took a while to start up, but by the time the kids were asleep, it had exploded. Lots of people confirming the issue, and looking for workarounds.  I updated it multiple times as I had more information from other sysadmins and from my own investigations.

Our own Marco had blogged about solutions to the leap second, including smearing, similar to what Google have done. I admit I didn’t give his warnings the level of attention I should have – not that any of us expected what happened, but he was at least more aware of the risk than some of us cowboys.

Marco was also on IRC, and talked me through using the Perl code from his blog to clear the leap second flag from the kernel using adjtimex. I did that, and also prepared the script a bit for more general use and uploaded it to my file storage space to link into the serverfault question, where it could be useful to others.

By now my question was the top trending piece of tech news, and I was scoring badges and reputation points at a crazy rate. I had killed NTP everywhere on our servers and cleared the flag, so they were marching on oblivious to the leap second – no worries there. So I joined various IRC channels and forums and talked about the issue.

I did stay up until 2am Oslo time to watch the leap second in.  In the end the only thing to die was my laptop’s VPN connection, because of course I hadn’t actually fixed the local end (also running Linux).  There was a moment of excitement before I reconnected and confirmed that our servers were all fine.  10 minutes later, I restarted NTP and went to bed.

The aftermath: corruption

One of the compute server crashes had corrupted a spamassassin data file, enough that spam scanning was broken. It took user reports for us to become aware of it. We have now added a run of ‘spamassassin --lint’ to the startup scripts of our compute servers, so we can’t operate in this broken state again.

We also reinstalled the server. We reinstall at almost any excuse. The whole process is totally automated. The entire set of commands was:

# fo -a    (remember the alias from earlier)
# srv all stop
# utils/ReinstallGrub2.pl -r     (this one has no alias)
# reboot

and about 30 minutes later, when I remembered to go check

# c4 (alias for ssh compute4)
# srv all start
# c1 (the pair machine for compute4 is compute1)
# fo -m (-a is “all”, -m is “mastered elsewhere”)

And we were fully back in production with that server. The ability to fail services off quickly, and reinstall back to production state from bare metal, is a cornerstone of good administration in my opinion. It’s why we’ve managed to run such a successful service with less than one person’s time dedicated to sysadmin. Until 2 months ago the primary sysadmin was me – and I’ve also rewritten large parts of Cyrus during that time, and worked on other parts of our site too.

The other issue was imap3 – the imap server which crashed back during incident 2. After it came back up, I failed all Cyrus instances off it, so it’s only running replicas right now. But the backup script goes to the primary location by default.

I saw two backup error messages go by today (while eating my breakfast – so much for being on holiday – errors get sent to our IRC channel and I was still logged in).  They were missing file errors, which never happen.  So I investigated.

Again, we have done a LOT of work on Cyrus over the years (mostly, I have), and one thing has been adding full checksum support to all the file formats in Cyrus.  With the final transition to the twoskip internal database format I wrote earlier this year, the only remaining file that doesn’t have any checksum is the quota file – and we can regenerate that from our billing database (for limit) and the other index files on disk (for usage).

So it was just a matter of running scripts/audit_slots.pl with the right options, and waiting most of the day.  The output contains things like this:

# user.censored1.Junk Mail cyrus.index missing file 87040
# sucessfully fetched missing file from sloti20d3p2

# user.censored2 cyrus.index missing file 18630
# sucessfully fetched missing file from slots3a2p2

 The script has enough smarts to detect that files are missing, or even corrupted.  The cyrus.index file contains a sha1 of each message file as well as its size (and a crc32 of the record containing THAT data as well), so we can confirm that the file content is undamaged.  It can connect to one of the replicas using data from our “production.dat” cluster configuration file, and find the matching copy of the file – check that the sha1 on THAT end is correct, and copy it into place.

The backup system knows the internals of Cyrus well. The new Cyrus replication protocol was built after our backup tool, and it similarly checks the checksums of data it receives over the wire. At every level, our systems are designed to detect corruption and complain loudly rather than blindly replicating the corruption to other locations. We know that RAID1 is no protection, not with the amount of data we have. Corruption is very rare, but with enough disks, very rare means a few times a month. So paranoia is good.

Summary

All these layers of protection, redundancy, and tooling mean that with very little work, even while not totally awake, the entire impact on our users was an ~15 minute outage for the few users who were primary on imap3 (less than 5% of our userbase) – plus delays of 5-10 minutes on incoming email to 16.7% of users each time a compute server crashed.  While we didn’t do quite as well as Google, we did OK!  The spam scanning issue lasted a lot longer, but at least it’s fully fixed for the future.

We had corruption, but we have the tools to detect and recover, and replicas of all the data to repair from.


Building the new AJAX mail UI part 2: Better than templates, building highly dynamic web pages

This is part 2 of a series of technical posts documenting some of the interesting work and technologies we’ve used to power the new interface (see also part 1, Instant notifications of new emails via eventsource/server-sent events). Regular users can skip these posts, but we hope technical users find them interesting.

As dynamic websites constructed entirely on the client side become de rigueur, there are a number of templating languages battling it out to become the One True Way™ of rendering your page. All follow essentially the same style: introduce extra control tags to intersperse with HTML. But if we go back to basics, HTML is simply a way of serialising a tree structure into a text format that is relatively easy for humans to edit. Once the browser receives this, it then has to parse it to generate an internal DOM tree representation before it can draw the page.

In an AJAX style application, we don’t transmit HTML directly to the browser. Instead, we generate the HTML on the client side, and often update the HTML in different parts of the page over time as the user interacts with the application. As string manipulation for building HTML from data objects is hard to write and error-prone, we normally use a template language and a library that compiles these snippets into code; this executes with a data context, producing a string of HTML that may be set as an element’s innerHTML property. The browser then builds a DOM tree, which we can query to update nodes and add event listeners.
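
To make that conventional flow concrete, here is a rough sketch using Handlebars as a representative library – the markup, element ids and data below are invented for illustration, not taken from our code:

var source =
    '<div id="message">' +
        '<a class="biglink" href="{{url}}">{{label}}</a>' +
    '</div>';

// Compile the template once, then execute it with a data context to get HTML.
var template = Handlebars.compile( source );
var html = template({
    url: 'http://www.google.com',
    label: 'A link to Google'
});

// Parse the string into DOM nodes by setting innerHTML…
document.getElementById( 'container' ).innerHTML = html;

// …then query the DOM again to find the nodes we just created.
var link = document.querySelector( '#message .biglink' );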

There is, however, another alternative for building the DOM tree: directly in JavaScript. Modern browsers are very fast at parsing and executing JavaScript. What if, with the help of a liberal sprinkling of syntactic sugar, we were to build the DOM tree in code instead? Start by considering a simple function el to declare an element.

el( 'div' )

OK, so far we’ve just renamed the document.createElement method. What next? Well, we’re going to want to add class names and ids to elements a lot. Let’s use the CSS syntax which everyone knows and loves.

el( 'div#id.class1.class2' );

Hmm, that’s quite clean and readable compared to:

<div id="id" class="class1 class2"></div>

What else? Well, there may be other attributes. Let’s pass them as a standard hash:

el( 'div#id', { tabindex: -1, title: 'My div' })

That’s pretty neat. Let’s have a quick look at the html for comparison:

<div id="id" tabindex="-1" title="My div"></div>

A node’s not much use on its own. Let’s define a tree:

var items = [ 1, 2, 3, 4 ];
el( 'div#message', [
    el( 'a.biglink', { href: 'http://www.google.com' }, [
        'A link to Google'
    ]),
    el( 'ul', [
        items.map( function( item ) {
            return el( 'li.item', [ item + '. Item' ] );
        })
    ]),
    items.length > 1 ? 'There are lots of items'.localise() + '. ' : null,
    'This is just plain text. <script>I have no effect</script>'
])

So what have we achieved? We’ve got a different way of writing a document tree, which is essentially very similar to HTML but changes the punctuation slightly to make it valid JavaScript syntax instead. So what? Well, the point is this readable declaration is directly executable code; we just need to define the el function: https://gist.github.com/1532562. As it’s pure JS, we can replace static strings with variables. We can easily add conditional nodes, as shown in the example above. We can call other functions to generate a portion of the DOM tree or use array iterators to cleanly write loops. Wrap it all in a function and we can pass different data into the function each time to render our DOM nodes… we have ourselves a template.
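
So what might el look like? The version we actually use is the one in the gist linked above; purely as an illustrative sketch (with a hypothetical appendChildren helper to flatten nested arrays and skip null entries), something along these lines would do:

function el( spec, attrs, children ) {
    // Allow el( spec, children ) with no attributes hash.
    if ( attrs instanceof Array ) {
        children = attrs;
        attrs = null;
    }
    // Split 'div#id.class1.class2' into tag name, id and class names.
    var parts = spec.split( /([#.])/ ),
        node = document.createElement( parts[0] || 'div' ),
        classes = [],
        i, key;
    for ( i = 1; i + 1 < parts.length; i += 2 ) {
        if ( parts[ i ] === '#' ) {
            node.id = parts[ i + 1 ];
        } else {
            classes.push( parts[ i + 1 ] );
        }
    }
    if ( classes.length ) {
        node.className = classes.join( ' ' );
    }
    // Any other attributes come in as a plain hash.
    if ( attrs ) {
        for ( key in attrs ) {
            node.setAttribute( key, attrs[ key ] );
        }
    }
    appendChildren( node, children );
    return node;
}

// Strings become text nodes (so they are never parsed as HTML), nested arrays
// (e.g. the result of items.map above) are flattened, and null entries from
// conditionals are skipped.
function appendChildren( node, children ) {
    if ( !children ) { return; }
    for ( var i = 0; i < children.length; i += 1 ) {
        var child = children[ i ];
        if ( child == null ) {
            continue;
        }
        if ( child instanceof Array ) {
            appendChildren( node, child );
        } else if ( typeof child === 'string' ) {
            node.appendChild( document.createTextNode( child ) );
        } else {
            node.appendChild( child );
        }
    }
}

That’s essentially the entire “templating library”, which is why the sugar code ends up so much smaller than a conventional template engine.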

Performance

While innerHTML used to be much faster than JS DOM methods, this no longer holds for modern browsers. Let’s have a look at a benchmark: http://jsperf.com/innerhtml-or-dom/4

Here we have four different methods of rendering the same bit of HTML. This is a real-world snippet, taken from a core part of our new webmail application (https://beta.fastmail.fm), with just a few class names changed. Let’s first look at the hand-optimised innerHTML method and hand-optimised DOM method. In Chrome the DOM version is over 50% faster than using innerHTML and in Safari it’s 45% faster. Firefox is just as fast with either, while Opera is marginally faster using innerHTML. IE is still twice as fast using innerHTML rather than DOM methods. Perhaps most interesting though is to look at mobile browser performance. On desktop, computers are fast enough these days that the performance differences are less of an issue. On mobile it’s crucial, and here we find that the DOM method is anywhere from 45% to 100% faster in mobile WebKit browsers, such as Safari on the iPhone and the default Android browser, and level with innerHTML on Opera Mobile.

A few things to note before we look at the real-world tests. Firstly, for maximum speed, the innerHTML method is assuming all text is already escaped; a very dangerous assumption. The DOM method on the other hand needs to make no such assumptions, as text is added to the DOM tree by creating text nodes. Since the text is never parsed as HTML, there is zero chance of accidentally injecting a malicious script tag. Secondly, if you need a reference to any of the DOM nodes you’re creating (for example to save for updating later or to add event listeners), with the innerHTML method you must query the DOM after you’ve constructed it. With direct DOM construction, you already have the node reference; you just save it as you create it.
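
As a quick hypothetical illustration of that second point, using the el function described above – the node reference is simply saved as the tree is built, with no querying afterwards:

var link; // we want this node later, e.g. to attach a click listener
var message = el( 'div#message', [
    link = el( 'a.biglink', { href: 'http://www.google.com' }, [
        'A link to Google'
    ])
]);

// No DOM query needed – we already hold the reference.
link.addEventListener( 'click', function( event ) {
    event.preventDefault();
    // …handle the click here.
}, false );

document.body.appendChild( message );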

The hand-optimised functions in the benchmark are fast, but they’re unmaintainable and a pain to write. Let’s move on to something we would use on a real website.

Handlebars is a popular JS templating language, and claims to be one of the fastest around. It produces a string for use with innerHTML to construct the DOM elements. Let’s compare that to the JS declarative approach I outlined above (which I’m going to call Sugared DOM). Compared to the raw methods, the Sugared DOM was more-or-less equal in performance to the hand-optimised innerHTML in Chrome and Safari, even on the iPhone. It’s equal to or faster than Handlebars templates (sometimes by a significant margin) in all browsers other than IE, and crucially on mobile browsers it’s anywhere from 50% to 100% faster. Note too that the initial compilation time for Handlebars templates is not included in these benchmarks.

Conclusion

On almost all modern browsers the Sugared DOM method is faster than normal templates, even when ignoring the compile-time cost the latter have. There are other benefits as well:

  • Easy to debug (the template declaration is the code).
  • The sugar code is much smaller than any decent templating library.
  • No need to query the DOM, as you can just save references to nodes you’ll need later as you create them. This is faster and may remove the need for a whole JS library you currently use (like Sizzle).
  • No escaping worries; zero chance of XSS bugs. When you include a string in the declaration it is explicitly set as a text node, so is never parsed as HTML. <script> tags are harmless!
  • No extraneous white-space text nodes. White space between block-level nodes in HTML does not affect the rendering, but it does add extra nodes to the DOM. These can be a pain when you’re manipulating the DOM later (the firstChild property may not return what you expect) and they increase the memory usage of the page.
  • As it’s pure JS, the templates can be easily included inline as part of view classes that also handle the behaviour of the view, or kept in separate files.
  • JSHint will validate your syntax; much easier than tracking down syntax errors from a template’s compiler.
  • Flexibility to use the full power of JS; easily call other functions to generate parts of your DOM tree, localise a string, or do whatever else you like.

What are the downsides? Well, it’s slightly slower in Internet Explorer (although still plenty fast enough in real world use) and the difference in syntax to HTML may take a little time to become accustomed to, especially if templates are written by designers rather than coders (then again every template introduces its own syntax, so I’m not sure there’s much difference here). And, err, I think that’s about it.

It’s time to ditch HTML based templates. Embrace the DOM, and enjoy your powerful, fast and readable new way to render pages.

Written by Neil Jenkins


New File Storage backend

You may notice that the Files screen loads quite a lot faster now, particularly if you have many folders.

The File Storage backend has been in need of an overhaul for a long time. In order to have reliable cache expiry, it was quite single-threaded. We have been throwing around various ideas about how to make those changes for a while, and today I finally set a full day aside, ignored all other issues, and tested a few things.

One major thing has changed since I first wrote the VFS code 6 years ago. We’re not constrained by memory for metadata any more. The VFS had small limits set to avoid blowing out the memory on an individual web process. Well. Our smallest web server has 24GB RAM now. The newer ones have 48GB. The smallest DB server has 64GB RAM. There’s no point in caching hot data to disk, because it will be in memory anyway.

So the eventual change was to throw away all the caching layers except one very temporary in-memory one. There was one disk layer (tmpfs) and two in-memory layers of caching before, so it probably actually saves memory anyway.

The code was also very general, which is fine – but a couple of carefully thought out queries later, I could make one DB fetch to get the full directory tree, plus metadata, and pre-populate the in-memory cache with the fields it was about to ask for.  This, again, is much more efficient than pulling the data from a local cache and checking it for staleness.

The end result – faster response, simpler code, and a few bugs (particularly with long-lived FTP connections) cleared up.

I also backported all the changes to the oldweb interface, so attachments on the compose screen still work, and the Files screen there still works.

The take-home lesson from all this: keep it simple, stupid. The caching complexity isn’t needed any more, if it ever was, and the simpler architecture will help. I didn’t even have to make any DB schema changes (except dropping a couple of no-longer-used cache management tables).

There should be no user-visible changes from any of this.  The APIs are all identical for our application layers: webdav, ftp, websites and the Files screen are all the same.
