Apple Mail “bug” turns out to be user script after all

This is a technical blog post which gives updates on the previous post. There is no need to read this unless you’re particularly interested in IMAP protocol issues and/or gossip.

Of course it’s hard to give updates quite the same level of press that the original post got. The internet does outrage far better than it does corrections.

The OS X Mail team have been really helpful in tracking this down. I put them in direct contact with the user to do further debugging. A very embarrassed user discovered an AppleScript he wrote years ago to move mail from OS X Mail’s “semantic junk” folder to the real Junk folder at FastMail, where our bayes trainer could learn from it.

What has changed is that OS X Mail now correctly detects the \Junk special-use on the folder at our server, and uses that folder as its semantic junk folder – meaning he was moving messages from that folder back into the same folder. Which raises another interesting question:

Could we stop this happening again?

There are three main places I can see that it could have been stopped:

  1. OS X Mail could have detected a “move” from a folder to the same folder and declined to perform it
  2. FastMail could have detected a copy to the same folder and rejected it
  3. FastMail could have detected that the same message had occurred “too many times” in the past and rejected the copy

Interestingly, none of these would have stopped the infinite loopiness. The script was pretty basic, and would have found messages in “Junk Mail” every time it ran.

There’s no way to avoid infinite loops in a sufficiently powerful language. You solve the “halting problem” when the user’s computer catches fire due to excess CPU usage, but otherwise it will just sit there eating power and making the computer feel slow.

With option (1) the rejection would have been local, and at minimal cost.

Option (2) costs more – because it causes a network round-trip each time, and it breaks theoretically useful usage (duplicating an email… not entirely sure how that’s useful in the case where you KNOW that it’s the same message – but the standard doesn’t say to reject it).

Option (3) is less breakage than option (2), because it only blocks the copy when it detects multiple past copies – but it’s expensive for the server – Cyrus doesn’t keep an index by GUID on the mailbox, so it would have to detect “copy to self” and do a full pass keeping a counter for each message and seeing if there were already “too many copies”. It’s more code and more complexity. It does bring a level of robustness against weird behaviour though.
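To make that concrete, here’s a toy sketch of options (2) and (3) in Python. This is illustrative only, not Cyrus code – the folder_index mapping (folder name to the list of message GUIDs currently stored in it) and the MAX_COPIES threshold are invented for the example:

from collections import Counter

MAX_COPIES = 10   # arbitrary threshold, purely for illustration

def check_copy(src_folder, dst_folder, guids, folder_index):
    if src_folder == dst_folder:
        # Option (2) would simply reject the copy-to-self right here.
        # Option (3) instead does a full pass over the destination
        # (there is no per-mailbox GUID index) and only rejects when a
        # message already has "too many" copies.
        counts = Counter(folder_index[dst_folder])
        if any(counts[g] >= MAX_COPIES for g in guids):
            return (False, 'NO Too many copies of the same message')
    return (True, 'OK Completed')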

Who should fix it?

Obviously, I think that client authors should be careful in what they send. They have the most information about the users’ intent, and the most ability to give useful feedback to the user about why the action is being rejected.

But I know it’s tilting at windmills. The OS X Mail team have been fantastic now that I’m in contact with them, but they are one of many clients, and it only takes one.

For now, I’ve put in a syslog statement that tells me how often clients are copying/moving messages to the same folder. The answer is – quite a lot. Some are OS X Mail (including older versions). Some don’t identify themselves, but I’ve found at least Outlook 14 in the “X-Mailer” fields of Sent Messages from the same IP address… so that’s a clue.

I’m guessing I really have to implement some variant of (3) if I don’t want to break any existing users. Hopefully the OS X Mail team will implement (1).

An aside on “detecting what’s happening at the server”

OS X Mail isn’t using the recently standardised “MOVE” command, it’s using the “COPY” command followed by “EXPUNGE” on the source messages – which is part of the original standard, and works on every server.

If we answered success to the COPY but didn’t actually create new copies of the message, it would mean that the messages would be silently deleted a moment later. Handy in this case, but hardly good behaviour in general! A user could accidentally wipe their entire INBOX.
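For the curious, here’s roughly what the two styles look like from the client side, sketched with Python’s imaplib (the host, credentials and UID are placeholders):

import imaplib

# Older Pythons don't know the MOVE command, so teach imaplib about it.
imaplib.Commands['MOVE'] = ('SELECTED',)

M = imaplib.IMAP4_SSL('imap.example.com')   # placeholder host
M.login('user', 'password')                 # placeholder credentials
M.select('INBOX')

if 'MOVE' in M.capabilities:
    # The recently standardised way (RFC 6851): one atomic command.
    M.uid('MOVE', '42', 'Archive')
else:
    # The original three-step dance, which works on every server.
    M.uid('COPY', '42', 'Archive')
    M.uid('STORE', '42', '+FLAGS.SILENT', r'(\Deleted)')
    M.uid('EXPUNGE', '42')   # needs UIDPLUS; plain EXPUNGE otherwise

M.logout()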

This is a general problem that I have with IMAP. It’s a very imperative language. Rather than describing what you want done, you send explicit instructions against a very rigid model. Not only that, but there have been tons of extensions. You can use them if they exist – but a server can advertise itself as “IMAP” and support none of them, not even really useful basic things like UIDPLUS. Check this out:

http://imapwiki.org/Specs

This means that clients either have to have two codepaths – one for servers with the extension, and one for servers without – or they just use the “without” codepath for every server.
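Here’s what that dual codepath looks like in practice, again as a Python imaplib sketch against a hypothetical server – deleting a few messages either via UIDPLUS or via the lowest common denominator:

import imaplib

M = imaplib.IMAP4_SSL('imap.example.com')   # placeholder connection
M.login('user', 'password')
M.select('INBOX')

# Flag messages 100-102 for deletion either way.
M.uid('STORE', '100:102', '+FLAGS.SILENT', r'(\Deleted)')

if 'UIDPLUS' in M.capabilities:
    # Extension path: UID EXPUNGE (RFC 4315) removes just these messages.
    M.uid('EXPUNGE', '100:102')
else:
    # Fallback path: plain EXPUNGE removes *every* \Deleted message in
    # the folder, including any flagged by some other client.
    M.expunge()

M.logout()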

If clients do have two different codepaths, the chance of bugs is more than doubled, because they will interact in interesting ways. Another issue I debugged this week was an issue with Thunderbird 24.0.1 not noticing that messages had been deleted from the server. The server was sending the right data, but Thunderbird wasn’t recognising it. Turning off support for CONDSTORE fixed it, which is why it “worked” on other servers which didn’t have CONDSTORE support, or didn’t have it switched on.

So the clients try to convert the intentions of the user into changes on the server, and they have to do so through a set of very small commands which each do one thing. There’s no way for the client to say “I need all the following information”, or “here’s the transformation to the data model that the user requested”. And then the server, if it wants to optimise at all, has to guess the intent from the commands as they come in. Some servers try to read ahead to see if the client has pipelined (made multiple requests before waiting for the answers). Our server doesn’t – it just takes advantage of the fact that at least the files are “hot” in memory, so accessing them will be faster next time.

Some examples:

Unread counts? You have to ask for them separately for every folder. If you want to display all the folders with unread messages, you need to first get the full listing of folders, then for each one request the unread count in a separate command.
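Sketched with Python’s imaplib (placeholder server and account), that’s one LIST and then a separate STATUS round-trip for every single folder:

import imaplib

M = imaplib.IMAP4_SSL('imap.example.com')   # placeholder connection
M.login('user', 'password')

typ, folders = M.list()
for line in folders:
    # Crude parse: a LIST reply looks like
    #   (\HasNoChildren) "." "INBOX.Junk Mail"
    # though the delimiter varies between servers.
    name = line.decode().rsplit(' "." ', 1)[-1]
    typ, data = M.status(name, '(UNSEEN)')   # one extra round-trip per folder
    print(name, data[0].decode())

M.logout()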

Want the “first text part of every message”? Good luck – you need to fetch the bodystructure for each, and either batch together the messages whose first text part has the same part number, or just do one fetch per message.
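Again as a rough imaplib sketch (placeholder connection; the BODYSTRUCTURE parsing is elided because it’s genuinely fiddly):

import imaplib

M = imaplib.IMAP4_SSL('imap.example.com')   # placeholder connection
M.login('user', 'password')
M.select('INBOX')

# Round-trip 1: fetch the structure of each message.
typ, structs = M.fetch('1:20', '(UID BODYSTRUCTURE)')

# ...parse each BODYSTRUCTURE to find the part number of its first text
# part (often, but not always, part "1"), batch the messages that share
# a part number, then issue one more FETCH per batch:
typ, bodies = M.fetch('1,3,9', '(BODY.PEEK[1])')

M.logout()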

So it’s not surprising that IMAP clients have bugs. There is some necessary complexity to the problem of accessing email remotely – but plenty of the complexity has been caused by standards which have the client doing too much telling the server “how to do it” rather than “what needs to be done”.


Mac OS 10.9 – Infinity times your spam

UPDATE: the cause of the “infinite copy” bug was tracked down to a user side applescript. I have written a separate blog post about it.

This is a technical blog/rant. Users of FastMail who don’t use Apple’s email clients can safely skip it.

I’ve spent quite a lot of the past few days dealing with bugs caused by the Mail app in Apple’s new Mavericks update.

Apple’s mail client has been buggy with IMAP connections forever. It was infamous a couple of years ago for creating folders called INBOX.INBOX.INBOX.INBOX.INBOX.INBOX (until it hit the mailbox limit). We now block those at create time because they were causing “interesting” problems as well as being confusing.

This release has other interesting bugs that I’ve looked into in the past couple of days. When you uncheck the “keep a copy of my sent email on the server” checkbox, it auto-rechecks itself. I confirmed that report myself on our test laptop.

I also confirmed that the ‘Archive’ folder (special-use \Archive) doesn’t appear in the folder listing, and neither do any subfolders if you have any (one of my accounts does – one per year for the past *mumble* years).

So we know it’s not the breakfast of champions. That’s not a giant surprise, it’s never been the strong point of that OS. The previous revision has problems too:

229.12 UID STORE 588201 +FLAGS.SILENT ($Junk Junk JunkRecorded)
229.12 OK Completed 
230.12 UID STORE 588201 -FLAGS.SILENT ($NotJunk NotJunk)
230.12 OK Completed 
231.12 UID EXPUNGE 588201
231.12 OK Completed 
232.12 UID STORE 588201 +FLAGS.SILENT (\Deleted)
232.12 OK Completed 

Anyone who can read IMAP can see that it tries to expunge the message BEFORE the \Deleted flag is set, which is pointless. UID EXPUNGE only deletes messages with the \Deleted flag set.

http://tools.ietf.org/html/rfc4315#section-2.1
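For reference, the order the client should have used – flag first, then expunge – as a Python imaplib sketch (placeholder connection; UID EXPUNGE also requires the server to advertise UIDPLUS):

import imaplib

M = imaplib.IMAP4_SSL('imap.example.com')   # placeholder connection
M.login('user', 'password')
M.select('INBOX')

M.uid('STORE', '588201', '+FLAGS.SILENT', r'(\Deleted)')  # flag it first
M.uid('EXPUNGE', '588201')  # now the UID EXPUNGE actually removes it

M.logout()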

I found _that_ one because our web interface expunges all \Deleted items, so users noticed they only got the expected behaviour across multiple clients if they left the web interface open at the same time.

Millions of messages

But this doozy takes the cake. I found it nearly a week ago when we had an IO error because a user’s cache file was overflowing the 32 bit offset counter that still exists in Cyrus (I have an experimental branch with 64 bit offsets, but it’s not ready for production yet).

They had 71 unique messages in their Junk Mail folder, but including expunged messages (we keep them for a week for backup purposes) there were over 1 MILLION separate entries in the index file. We de-duplicate on store, so the fact that there were over 100,000 copies of the most prolific message in the index didn’t totally flood the disk.

I noticed then that they were using 10.9’s mail app:

3.19 ID ("name" "Mac OS X Mail" "version" "7.0 (1816)" "os" "Mac OS X" "os-version" "10.9 (13A603)" "vendor" "Apple Inc.")

I wiped the expunged messages from the cache, emailed the user, and left it at that.

This morning I checked again: there were nearly a million messages again, so I enabled telemetry on the account and emailed the user a second time.

Here’s what I saw in the telemetry (one of many):

44.18 SELECT "INBOX.Junk Mail" (CONDSTORE)
[...]
45.18 FETCH 1:* (FLAGS UID MODSEQ) (CHANGEDSINCE 213634)
45.18 OK Completed (0.000 sec) 
46.18 IDLE
+ idling 
DONE
46.18 OK Completed 
47.18 CHECK
47.18 OK Completed 
48.18 UID COPY 3360991:3361069 "INBOX.Junk Mail"
* 158 EXISTS 
* 79 RECENT
48.18 OK [COPYUID 1318456612 3360991:3361069 3361070:3361148] Completed

Yes, you read that right. It’s copying all the email from the Junk Folder back into the Junk Folder again! This is legal IMAP, so our server proceeds to create a new copy of each message in the folder.

It then expunges the old copies of the messages, but it’s happening so often that the current UID on that folder is up to over 3 million. It was just over 2 million a few days ago when I first emailed the user to alert them to the situation, so it’s grown by another million since.

The only way I can think this escaped QA was that they used a server which (like gmail) automatically suppresses duplicates for all their testing, because this is a massively bad problem.

Tonight I discovered my second attempt to contact the user about this problem sitting in their Junk folder – copied 10 times already!

Given that my colleague had just been paged about high disk usage on that user’s server – usage which was growing fast, and which was reduced massively by expunging old deleted messages – it was time to lock the account until the faulty email client is disabled. We don’t lock user accounts lightly, but running a malfunctioning piece of software which affects other users, while not responding to our attempts at contact, qualifies. A promise to disable the faulty software will be enough to resume service.

Yes, Mail.app was cleaning up after itself, but we keep deleted emails for a week – and even though the de-duplicated message data wasn’t using extra disk space, over a million deleted emails were using enough meta database space that a disk had filled up. There are only 79 actual emails in this folder, with a total usage of about 2MB, yet the meta files:

91M cyrus.index
906M cyrus.annotations
1.2G cyrus.cache

Over 2GB of disk usage.

Solving problems

The sad thing is – there are about 600 copies of Cyrus on our production farm, and I can roll out a new copy in about 5 minutes. There are umpteen million copies of Mail.app out there, and I can’t get them fixed on any particular schedule – so if this happens with more than one user the only real solution that I have is to code a workaround directly into our server to protect our other users.

We already wrote a special case for another one of Apple’s brilliant ideas – make the search box default to a full text search of every mailbox. The most expensive possible option for the server.

We have rate limits for other things, but we’ve never considered needing a rate limit for the COPY command – it would usually hit quota, but because these messages are expunged as fast as they are created, the quota doesn’t catch this issue.
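If we do add one, something like a per-user token bucket is the obvious shape. A toy sketch, with thresholds invented purely for illustration:

import time

class CopyRateLimiter:
    """Toy token bucket: allow a burst of copies, then throttle."""

    def __init__(self, rate=1.0, burst=100):
        self.rate = rate              # tokens replenished per second
        self.burst = burst            # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, n=1):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False                  # answer the COPY with a NO

# One limiter per user (or per folder), consulted before each COPY:
limiter = CopyRateLimiter()
if not limiter.allow(79):
    print('NO [LIMIT] Too many copies, slow down')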

The 4 billion message 32 bit limit of the UID field will become interesting on that folder eventually too, which is another thing we’ve never hit in production before. The workaround here is at least known – create a new folder, copy the messages across, delete the old folder, and rename the new folder into place, with a new UIDVALIDITY and new messages. (I originally wrote “4 million” here – as many people have pointed out to me, it’s 4 billion; so much for last minute ideas when writing late at night. Sorry for the confusion. It would take a lot longer than a few days to hit the limit!)
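Here’s that workaround sketched from the client side with Python’s imaplib – in reality it would be done server-side, and the connection details and temporary folder name are placeholders:

import imaplib

M = imaplib.IMAP4_SSL('imap.example.com')   # placeholder connection
M.login('user', 'password')

M.create('"INBOX.Junk Mail Tmp"')
M.select('"INBOX.Junk Mail"')
M.copy('1:*', '"INBOX.Junk Mail Tmp"')      # copy the real messages across
M.close()
M.delete('"INBOX.Junk Mail"')               # drop the folder with runaway UIDs
M.rename('"INBOX.Junk Mail Tmp"', '"INBOX.Junk Mail"')
# The recreated folder gets a fresh UIDVALIDITY, so clients resync cleanly.

M.logout()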


iOS 7 Mail App uses multi-folder body searches by default

This is a technical post. Regular FastMail users subscribed to receive email updates from the FastMail blog can just ignore this post.

We’ve recently been testing out the Mail application in iOS 7 and found that there have been a couple of significant changes, especially when you search your email. If you run an IMAP server with users that connect using the iOS Mail application, you might be interested in what we found. The two main changes are:

  1. Rather than searching the current folder, by default, it searches all folders. To do that, it opens many IMAP connections at once, searching each folder concurrently. We’re not sure of the upper bound on the number of connections it will make, but saw at least 10 in one case.
  2. Rather than searching the Subject/To/Cc/From fields, by default, it searches all those fields as well as the message body.

Both of these changes are actually great for the user experience, but they create potentially large headaches for IMAP server maintainers.

Clients making multiple IMAP connections at once isn’t new, but the number of potential IMAP connections an iOS client might now make is large. Some administrators limit the total number of IMAP connections a user can make at once. They might have to raise this or iOS might start returning unexpected results.

The bigger concern is the body searching. The search string iOS now sends is:

tag UID SEARCH RETURN (ALL) (OR FROM "Foo" (OR SUBJECT "Foo" (OR TO "Foo" (OR CC "Foo" BODY "Foo")))) NOT DELETED

IMAP search semantics suggest that a body search is supposed to be a pure sub-string search (BODY <string>: Messages that contain the specified string in the body of the message). Depending on your IMAP server, your message bodies may or may not be indexed in a way that allows sub-string searching. If they’re not, then a BODY search is potentially very expensive, requiring every message to be opened and searched individually. Doing that simultaneously across many folders might generate a sudden and significant IO load on a server.

Our plan at FastMail is to detect iOS clients and convert all searches into FUZZY searches. This causes matches to be done on “terms” rather than pure sub-strings, but allows us to use our Xapian-powered index, which should make matching and fetching results much, much quicker.
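As an illustration of the idea (this is not our actual server patch), a Python rewrite that prefixes the expensive body search keys with FUZZY (RFC 6203):

import re

def fuzzify(search):
    # Prefix BODY/TEXT search keys with FUZZY so they hit the term index
    # instead of forcing a per-message sub-string scan.
    return re.sub(r'\b(BODY|TEXT)(?= ")', r'FUZZY \1', search)

query = ('(OR FROM "Foo" (OR SUBJECT "Foo" (OR TO "Foo" '
         '(OR CC "Foo" BODY "Foo")))) NOT DELETED')
print(fuzzify(query))
# ... (OR CC "Foo" FUZZY BODY "Foo") ...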


Fastmail uses perfect forward secrecy with https/TLS connections

There have been a number of articles recently about perfect forward secrecy (PFS). The main aim of PFS is to ensure that even if the private SSL/TLS key for www.fastmail.fm were ever compromised, it would still be impossible to decrypt any existing captured traffic between users and our server. If you’re looking for more information, the linked articles above are worth reading to get a better overview. For PFS to work, both the server (us) and the client (your web browser) must support it.

Fastmail has supported PFS via ECDHE for some time now (since July 2012). Unfortunately a few browsers don’t support ECDHE.

Today we’ve updated our ciphers to the best practice recommended by SSL Labs. Using the SSL Labs site tester on www.fastmail.fm shows that we now support PFS on all major browsers except for IE 8 on Windows XP, which has no support for PFS and so can never support it.

We’re pretty sure that this change won’t have any compatibility issues with old clients (which should fall back to older ciphers), but we’ll keep an eye out in case there are any reported problems.


Push events, NAT TCP connection timeouts, and device sleep

This is a technical post. Regular FastMail users subscribed to receive email updates from the FastMail blog can just ignore this post.

When we released the new user interface last year, one of the improvements included was push updates when new emails arrived.

In theory, push events are conceptually quite easy to do. We open a connection from your web browser to the server (see this blog post for details), then when a new email arrives, we send a message down the open connection to let the browser know. It then fetches the details about the new email(s) and refreshes the display.

Unfortunately, in the real world, it’s not quite that easy. The biggest problem is that when a mailbox is mostly idle (no new mail arriving), the connection from the browser to the server will be idle. While this shouldn’t be a problem, it turns out it often is.

As we have noted before, some of our users are behind NAT gateways/stateful firewalls that have short state timeouts. If you leave a TCP connection idle for too long (variable from 2 to 30 minutes depending on the device), these start dropping any new packets on the connection.

In the case of a push connection from the server to the client, this is particularly bad. When a new email arrives, the server will try and send data to the client, and then be told the connection is dead at that point. That’s fine for the server, it can then clean up the connection. However, the client will never see any data from the server, and neither will the client ever know that the gateway/firewall has broken the connection. The client will think it is still connected to the server and has no way of knowing that the connection has actually been broken. This is purely a consequence of the way the TCP protocol works. The only way for the client to be able to tell the connection is broken is to send some data down the connection, and there are only 2 ways that can happen.

  1. If the client has enabled TCP keepalive on the socket (see the sketch after this list). Currently only Chrome on Windows does this.
  2. If the client sent some data down the connection to the server. Unfortunately the eventsource specification doesn’t provide any way to do this; it basically assumes the underlying TCP connection is always reliable and only the server can send to the client.
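For the curious, here’s what enabling keepalive looks like at the socket level, sketched in Python – the tuning knobs shown are Linux-specific, and a browser would do the equivalent internally:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific tuning: start probing after 5 idle minutes, probe every
# 60 seconds, and declare the connection dead after 5 missed probes.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)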

One way to try and work around this issue is for the server to send regular “ping” events to the client, sufficiently often that the gateway/firewall knows the connection is still alive. This is relatively straightforward to do, but causes other problems.

If the ping events come too fast, it can cause some clients to never go into sleep mode. For instance, we used to send ping events every 60 seconds. It was noted that on an iPad if you left the FastMail webpage open in Safari and put the iPad down, the iPad itself would never actually go to sleep. The screen would stay on, draining the battery very quickly.

Because of that, we decided to go the other direction and disabled the ping events, but that ends up back at the other end of the scale where sometimes push just seems to randomly stop working.

As there’s no perfect solution to this problem, we’re now changing again to a new trade-off.

  1. The server will send regular “ping” events to the client at 5 minute intervals. This should be enough for most gateways/firewalls to keep the connection open, but long enough apart to allow devices to go to sleep.
  2. If the client doesn’t see a ping event after 6 minutes, it assumes the connection has died, drops the existing connection and creates a new one. This should at least allow push events to work to some extent on connections with gateways/firewalls with low timeouts.
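The client side of rule (2) is essentially a read loop with a watchdog timeout. A rough Python equivalent of what the browser code does (the host and the event handler are placeholders – our real client is the browser’s EventSource):

import socket
import time

def handle_event(data):
    print('event:', data)              # placeholder: the real client updates the UI

def push_loop(host, port):
    while True:
        sock = socket.create_connection((host, port))
        sock.settimeout(6 * 60)        # a ping is due at least every 5 minutes
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break              # server closed the connection cleanly
                handle_event(data)
        except socket.timeout:
            pass                       # six minutes of silence: assume it's dead
        finally:
            sock.close()
        time.sleep(1)                  # brief pause, then reconnect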

This change has now been rolled out everywhere. Based on initial testing, we think that this time we’ve got the balance between theory and reality right.


Intermittent bayes db corruption resolved

This is a technical post that describes the history and recent efforts to track down a bug that was corrupting some users’ bayes databases. Fastmail users subscribed to receive email updates from the Fastmail blog can ignore this post if they are not interested.

Over the past few years, we’ve had sporadic reports of users’ bayes databases being corrupted and reset back to empty. When this happened, it would cause email delivery for that user to fall back to using the global bayes database, which decreased the overall accuracy of their spam detection until they retrained the database with more spam and non-spam messages.

I had tried multiple times to track down what was causing this issue, but each time with no luck. Each time the problem occurred, there was an error message in the logs of this form.

bayes: bayes db version 0 is not able to be used, aborting!

Often searching the internet for an error message will find other people that have had the same problem and tracked down the solution, but in this case it didn’t. Each time I tried to work through the code to see what was going wrong, I reached a dead end and couldn’t see any obvious problem.

Since the corruptions were very intermittent, and a lost bayes database isn’t critical – it doesn’t cause email to be lost or inaccessible, and can be rebuilt just by reporting email as spam/non-spam again – tracking this down was always a bit of a lower priority issue.

Recently though, after one more corruption report too many, I decided once and for all to track down what was causing it. Bit by bit over the course of several weeks, I added more and more logging information to the server code to track down where in the code the problem was occurring.

The logging results proved to be very odd. In the vast majority of cases they showed that writing to a particular database worked fine, but every now and then, it caused data to be lost. Eventually I managed to create a reproducible test case. It turned out to be a very odd issue: performing a particular programming action with a database library worked fine the first 5 or 6 times, but on the 6th or 7th, it would cause data to become lost. Clearly something odd was happening in the lower level library code.

Fortunately there was a straightforward workaround to the problem, so I’ve now patched our code with the workaround. Over the last few weeks I’ve monitored the logs, which show that the original error message above has completely disappeared and no databases are being corrupted any more.

I’ve reported bugs to the underlying modules causing the problems, so hopefully long term they’ll be fixed as well.

https://rt.cpan.org/Public/Bug/Display.html?id=83060

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6901


Update to DNS hosting

We’ve rolled out a change to our DNS hosting abilities to switch our backend from tinydns to powerdns. We’d previously tried this change once before but had some problems and had to roll back. After some more development work and testing, we believe we’ve fixed all these issues and so have moved forward to powerdns again.

This change should initially be invisible to users, and things should continue to work as they were. In the long term, it will allow us to support more features and faster updates to DNS.


Inter-tab communication using local storage

A few weeks ago we launched our new webmail service for all users at FastMail. Once it was being used by a wider audience, we of course received reports of a few edge cases our testing hadn’t managed to uncover. One of the more interesting issues we discovered came from this use case: our user liked to scroll down his inbox, opening each email he wanted to read in a new tab in the background. Then he would go through the tabs, closing each one as he was done with it. So far, so good. Except, in Chrome, his browser of choice, as soon as about 5 tabs were open, the rest failed to load, and the earlier ones then started having communication errors as well.

A quick bit of research and testing yielded the problem: Chrome limits itself to a maximum of 6 concurrent connections to a single origin across the whole browser. Each tab was loading a full instance of the mail application, which meant it was creating an EventSource object and connection to our push server, to be notified of new deliveries (see this previous post for how that works). Since these connections are permanent (that’s the whole idea!), opening lots of tabs quickly used up all the available connections, with none left to fetch any actual data. To the user, this appeared as “Could not connect to server” error messages.

The solution to this problem was not immediately obvious. Ideally, we would like to maintain a single push connection and share it between the tabs, but there’s no API for getting a reference to other tabs or windows in the browser, even if they’re pointed to the same domain. Then I remembered that setting a property on local storage triggers a “storage” event on the window object of every open tab with the same origin. This, I realised, could be used to synchronise behaviour across tabs.

The concept is fairly simple. Only one tab keeps a push connection; we call this the master tab. When it receives a push event, it broadcasts it by setting the event as a property on local storage called “broadcast”. When a tab receives the storage event for this key, it reads the JSON-encoded event object from local storage and processes it as though it had been received via an EventSource object.

The tricky part is coordinating which tab should be master. The master tab also sets a value called “ping” in local storage to the current time stamp, roughly every 30 seconds. When a tab first loads, it checks for this value; if it is more than 45 seconds old, it presumes there is no current master, so it becomes master. Otherwise, it becomes a slave. However, whilst it is a slave, it continuously monitors for storage events with a key of “ping”, and if it hasn’t heard a ping within a 45 second period, it takes over as master. This switches control to another tab when the master tab closes. On browsers supporting the “unload” event we can make the changeover happen pretty much instantly, by setting the “ping” value to 0 in local storage when the tab is closed.

This all works very well, but there’s one problem remaining: race conditions. There is no API for taking out an explicit lock on local storage, so the spec advocates the use of a per-origin mutex which would be acquired by scripts once they try to access the storage, and then released when the script finishes. Not all browsers have adopted this. The Chrome developers, for example, have decided the performance penalty is too great. Therefore, in some browsers, it is possible for scripts in different tabs to interleave such that, for example, each tries to take master at the same time, then each notices another has taken it so none end up as master! The solution we have adopted is to add a random component to the delay between pings and waiting for pings. This makes it unlikely that two tabs will both attempt to take master at the same time. Of course this can still happen, but should it do so, the random variation in each new master sending out a ping should ensure that one is quickly turned back to a slave. It will be eventually consistent, which is good enough for our purposes.

In case this is of use to anyone else, here’s the code we use (rewritten slightly to use pure JS rather than be based on our library code). It’s also available as a gist on github. You can try it out on this test page; just open the page in several windows or tabs, then close the master and see the control pass to another. You can also broadcast a message from any tab to the other tabs.

function WindowController () {
    var now = Date.now(),
        ping = 0;
    try {
        ping = +localStorage.getItem( 'ping' ) || 0;
    } catch ( error ) {}
    if ( now - ping > 45000 ) {
        this.becomeMaster();
    } else {
        this.loseMaster();
    }
    window.addEventListener( 'storage', this, false );
    window.addEventListener( 'unload', this, false );
}

WindowController.prototype.isMaster = false;
WindowController.prototype.destroy = function () {
    if ( this.isMaster ) {
        try {
            localStorage.setItem( 'ping', 0 );
        } catch ( error ) {}
    }
    window.removeEventListener( 'storage', this, false );
    window.removeEventListener( 'unload', this, false );
};

WindowController.prototype.handleEvent = function ( event ) {
    if ( event.type === 'unload' ) {
        this.destroy();
    } else {
        var type = event.key,
            ping = 0,
            data;
        if ( type === 'ping' ) {
            try {
                ping = +localStorage.getItem( 'ping' ) || 0;
            } catch ( error ) {}
            if ( ping ) {
                this.loseMaster();
            } else {
                // We add a random delay to try to avoid the race condition in
                // Chrome, which doesn't take out a mutex on local storage. It's
                // imperfect, but will eventually work out.
                clearTimeout( this._ping );
                this._ping = setTimeout(
                    this.becomeMaster.bind( this ),
                    ~~( Math.random() * 1000 )
                );
            }
        } else if ( type === 'broadcast' ) {
            try {
                data = JSON.parse(
                    localStorage.getItem( 'broadcast' )
                );
                this[ data.type ]( data.event );
            } catch ( error ) {}
        }
    }
};

WindowController.prototype.becomeMaster = function () {
    try {
        localStorage.setItem( 'ping', Date.now() );
    } catch ( error ) {}

    clearTimeout( this._ping );
    this._ping = setTimeout( this.becomeMaster.bind( this ),
        20000 + ~~( Math.random() * 10000 ) );

    var wasMaster = this.isMaster;
    this.isMaster = true;
    if ( !wasMaster ) {
        this.masterDidChange();
    }
};

WindowController.prototype.loseMaster = function () {
    clearTimeout( this._ping );
    this._ping = setTimeout( this.becomeMaster.bind( this ),
        35000 + ~~( Math.random() * 20000 ) );

    var wasMaster = this.isMaster;
    this.isMaster = false;
    if ( wasMaster ) {
        this.masterDidChange();
    }
};

WindowController.prototype.masterDidChange = function () {};

WindowController.prototype.broadcast = function ( type, event ) {
    try {
        localStorage.setItem( 'broadcast',
            JSON.stringify({
                type: type,
                event: event
            })
        );
    } catch ( error ) {}
};


The technology behind the classic and new interfaces

I recently wrote a postmortem for our old interface; now I want to explain how the addition of a modern interface alongside our classic interface is different.

In short, classic is here to stay.

For all that the interface has looked similar over the past few years, it’s had many changes under the hood.

Much of the interface is fully internationalised, both in classic and new. The code is all shared with the My Opera Mail product, where multiple language support is a core requirement.

It works much more nicely on small devices, in particular with Opera Mini.

Where possible, changes for the new interface have been rewritten as shared “library code” and integrated into both interfaces simultaneously. Some things (search, for example) work differently. But most core logic, and of course all low level mail routing and storage, are fully shared across our infrastructure.

This all adds up to a pile of “invisible” work we have done to make maintenance easier in future. Even the new search uses the same query builder library, so back-porting fully cross-folder search capability to classic will be achievable if there is demand.

Unlike the old interface, which was a completely separate copy of the code and grew stale over the years, there was never a “fork” (as it’s called in software development) for the new interface.

Indeed, you may have noticed that many screens on the new interface are really just “rebranded” classic. It’s the same HTML code as the classic desktop and mobile screens, with a different title bar. When you go back to the Mail or Address tabs, it reloads the JavaScript and hands control back. This was a deliberate decision to speed up the areas of our site where people spend 99% of their time (statistic taken from logs, not made up) without duplicating the rarely used screens. The client-side mailbox screen uses less bandwidth and is more responsive than the classic mechanism of downloading an entire html page on every click.

When we say “supported indefinitely” it really does mean that we have no plans to remove classic. There’s no internal timeline in our heads. The core technology is used by both interfaces, and we’re updating them together.

Finally, to address concerns about continued IMAP access:

We have invested heavily in improvements to the Cyrus IMAP server, both myself and Greg Banks in the Australian office (who was hired to work full time on Cyrus, and is doing an awesome job).

Our new conversation features are built directly into Cyrus, and fully integrated into its replication system. Other features, like storing previews and undelete information along with messages, have been created by adding support to Cyrus for the standard message annotations described in RFC 5257, contributing that work back to the community.

You can read more about the Cyrus project at http://cyrusimap.org/. This reliable and standards compliant server is the core of our technology stack. We’re not moving away from IMAP, even as we extend the server to support our specific use-cases.

You can read (or even download and play with) the exact code that runs on our servers from
http://github.com/brong/cyrus-imapd/ – our production systems run on the “fastmail” branch.

Bron.


Changes to delete behaviour with conversations

Over the past week we have changed how deletion works in our new modern interface. In this blog post I will explain what those changes were, and some actions we have taken to ensure no emails are accidentally lost.

This is a technical blog post, so it contains a moderate level of technical detail.

I will address how our backup and disaster recovery system works, and how we used it this week to recover emails which we suspected to be accidentally deleted.

Some background

Last week we rolled out the new conversations-enabled interface. However, we discovered we had underestimated the impact of conversations on users’ existing workflow.

In particular, many users did not realise that when they selected a single item in a folder, it represented the entire conversation (all related messages, including those the user had sent).

When they pressed ‘Delete’ with one or more conversations selected, it deleted all messages in those conversations, including messages in other folders. For example, deleting a conversation in Inbox could also delete messages from “Sent Items” and “Important – Keep”.

We have altered the ‘Delete’ action to be safer in these ways:

  1. in a folder: only messages in the same folder are deleted from the selected conversations.
  2. when viewing search results: only messages which match the search query are deleted, messages which are in the conversations but outside the search are not.
  3. when an action will cause more than one message to be deleted from a conversation, a warning message is shown to describe what will happen. The user must explicitly disable this warning if they don’t want to see it again.

What about the time before these changes?

Rather than leaving users to hunt for which emails were affected, we wrote a tool to data-mine our mail server logs. We log every create and delete of emails, along with enough data to identify which ones were “Delete to Trash”. We can also identify whether the action came from an IMAP client or the web interface.

We found emails which could have been accidentally deleted using the following algorithm (sketched in code after the list):

  1. the action came from the web interface.
  2. more than one message from the same conversation was deleted within 10 seconds.
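Here’s a condensed sketch of that heuristic in Python, over hypothetical pre-parsed log events – dicts with source, session, cid, folder, uid and a Unix timestamp; our real log format differs:

from collections import defaultdict

def find_suspect_deletes(events):
    # Rule 1: only deletes that came from the web interface.
    by_conv = defaultdict(list)
    for e in events:
        if e['source'] == 'web':
            by_conv[(e['session'], e['cid'])].append(e)

    # Rule 2: more than one message from the same conversation deleted
    # within 10 seconds of each other.
    suspects = set()
    for group in by_conv.values():
        group.sort(key=lambda e: e['timestamp'])
        for a, b in zip(group, group[1:]):
            if b['timestamp'] - a['timestamp'] <= 10:
                suspects.add((a['folder'], a['uid']))
                suspects.add((b['folder'], b['uid']))
    return suspects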

All the emails which matched these criteria were restored back into the folders they were originally deleted from, with a custom keyword added. This makes it easy for users to find them again. Every affected user has been emailed with instructions on how to identify the restored emails.

How we restored data

When you delete an email on the FastMail servers, it isn’t immediately removed from disk, even if you manually expunge via IMAP. We do this:

  1. to guarantee that our “Restore from backup” feature can always find all your emails, even if they were delivered and then deleted in between backup runs.
  2. to make deletes appear faster to users.
  3. to reduce the load on our IMAP servers. Removing files is actually one of the slowest operations you can run on a modern filesystem.

So we actually batch up all deletes and run them once per week at the least busy time for our servers – Saturday night in the USA. It’s the weekend everywhere in the world then.

We also never remove email files within one week after deletion, so that our “Restore” feature can work as advertised.

This is, of course, in addition to the safety provided by replication to an offsite datacentre, and daily backups to a different server running a different operating system.

Immediate response

As soon as we realised we may have to restore emails, we disabled the automated weekend cleanup job, and started collecting data from our servers.

Discoverability

The problem is that it is hard to know that an email is not there unless you actively look for it. We could disable cleanup temporarily, but not forever. Our turnover is about 2% of total email volume per week, so the disks would fill up if we never deleted anything ever again.

We decided the safe way forwards was to undo every deletion which had even the slightest chance of being by accident.

That way, if no action is taken, a few extra emails sit on disk gathering dust. It’s possible at any later time to discover them and clean them up. There is no requirement to act quickly.

We take your privacy very seriously. No contents of emails were accessed during this task, and each user’s account was processed separately to ensure there was no risk of disclosing data. You can read more about our privacy policy here: https://www.fastmail.fm/help/overview_privacy.html.

Data collection

We log every single time a message is added to or expunged from any folder on our backend servers. We collected an initial dataset of nearly 30 million “Delete to Trash” events from the log files.

The next step was identifying which of these were a single action involving more than one message from the same conversation. Every message was tagged with a session identifier and timestamp as well as the folder and IMAP “UID” which uniquely identifies it, but we were not logging the CID (conversation identifier). We do now, but that doesn’t help with log lines from the past!

Finding the CID involved writing custom code to read the index file on disk (which still contains the deleted record) and extract the CID field for every deleted message.

Finally of course, there was processing the logs for every single connection from the web servers over that time frame and finding which deletes were related to each other. There’s nothing in the log to show that it’s the same command, so we applied a heuristic of “within 10 seconds” to account for the outside case of a busy server and large folders being processed.

Restoring messages

We use the Cyrus IMAP server. One of the utilities included is called ‘unexpunge’, and it can be used to recover deleted emails. This is different from our usual restore command, which extracts messages from various sources and appends them to a new temporary folder.

In this case we want to restore messages permanently, so unexpunge is the right tool – except that we want to tag every message with a keyword, and we want it to be reliable; finding the restored messages afterwards without one would be messy. We chose to add a new feature to unexpunge, setting a user-defined keyword on each message as it is restored. It is robust, and there’s no gap where messages appear without the keyword.

The chosen keyword is RESTORED-20121107. Our web interface already supports global keyword search with “flag:$name”, so the email to users includes a pre-generated URL which will perform a global search on all that user’s folders for messages which were restored.

Restores are in progress now. Once they are completed, thousands of users will have some messages restored. This is almost certain to include messages which were intended to be deleted, but we must err on the side of safety here.

We have built a very robust infrastructure because of our strong commitment to data safety. These restores are in line with this commitment. It is easier to delete unwanted messages again than to recover messages which no longer exist.
