Intermittent bayes db corruption resolved

This is a technical post that describes the history and recent efforts to track down a bug that was corrupting some users’ bayes databases. Fastmail users subscribed to receive email updates from the Fastmail blog can ignore this post if they are not interested.

Over the past few years, we’ve had sporadic reports of users’ bayes databases being corrupted and reset back to empty. When this happened, it would cause email delivery for that user to fall back to using the global bayes database, which decreased the overall accuracy of their spam detection until they retrained the database with more spam and non-spam messages.

I had tried multiple times to track down what was causing this issue, but without any luck. Each time the problem occurred, there was an error message of this form in the logs:

bayes: bayes db version 0 is not able to be used, aborting!

Often searching the internet for an error message will find other people who have had the same problem and tracked down a solution, but in this case it didn’t. Each time I tried to work through the code to see what was going wrong, I reached a dead end and couldn’t see any obvious problem.

Since the corruptions were very intermittent, and losing a bayes database isn’t critical (it doesn’t cause email to be lost or become inaccessible, and the database can be rebuilt just by reporting email as spam/non-spam again), tracking this down was always a lower priority issue.

Recently though, after one corruption report too many, I decided once and for all to track down what was causing it. Bit by bit over the course of several weeks, I added more and more logging information to the server code to narrow down where the problem was occurring.

The logging results proved to be very odd. In the vast majority of cases they showed that writing to a particular database worked fine, but every now and then data was lost. Eventually I managed to create a reproducible test case. It turned out to be a very strange issue: performing a particular action with a database library worked fine the first 5 or 6 times, but on the 6th or 7th it would cause data to be lost. Clearly something odd was happening in the lower-level library code.

Fortunately there was a straightforward workaround for the problem, so I’ve patched our code to use it. Over the last few weeks I’ve been monitoring the logs, which show that the original error message above has completely disappeared and no more databases are being corrupted.

I’ve reported bugs against the underlying modules causing the problems, so hopefully in the long term they’ll be fixed as well:

https://rt.cpan.org/Public/Bug/Display.html?id=83060

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6901


Update to DNS hosting

We’ve rolled out a change to our DNS hosting, switching our backend from tinydns to PowerDNS. We’d tried this change once before, but hit some problems and had to roll back. After some more development work and testing, we believe we’ve fixed all of those issues, so we’ve moved forward with PowerDNS again.

This change should initially be invisible to users, and things should continue to work as they were. In the long term, it will allow us to support more DNS features and faster updates.


Inter-tab communication using local storage

A few weeks ago we launched our new webmail service for all users at FastMail. Once it was in use by a wider audience, we of course received reports of a few edge cases our testing hadn’t managed to uncover. One of the more interesting issues we discovered came from this use case: our user liked to scroll down his inbox, opening each email he wanted to read in a new tab in the background. Then he would go through the tabs, closing each one as he was done with it. So far, so good. Except that in Chrome, his browser of choice, as soon as about 5 tabs were open, the rest failed to load, and the earlier ones then started having communication errors as well.

A quick bit of research and testing yielded the problem: Chrome limits itself to a maximum of 6 concurrent connections to a single origin across the whole browser. Each tab was loading a full instance of the mail application, which meant it was creating an EventSource object and connection to our push server, to be notified of new deliveries (see this previous post for how that works). Since these connections are permanent (that’s the whole idea!), opening lots of tabs quickly used up all the available connections, with none left to fetch any actual data. To the user, this appeared as “Could not connect to server” error messages.
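
For context, here’s roughly what each tab was doing (a minimal sketch; the “/push/events” URL and the payload handling are illustrative, not our actual push API):

var source = new EventSource( '/push/events' );

source.onmessage = function ( event ) {
    // event.data is whatever the push server sent; the app parses it and
    // refreshes its mailbox view in response.
    var push = JSON.parse( event.data );
    console.log( 'Push event received:', push );
};

source.onerror = function () {
    // EventSource reconnects automatically, so once the browser-wide
    // connection limit is hit, extra tabs just keep queueing and failing.
};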

The solution to this problem was not immediately obvious. Ideally, we would like to maintain a single push connection and share it between the tabs, but there’s no API for getting a reference to other tabs or windows in the browser, even if they’re pointed at the same domain. Then I remembered that setting a property on local storage triggers a “storage” event on the window object of every other open tab with the same origin. This, I realised, could be used to synchronise behaviour across tabs.
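
The raw mechanism is as simple as this (a minimal sketch):

// In one tab, write a value to local storage...
localStorage.setItem( 'broadcast', JSON.stringify({ hello: 'world' }) );

// ...and every *other* tab on the same origin receives a "storage" event.
window.addEventListener( 'storage', function ( event ) {
    if ( event.key === 'broadcast' ) {
        console.log( JSON.parse( event.newValue ) );
    }
}, false );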

The concept is fairly simple. Only one tab keeps a push connection; we call this the master tab. When it receives a push event, it broadcasts it by setting the event as a property on local storage called “broadcast”. When a tab receives the storage event for this key, it reads the JSON-encoded event object from local storage and processes it as though it had been received via an EventSource object.

The tricky part is coordinating which tab should be master. The master tab also sets a value called “ping” in local storage to the current timestamp roughly every 30 seconds. When a tab first loads, it checks for this value; if it is more than 45 seconds old, the tab presumes there is no current master and becomes master itself. Otherwise, it becomes a slave. However, whilst it is a slave, it continuously monitors for storage events with a key of “ping”, and if it hasn’t heard a ping within a 45 second period, it takes over as master. This switches control to another tab when the master tab closes. On browsers supporting the “unload” event, we can make the changeover happen pretty much instantly by setting the “ping” value to 0 in local storage when the tab is closed.

This all works very well, but there’s one problem remaining: race conditions. There is no API for taking out an explicit lock on local storage; instead, the spec advocates a per-origin mutex which would be acquired by scripts when they access the storage and released when the script finishes. Not all browsers have adopted this. The Chrome developers, for example, have decided the performance penalty is too great. Therefore, in some browsers, it is possible for scripts in different tabs to interleave such that, for example, each tries to take master at the same time, then each notices another has taken it, so none ends up as master! The solution we have adopted is to add a random component to the delay between pings and to the delay before taking over as master. This makes it unlikely that two tabs will both attempt to take master at the same time. Of course this can still happen, but should it do so, the random variation in when each new master sends out its next ping should ensure that all but one quickly revert to being slaves. It will be eventually consistent, which is good enough for our purposes.

In case this is of use to anyone else, here’s the code we use (rewritten slightly to use pure JS rather than being based on our library code). It’s also available as a gist on GitHub. You can try it out on this test page; just open the page in several windows or tabs, then close the master and watch control pass to another. You can also broadcast a message from any tab to the other tabs.

function WindowController () {
    var now = Date.now(),
        ping = 0;
    try {
        ping = +localStorage.getItem( 'ping' ) || 0;
    } catch ( error ) {}
    if ( now - ping > 45000 ) {
        this.becomeMaster();
    } else {
        this.loseMaster();
    }
    window.addEventListener( 'storage', this, false );
    window.addEventListener( 'unload', this, false );
}

WindowController.prototype.isMaster = false;
WindowController.prototype.destroy = function () {
    if ( this.isMaster ) {
        try {
            localStorage.setItem( 'ping', 0 );
        } catch ( error ) {}
    }
    window.removeEventListener( 'storage', this, false );
    window.removeEventListener( 'unload', this, false );
};

WindowController.prototype.handleEvent = function ( event ) {
    if ( event.type === 'unload' ) {
        this.destroy();
    } else {
        var type = event.key,
            ping = 0,
            data;
        if ( type === 'ping' ) {
            try {
                ping = +localStorage.getItem( 'ping' ) || 0;
            } catch ( error ) {}
            if ( ping ) {
                this.loseMaster();
            } else {
                // We add a random delay to try to avoid the race condition in
                // Chrome, which doesn't take out a mutex on local storage. It's
                // imperfect, but will eventually work out.
                clearTimeout( this._ping );
                this._ping = setTimeout(
                    this.becomeMaster.bind( this ),
                    ~~( Math.random() * 1000 )
                );
            }
        } else if ( type === 'broadcast' ) {
            try {
                data = JSON.parse(
                    localStorage.getItem( 'broadcast' )
                );
                this[ data.type ]( data.event );
            } catch ( error ) {}
        }
    }
};

WindowController.prototype.becomeMaster = function () {
    try {
        localStorage.setItem( 'ping', Date.now() );
    } catch ( error ) {}

    clearTimeout( this._ping );
    this._ping = setTimeout( this.becomeMaster.bind( this ),
        20000 + ~~( Math.random() * 10000 ) );

    var wasMaster = this.isMaster;
    this.isMaster = true;
    if ( !wasMaster ) {
        this.masterDidChange();
    }
};

WindowController.prototype.loseMaster = function () {
    clearTimeout( this._ping );
    this._ping = setTimeout( this.becomeMaster.bind( this ),
        35000 + ~~( Math.random() * 20000 ) );

    var wasMaster = this.isMaster;
    this.isMaster = false;
    if ( wasMaster ) {
        this.masterDidChange();
    }
};

WindowController.prototype.masterDidChange = function () {};

WindowController.prototype.broadcast = function ( type, event ) {
    try {
        localStorage.setItem( 'broadcast',
            JSON.stringify({
                type: type,
                event: event
            })
        );
    } catch ( error ) {}
};
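
And here’s a minimal sketch of how the controller might be used; the “newMessage” event type and the handler bodies are illustrative rather than our actual application code:

var controller = new WindowController();

// Invoked via this[ data.type ]( data.event ) whenever another tab sets
// the 'broadcast' key with type 'newMessage'.
controller.newMessage = function ( event ) {
    console.log( 'Received from another tab:', event );
};

// Called whenever master status changes from now on; the constructor may
// already have made this tab master, so check isMaster directly as well.
controller.masterDidChange = function () {
    console.log( 'isMaster is now', this.isMaster );
};
if ( controller.isMaster ) {
    // This is where the real app would open its single push connection.
    console.log( 'This tab started as master' );
}

// From the master tab, relay a push event to every other open tab.
controller.broadcast( 'newMessage', { subject: 'Hello' } );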


The technology behind the classic and new interfaces

I recently wrote a postmortem for our old interface; now I want to explain how the addition of a modern interface alongside our classic interface is different.

In short, classic is here to stay.

For all that the interface has looked similar over the past few years, it’s had many changes under the hood.

Much of the interface is fully internationalised, in both classic and new. The code is all shared with the My Opera Mail product, where multiple language support is a core requirement.

It works much more nicely on small devices, in particular with Opera Mini.

Where possible, changes for the new interface have been rewritten as shared “library code” and integrated into both interfaces simultaneously. Some things (search, for example) work differently. But most core logic, and of course all low level mail routing and storage, are fully shared across our infrastructure.

This all adds up to a pile of “invisible” work we have done to make maintenance easier in future. Even the new search uses the same query builder library, so back-porting full cross-folder search capability to classic will be achievable if there is demand.

Unlike the old interface, which was a completely separate copy of the code and grew stale over the years, there was never a “fork” (as it’s called in software development) for the new interface.

Indeed, you may have noticed that many screens on the new interface are really just “rebranded” classic. It’s the same HTML code as the classic desktop and mobile screens, with a different title bar. When you go back to the Mail or Address tabs, it reloads the JavaScript and hands control back. This was a deliberate decision to speed up the areas of our site where people spend 99% of their time (statistic taken from logs, not made up) without duplicating the rarely used screens. The client-side mailbox screen uses less bandwidth and is more responsive than the classic mechanism of downloading an entire HTML page on every click.

When we say “supported indefinitely” it really does mean that we have no plans to remove classic. There’s no internal timeline in our heads. The core technology is used by both interfaces, and we’re updating them together.

Finally, to address concerns about continued IMAP access:

We have invested heavily in improvements to the Cyrus IMAP server, through work by both myself and Greg Banks in the Australian office (who was hired to work full time on Cyrus, and is doing an awesome job).

Our new conversation features are built directly into Cyrus, and fully integrated into its replication system. Other features, like storing previews and undelete information along with messages, have been created by adding support to Cyrus for the standard message annotations described in RFC 5257, and contributing that work back to the community.

You can read more about the Cyrus project at http://cyrusimap.org/. This reliable and standards-compliant server is the core of our technology stack. We’re not moving away from IMAP, even as we extend the server to support our specific use-cases.

You can read (or even download and play with) the exact code that runs on our servers from
http://github.com/brong/cyrus-imapd/ – our production systems run on the “fastmail” branch.

Bron.


Changes to delete behaviour with conversations

Over the past week we have changed how deletion works in our new modern interface. In this blog post I will explain what those changes were, and some actions we have taken to ensure no emails are accidentally lost.

This is a technical blog post, so it contains a moderate level of technical detail.

I will address how our backup and disaster recovery system works, and how we used it this week to recover emails which we suspected to be accidentally deleted.

Some background

Last week we rolled out the new conversations-enabled interface. However, we discovered we had underestimated the impact of conversations on users’ existing workflows.

In particular, many users did not realise that when they selected a single item in a folder, it represented the entire conversation (all related messages, including those the user had sent).

When they pressed ‘Delete’ with one or more conversations selected, it deleted all messages in those conversations, including messages in other folders. For example, deleting a conversation in Inbox could also delete messages from “Sent Items” and “Important – Keep”.

We have altered the ‘Delete’ action to be safer in these ways:

  1. in a folder: only messages which are in that folder are deleted from the selected conversations.
  2. when viewing search results: only messages which match the search query are deleted; messages which are in the conversations but outside the search results are not.
  3. when an action will cause more than one message to be deleted from a conversation, a warning message is shown to describe what will happen. The user must explicitly disable this warning if they don’t want to see it again.

What about the time before these changes?

Rather than leaving users to hunt for which emails were affected, we wrote a tool to data-mine our mail server logs. We log every create and delete of emails, along with enough data to identify which ones were “Delete to Trash”. We can also identify if the action came from an IMAP client or the web interface.

We found emails which could have been accidentally deleted using the following algorithm:

  1. the action came from the web interface.
  2. more than one message from the same conversation was deleted within 10 seconds.

All the emails which matched these criteria were restored back into the folders they were originally deleted from, with a custom keyword added. This makes it easy for users to find them again. Every affected user has been emailed with instructions on how to identify the restored emails.

How we restored data

When you delete an email on the FastMail servers, it isn’t immediately removed from disk, even if you manually expunge via IMAP. We do this:

  1. to guarantee that our “Restore from backup” feature can always find all your emails, even if they were delivered and then deleted in between backup runs.
  2. to make deletes appear faster to users.
  3. to reduce the load on our IMAP servers. Removing files is actually one of the slowest operations you can run on a modern filesystem.

So we actually batch up all deletes and run them once per week at the least busy time for our servers – Saturday night in the USA. It’s the weekend everywhere in the world then.

We also never remove email files within one week after deletion, so that our “Restore” feature can work as advertised.

This is, of course, in addition to the safety provided by replication to an offsite datacentre, and daily backups to a different server running a different operating system.

Immediate response

As soon as we realised we may have to restore emails, we disabled the automated weekend cleanup job, and started collecting data from our servers.

Discoverability

The problem is that it is hard to know that an email is not there unless you actively look for it. We could disable cleanup temporarily, but not forever. Our turnover is about 2% of total email volume per week, so the disks would fill up if we never deleted anything ever again.

We decided the safe way forward was to undo every deletion which had even the slightest chance of being accidental.

That way, if no action is taken, a few extra emails sit on disk gathering dust. It’s possible at any later time to discover them and clean them up. There is no requirement to act quickly.

We take your privacy very seriously. No contents of emails were accessed during this task, and each user’s account was processed separately to ensure there was no risk of disclosing data. You can read more about our privacy policy here: https://www.fastmail.fm/help/overview_privacy.html.

Data collection

We log every single time a message is added to or expunged from any folder on our backend servers. We collected an initial dataset of nearly 30 million “Delete to Trash” events from the log files.

The next step was identifying which of these were a single action involving more than one message from the same conversation. Every message was tagged with a session identifier and timestamp as well as the folder and IMAP “UID” which uniquely identifies it, but we were not logging the CID (conversation identifier). We do now, but that doesn’t help with log lines from the past!

Finding the CID involved writing custom code to read the index file on disk (which still contains the deleted record) and extract the CID field for every deleted message.

Finally, of course, we had to process the logs for every single connection from the web servers over that time frame and find which deletes were related to each other. There’s nothing in the log to show that two deletes came from the same command, so we applied a heuristic of “within 10 seconds” to allow for the worst case of a busy server and large folders being processed.
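
As a rough illustration of that grouping step, here’s a sketch in JavaScript; the field names and data layout are invented, and the real tool worked directly against our server log format:

function findSuspectDeletes( events ) {
    // events: web-interface "Delete to Trash" records, sorted by time,
    // each with a session id, a conversation id (cid) and a unix time.
    var runs = {},
        suspects = [];

    events.forEach( function ( e ) {
        var key = e.session + '/' + e.cid,
            run = runs[ key ];
        if ( run && e.time - run[ run.length - 1 ].time <= 10 ) {
            // Another delete from the same conversation within 10 seconds:
            // extend the run, and record it once it holds more than one.
            run.push( e );
            if ( run.length === 2 ) {
                suspects.push( run );
            }
        } else {
            runs[ key ] = [ e ];
        }
    } );

    return suspects;    // each entry is a group of deletes to restore
}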

Restoring messages

We use the Cyrus IMAP server. One of the utilities included is called ‘unexpunge’, and it can be used to recover deleted emails. This is different from our usual restore command, which extracts messages from various sources and appends them to a new temporary folder.

In this case we want to restore messages permanently, so unexpunge is the right tool, except that we also want to tag every message with a keyword, and we want that to be reliable: finding the messages afterwards would otherwise be messy. We chose to add a new feature to unexpunge that sets a user-defined keyword on each message as it is restored. It is robust, and there’s no gap during which messages appear without the keyword.

The chosen keyword is RESTORED-20121107. Our web interface already supports global keyword search with “flag:$name”, so the email to users includes a pre-generated URL which will perform a global search on all that user’s folders for messages which were restored.

Restores are in progress now. Once they are completed, thousands of users will have some messages restored. This is almost certain to include messages which were intended to be deleted, but we must err on the side of safety here.

We have built a very robust infrastructure because of our strong commitment to data safety. These restores are in line with this commitment. It is easier to delete unwanted messages again than to recover messages which no longer exist.


New login and session management code on beta.fastmail.fm

We’ve just rolled out some new code on our beta server that significantly changes how sessions are managed. This new code reduces some overall session complexity, fixes some long term bugs, and adds some useful new features.

  1. There are now just two main types of sessions: normal & long term
    • normal – these expire after 2 hours of inactivity
    • long term (you check the “Keep me logged in” checkbox on login) – these expire after 30 days of inactivity; for most people on most machines, this is effectively forever

    (Note: The "Keep me logged in" checkbox has been broken for the last few months on the beta server, but now correctly creates a long term login session. Also, the "lightbox" login screen within the new UI now works correctly.)

  2. Logout will explicitly end a session 

    If you want to explicitly end a session, use the "Log out" link at the top right of the page. If you want to keep a session, just close the browser tab/window and when you go back to the beta server, you’ll still be logged in (see below).

  3. You can still log in to multiple different accounts

    We still support the ability to log in to multiple different user accounts at the same time on the same device/browser.

  4. You can access existing logged in sessions from the login screen

    If your device/browser has any existing logged in sessions, we now show those sessions when you go to the login screen. Simply clicking on one of those sessions will send you straight back to that mailbox for that user.

    Although by default the login screen shows existing logged in sessions, clicking the "Log in to another account" link will allow you to log in to another account at the same time.

  5. You can see (and remotely log out) all logged in web sessions on all devices/browsers

    We now track all sessions in our database and allow users to see all these sessions and remotely log out any of them individually.

    Just go to Options/Accounts -> Logged In Sessions to see all sessions in all devices/browsers. Currently only sessions created on http://beta.fastmail.fm can be deleted.

    (Note: Only web sessions are shown. IMAP/POP/XMPP/etc logins are shown on the Options -> Login Log screen)

One observation that some people might make is that with the old system, if you were logged in to your account and then closed your browser window/tab or went to http://beta.fastmail.fm again, it would appear that your existing session had been automatically logged out, which looked like a nice security feature.

In fact that was never the case: the session was not logged out. Simply picking the right URL from your browser history would take you straight back in. There was just no visual indication on the login screen that this existing session was still present in your browser cookies, which is actually quite dangerous. The new system correctly shows any existing sessions on the login screen. If you want to end a session, you must use the "Log out" link at the top right of the page, whether you’re using the new system or the current system still at http://www.fastmail.fm.


Goodbye old.fastmail.fm

Summary

In early 2009 we rolled out an updated web interface to all users. This is the interface you currently see when you log in at http://www.fastmail.fm, as most users do.

To give users time to transition, we continued to let people log in to the old pre-2009 interface if they wanted to by going to http://old.fastmail.fm. We’ve supported this for the last 3 years, but as only a few users were still using this interface, we decided to shut it down. For the last 3 months a prominent message noting this has been shown each time you logged in to http://old.fastmail.fm, and we’ve now fully shut it down.

Important point: This only affects users who were explicitly going to http://old.fastmail.fm to log in. Users who use the regular interface at http://www.fastmail.fm (the vast majority) are completely unaffected.

The description below is a detailed history of the old interface and includes technical details about how much things have changed since 2009 and why maintaining http://old.fastmail.fm is no longer feasible.

Goodbye old.fastmail.fm

It’s been a long road, but the old FastMail web interface has finally reached the end of its life.

You can always access your email at https://www.fastmail.fm/ or try our beta site at https://beta.fastmail.fm/.

If you want to stop reading here, the things you need to know are that there were security concerns and that it was about to break anyway: two good reasons why now is the right time to shut the old interface down.

If you want to know some of the technical background and the technologies that we have moved through over the years, read on!

A new infrastructure

Looking back through our version control history, my very first commit was on 2004-09-20! The original web interface commits are from early 2000, though it was started before then.

We switched version control systems at some point during 2005 from CVS to Subversion, which made branching much easier – but imported all our history, so we can still look back at those early changes.

One of our major branches was a huge infrastructure switch from Redhat 7.3 to Debian 3.1 (sarge), which we worked on throughout the second half of 2005. This was all merged back into the main branch, and we converted everything over in early 2006.

http://blog.fastmail.fm/2006/01/02/one-web-server-now-running-new-infrastructure-code/

We upgraded to Debian 4.0 (etch) during May 2007, soon after it came out.

A new interface

In 2008, Neil Jenkins (who is so awesome that Opera hired him even before they had decided if they were going to buy FastMail) worked as a contractor over the summer to design a more modern web interface which would take advantage of the new features in web browsers.

We branched the code, and it diverged quite considerably. Features like cross-folder searching required major internal data structure changes, and the new interface had hooks all through the code. Our plan was always to retire the old code eventually.

We released the new interface to beta at the end of 2008, and rolled it out to everyone in 2009.

http://blog.fastmail.fm/2008/11/27/help-beta-test-new-web-interface/

http://blog.fastmail.fm/2009/02/17/new-interface-being-rolled-out/

An incompatible upgrade

Then in 2009, Debian 5.0 (lenny) came out. Lenny shipped with apache2 and mod_perl2, and no longer supported apache 1.3 or mod_perl version 1. We put quite a lot of work into porting our codebase forward to apache2. Since "old" was going away soon, we didn’t duplicate the work there.

So we installed the new web servers on lenny, and kept a couple of servers called "oldweb" still running etch. It’s amusing now to remember all the hoops I jumped through to allow automatic installation of either system.

About this time we also had machines with enough memory that 32 bit address spaces were becoming a limitation, particularly on the IMAP servers. We moved to running 64 bit kernels with a 32 bit userland.

New hardware

In 2010, the Opera sale happened. One of the early steps was to replace some of our aging hardware with equipment that was better understood and supported by the Opera sysadmin department. This meant a new bladecentre for the non-storage systems (including web).

For a little while I had two blades (redundancy!) running "oldweb" code. That’s a huge amount of very under-utilised resource.

And, to be honest, managing new blades with an ancient OS was a pain. Things didn’t work well. The configuration tools we built for the new hardware didn’t run on etch.

When we moved to Debian 6.0 (squeeze) and at the same time went fully 64 bit, it was time to do something about "old".

We also moved version control systems AGAIN in late 2010 – from subversion to git. The old web servers were left on subversion, because they weren’t getting much in the way of changes any more. One more "split" in how things were done.

Fully virtual

Rather than having to support "real hardware", I built an etch virtual machine. Everything else was running squeeze 64 bit, but we still had a full 32 bit etch install path just to support oldweb.

While all this was happening, there were occasional changes required to support changing database schemas, configuration mechanisms, and interaction with other parts of our system. At some point I just took a snapshot of the current tree and started a new git repository so we could archive the subversion server entirely.

Maintaining the virtual machines was a real pain though. They were run in the background on some of the web servers to free up the hardware for more demanding tasks. This meant changing the network interfaces to be bonding drivers, custom configuration, lots of pain. There were occasionally long outages as we changed things and then had to patch oldweb to catch up.

Worst of all, we were maintaining the ENTIRE stack – support daemons, log rotation, pop fetching…

Old lost features over time – we just couldn’t keep them working, so we ripped the code out. Particularly some of the more advanced configuration screens – and everything related to billing.

Single component

In the end the virtual machines were too much work. Our authentication system in particular had many changes under the hood, and it just wasn’t going to keep working. We had a couple of really bad problems with file storage, where we were sure that something "couldn’t happen", but then it turned out old was still doing things differently. Talking to the wrong databases, running the wrong queries. We seriously considered dropping old at that point, but I wanted to give it a bit longer.

So I built a chroot installation of etch on our web servers, and bind-mounted the daemon sockets into the chroot. This allowed us to run just the web interface code itself on the old branch, while running everything else in the modern, managed, outside world. I built a custom init script which could set up all the necessary mountpoints (/proc, /dev, /var/run, even the tmpfs with mmaped caches was shared) – and forward-ported more of the code.

This was built with debootstrap originally, but in the end it was getting unreliable even fetching etch packages, so I built a .tar.gz file with the filesystem for the chroot, and a fresh install just unpacked that. As we changed internal config systems, I kept "oldweb" up to date. A couple of commits every month.

So that brings us to today. An init script (apache-oldweb), a chroot environment with a snapshot of a Debian etch machine with apache 1.3 and mod_perl version 1 – running perl 5.8. Everything else is perl 5.10 or newer, so I even have to backport some idioms as I bring back the bits which it just can’t live without.

I have done basically all the "keeping old alive" for the past couple of years – for a smaller and smaller set of users who still log in there. Backporting everyone else’s changes as they impacted old.

And etch doesn’t have security support. Hasn’t for ages. Sure it’s in a chroot, but it still has access to everything.

The final straw

But there’s one thing which oldweb can’t survive. We are redesigning how our session management works. There are some great benefits – bookmarkable URLs, remote logout of stale sessions, reduction of password typing on annoying little smartphone keyboards.

Everything will change, and old would have just stopped working. It’s not worth the changes to make it work. Particularly with the larger gap between the two systems as time goes on.

Also, and even worse, the old interface is exposed to the wider internet – and it has full read/write access to the database and all emails. If there are security problems, all our users are at risk – not only those who use it directly.

It’s no longer safe, and it was going to break beyond easy repair in a few days anyway. It is time.

Goodbye oldweb


One step forward, two steps back

It’s been a really bad week for me: backing out two significant pieces of work. One was only released recently, but the other had been causing problems for an entire year, and I’m really sorry to those who’ve been sitting through them while we didn’t have the effort available to find the underlying cause.

PowerDNS

First, the recent one. We switched our DNS servers from tinydns to powerdns last week. There were very good reasons for the switch: tinydns as-is doesn’t support IPv6, or DNSSEC, or zone transfers, or…

And the data file is built as a single giant database and synchronised to all our servers once per hour, so updates take some time to be made.

On the flip side – it’s rock solid! It’s served us well for years. So we put a lot of work into testing PowerDNS for the change. Unfortunately, it wasn’t enough. First SOA records were broken for subdomains, then DNS delegation didn’t work, and now that I’ve switched back, a problem with chat server aggregation has gone away, so it was probably doing the wrong thing there too!

Anyway – powerdns got backed out. The “pipe” backend that we were using just isn’t expressive enough, so we need to either find another way to do it or find a different path forward. The good thing about PowerDNS is that it’s actively maintained, so we should be able to get somewhere here.

EJabberd

It’s much sadder to give up EJabberd. Erlang is an interesting language, and the integration work was done by an intern last year – he did really good work. The hard bit was that we needed support for the many thousands of domains that our customers host with us. Ejabberd 2.x (the stable branch) just didn’t support it. Ejabberd 3.x was going to, but was still in alpha at the time. Looking at the development pace, I made the call to integrate with Ejabberd 3.x anyway. We did that.

But it’s been plagued with problems. The chat logging service has been flaky, and there have been “Malformed XML” disconnections which I suspect are related to incorrect SSL renegotiations, but I haven’t been able to prove it. I’ve spent far too much time looking at packet logs and trying to figure it out.

I’ve had long-standing tickets about it, and kept saying “it’s getting better” – but seriously, upstream hasn’t made a single commit to ejabberd mainline since February this year. They’re putting all their effort into the 2.x branch.

So I’m in the process of porting our chat service back to the DJabberd engine we used to use. It’s not perfect either – it doesn’t have anywhere near the feature set that ejabberd has, and it’s not getting any more support. The code is of OK quality, but it’s quite convoluted and written in many different styles, which makes reading it tricky. I’ve had to make two patches to get interoperability up to scratch with modern servers and to support the multiple SSL certificates we now use.

It’s always sad to give up features, and to sideline hard work that you or others have done – but in the end we have been hurting customers by providing a sub-standard experience with chat.  So I’m hoping to put a line under that by the end of this week and be able to move on with good new things again.  At least a couple of us have some more Erlang experience now, and you never know when that might be useful.  It’s good just to understand different ways of thinking about code.


*.fastmail.fm certificate updated

The SSL certificate for *.fastmail.fm (that is, http://www.fastmail.fm and all other subdomains) was due to expire in a couple of months, so this morning we updated it to a new one.

Because it’s not something we do often, we don’t have a nice little script automating the task. As a result, I forgot to properly add the CA chain to the certificate, so there was a period of about 30 minutes where some users with old browsers might have seen “invalid certificate” errors. That was fixed pretty quickly once we noticed, and everything should be fine now.

The next certificate to expire has a few months left on it. I’ll make sure this is nicely automated before we have to update that one!


Chat server updated

We’ve had a few persistent problems with our XMPP chat server since upgrading to Ejabberd (a 3.0 pre-release with patches) last year.

  1. “Malformed XML” errors.  We suspect these were due to bugs with handling SSL.
  2. Missing IP address and SSL information in the LoginLog.
  3. The ChatLog backend died, and chat logging stopped.

We have our awesome intern Samuel back again this year, and though he’s working with a different department this time, I managed to grab some of his time to work with me on fixing these long-standing issues!

We think we’ve fixed number 1 – at least it hasn’t been seen since updating the upstream libraries.

We’re positive we’ve fixed number 2, by making sure the correct information is passed directly to the authentication functions rather than looking for it in the session object (which isn’t populated in time to be useful).

And as for number 3 – a restart fixed it.  I’m now monitoring the log files to look for what might cause it to die.  Unfortunately, by the time we looked at it the logs had already rotated away, so there was no evidence left!  I’ll make sure we catch the problem next time, if it happens again.

TL;DR: the chat server is updated and should be more stable and give better information than before.

 
