Improvements after outage last week

This is a technical post. Regular Fastmail users subscribed to receive email updates from the Fastmail blog can just ignore this post.

Last week we had an outage that affected all users and lasted about 1 hour. This is one of the worst outages we’ve had in the last 4 years. Our overall reliability over that period can be put down to our redundant slots & stores architecture and to using a very reliable hosting provider (NYI).

The outage last week was a sequence of events caused by a recent internal change. We changed over our internal DNS server to slave off Opera’s servers to allow better internal DNS integration. Unfortunately we were only part way through that process, and we had only set up one internal server. It’s our general policy that everything we set up these days must be replicated between at least two servers. We had intended to do that, but hadn’t got around to it.

That internal DNS server was also running on the server that’s our primary database server. Unfortunately that server crashed with a kernel panic. Normally we’d just fail everything over to our replica database server, but because the internal DNS server was also down, all our tools which expected to be able to resolve internal domain names also failed, and we weren’t able to fail over easily. Also because the internal DNS was down, we weren’t easily able to access the remote management module (RMM) of the server to reboot it, and had to go through the NYI ticket system, which always takes a bit longer.

The net result is that something we should have detected within a few minutes, and easily failed over with our failover tools, took almost an hour to resolve.

We’ve now set up the internal DNS servers to be part of our standard redundant setup. We’ve also set up consistent naming and IP addresses for all our RMM modules so that they’ll be easier to access, and even if there are DNS problems, we’ll be able to access them via IP.

We can’t stop servers crashing, but we aim to have every service redundant so that if any server fails, we can fail over to a replica within a short amount of time, either automatically where possible, or manually where we think it’s better to have some human intervention first.

Overall, I believe that our continuous attempts to improve reliability have been working very well, and we always aim to learn from any problems and do better.

Update 6/Oct: I’ve posted some additional information to this forum thread.

Posted in Technical. Comments Off

Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers

This is a technical post. Regular Fastmail users subscribed to receive email updates from the Fastmail blog can just ignore this post.

So over the last couple of weeks we noticed that our new IMAP servers with 48G of RAM haven’t been performing as well as expected, and there were some oddities. Namely two things stuck out:

  1. There was free memory. There’s 20T of data on these machines. The kernel should have used lots of memory for caching, but for some reason, it wasn’t. cache ~ 2G, buffers ~ 25G, unused ~ 5G
  2. The machines have SSDs for very hot data. In total, there’s about 16G of data on the SSDs. Almost all of that 16G of data should end up being cached, so there should be little reading from the SSDs at all. Instead we saw at peak times 2k+ blocks read/s from the SSDs. Again a sign that caching wasn’t working.

After doing some searching, we found this thread in the Linux kernel mailing list.

http://lkml.org/lkml/2009/5/12/586

It appears that patch never went anywhere, and zone_reclaim_mode is still defaulting to 1 on our pretty standard file/email/web server type machine with a NUMA kernel.

By changing it to 0, we saw an immediate massive change in caching behaviour. Now cache ~ 27G, buffers ~ 7G and unused ~ 0.2G, and IO reads from the SSD dropped to 100/s instead of 2000/s.

So if you’re using newer AMD/Intel processors with a NUMA kernel in a web server/file server/email server setup, you should make sure you set /proc/sys/vm/zone_reclaim_mode to 0. I’ve posted to the LKML about this, but haven’t heard anything, so I have no idea if anyone regards this default value as a bug or not.
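As a rough sketch of the check described above, the following Python reads the tunable and writes 0 to it if needed. Only the /proc path comes from this post; the function names are made up for illustration, and writing the file requires root.

```python
from pathlib import Path

# Path taken from the post; the helper names below are illustrative only.
ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")


def read_zone_reclaim_mode(path: Path = ZONE_RECLAIM_PATH) -> int:
    """Return the current zone_reclaim_mode value as an integer."""
    return int(path.read_text().strip())


def disable_zone_reclaim(path: Path = ZONE_RECLAIM_PATH) -> bool:
    """Write 0 if the current value is non-zero. Returns True if it changed."""
    if read_zone_reclaim_mode(path) == 0:
        return False
    path.write_text("0\n")
    return True


if __name__ == "__main__":
    # Only report the value when run directly; changing it is left to the caller.
    if ZONE_RECLAIM_PATH.exists():
        print("zone_reclaim_mode =", read_zone_reclaim_mode())
```

A value written this way only lasts until the next reboot, so it would also need to go into your sysctl configuration (vm.zone_reclaim_mode = 0) to persist.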

Posted in Technical. Comments Off

HTML emails – from bad to worse

This is a technical post. Regular Fastmail users subscribed to receive email updates from the Fastmail blog can just ignore this post.

Originally email was designed as a text only medium (1982). Over time, various extensions were added to allow transporting attachments (1993), and for different content formats, such as HTML (1997).

HTML has become a very popular way of delivering richer email content. HTML has many tools and a large infrastructure, and is easy to display in web browsers, because that’s what they’re actually designed to do.

The problem is that HTML is a markup language, but users generally use WYSIWYG type tools to edit the content of their messages, and those tools then output HTML. Unfortunately the HTML they output is of variable quality. To make things even worse, most email reading software has limited HTML & CSS display capabilities, so senders can’t actually rely on the full range of HTML or CSS to be available. The result is that most HTML email still uses the same type of HTML we were using in 1999, with deeply nested tables and explicit attributes on each tag to lay out the email content.

As a web mail provider, we have to deal with all the variable HTML content, and try and display it correctly. I’ve seen numerous odd examples over the years, from emails that use absolute positioning and fixed width and height on every single element to lay out everything in a neat grid (which breaks horribly if you change the font size at all), to the messy conditional comment HTML that Microsoft Word generates.

However recently I’ve had a few extreme examples of badly generated HTML arrive, and in each case it’s been from Mac Mail (specific header “X-Mailer: Apple Mail (2.936)”). I’ve removed the content, added some newlines between tags, and put an example here. That looks pretty ordinary, nothing funny. Now look at the HTML that generated it (I’ve put the HTML as text with appropriate indenting here to make it easier to follow). That’s 330k in size, and 5407 lines of almost entirely HTML tags. To get to the initial piece of text content, you go through 741 nested tags! Worse, a lot of that is nested inline and block tags alternating one after another, which is technically invalid HTML, and really annoys our HTML tidy code that tries to fix it up.
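To get a feel for how deep the nesting in a message is, a small script along these lines can help. This is only a sketch using Python's standard html.parser, not the tidy code we run; the class and function names are invented for the example, and because it just counts open and close tags, the numbers are approximate for badly mismatched markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not add to the depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class DepthCounter(HTMLParser):
    """Track current nesting depth, the maximum seen, and depth at first text."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.depth_at_first_text = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Never go below zero, even if the markup has stray closing tags.
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.depth_at_first_text is None and data.strip():
            self.depth_at_first_text = self.depth


def nesting_stats(html):
    """Return (maximum nesting depth, nesting depth of first visible text)."""
    parser = DepthCounter()
    parser.feed(html)
    parser.close()
    return parser.max_depth, parser.depth_at_first_text
```

Running nesting_stats over the raw HTML part of a message gives a quick indication of whether it is likely to be one of the pathological cases.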

I’ll be working to try and fix this, but at the moment, really bad emails like this can cause extremely slow display on some browsers when viewed via the webmail interface.

Posted in Technical. Comments Off

SCSI HBAs, RAID controllers and timeouts

This is a technical post. Regular Fastmail users subscribed to receive email updates from the Fastmail blog can just ignore this post.

For the last few years, most of the IMAP servers we’ve bought have followed the same hardware format. A 1U server with an LSI SCSI or SAS controller, connected to two external RAID storage units. The RAID storage units use an ARECA controller and present the internal SATA/SAS disks as SCSI/SAS volumes. This setup has worked really well and generally been very solid.

However after recently upgrading the hard drives in one of our RAID storage boxes, we started experiencing some annoying kernel errors. Under high IO load as we synced new data to them, we’d end up seeing something like this in the kernel log.

[ 1378.310010] mptscsih: ioc1: attempting task abort! (sc=ffff88083cfa6000)
[ 1378.310091] sd 2:0:0:0: [sdj] CDB: Read(10): 28 00 0d 18 ad 2d 00 00 02 00
[ 1378.682660] mptscsih: ioc1: task abort: SUCCESS (sc=ffff88083cfa6000)

These would usually be repeated many times, and sometimes we’d see things like this after the above messages.

[ 1400.805969] Errataon LSI53C1030 occurred.sc->req_bufflen=0x1000,xfer_cnt=0x400
[ 1400.827927] mptbase: ioc1: LogInfo(0x11070000): F/W: DMA Error
[ 1401.090516] mptbase: ioc1: LogInfo(0x11070000): F/W: DMA Error

Simultaneously, the RAID controller would report in its log:

2010-08-16 08:24:50 Host Channel 0 SCSI Bus Reset

And there would often be some corruption of any data that was being written at the time.

We’d seen a problem like this before when we’d bought new hard drives, but after upgrading the firmware in the hard drives, it had gone away. Unfortunately in this case, the new hard drives already had the latest firmware, so that wasn’t something that would help.

We tried a number of things. Downgrading the SCSI bus speed to 80 MB/s. Using the latest version of the LSI driver from their website (4.22) rather than the version that comes in the vanilla Linux kernel (3.04.14). Reducing the SCSI queue depth on the LSI card from 64 to 16. Upgrading the RAID controller firmware to the very latest version. None of these things helped. In each case, with high IO load, within 10 minutes we could cause the error to occur.

My final thought was that maybe it’s timeout related. With SCSI, the HBA can queue a lot of requests to be completed out of order. So if you shove a lot of IOPs to the RAID unit (so many that the write back cache fills up) maybe the internal scheduler in the RAID controller is interacting with the TCQ in the hard drives in some way badly, and some of the requests end up taking a long time to complete. Then the HBA has some timeout amount, and if a request takes longer than that, it assumes something has gone wrong and then tries to cancel everything that’s outstanding and reset the bus.

In Linux, you can control the timeout for each SCSI target device (eg a RAID volumeset in our case) via a tunable in /sys/.

/sys/block/sd*/device/timeout

The default value for the timeout on these LSI cards is 30 seconds. I increased it to 300 seconds on all targets, and we started the IO storm again.
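For reference, here is a minimal sketch of applying that change to every sd* device in one go. The sysfs path is the one shown above; the function name and the sys_block parameter (there so it can be pointed at a test directory) are just for illustration. It needs root, and the values do not survive a reboot.

```python
import glob
import os


def set_scsi_timeouts(seconds, sys_block="/sys/block"):
    """Write `seconds` to device/timeout for every sd* device.

    Returns a dict mapping each tunable's path to its previous value,
    so the change can be reverted later.
    """
    previous = {}
    pattern = os.path.join(sys_block, "sd*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            previous[path] = int(fh.read().strip())
        with open(path, "w") as fh:
            fh.write(f"{seconds}\n")
    return previous


if __name__ == "__main__":
    # 300 seconds is the value used in this post.
    for path, old in set_scsi_timeouts(300).items():
        print(f"{path}: {old} -> 300")
```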

Normally we’d see problems within 10 minutes. We let this run for 24 hours and not a problem!

Not 100% conclusive proof, but it’s looking pretty likely that that’s the culprit. So my assumption is that the LSI card has a 30 second default timeout, and the RAID unit under heavy IO load can take longer than 30 seconds to respond to some queued requests. It would explain why the problem only occurs under heavy load and when the write back cache gets filled up.

Hopefully this helps someone else if they encounter this problem one day.

Additional: So even with these changes, one of the things we noticed was that a high IO load to one RAID volume (eg. in our case, moving users around) can severely affect the performance of other RAID volumes. The issue is related to the way each SCSI HBA has a queue depth it can manage, but in the kernel, each mounted volume has its own outstanding request queue. When the number of volumes is large, the sum of requests in the volume queues can be much larger than the HBA queue, causing poor response times as lots of processes block on IO. On our systems with a large number of volumes, reducing the per-volume queue depth (/sys/block/sd*/device/queue_depth) from the default of 64 to 16 resulted in much more even performance. Other reading.
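The arithmetic behind this is simple enough to sketch. Note that the volume count (20) and HBA queue depth (256) below are hypothetical numbers picked purely for illustration, not measurements from our systems; only the per-volume depths of 64 and 16 come from the text above.

```python
def oversubscription(num_volumes, per_volume_depth, hba_queue_depth):
    """Ratio of requests the volume queues can hold to what the HBA can queue.

    A ratio well above 1 means requests can pile up waiting for an HBA slot.
    """
    return (num_volumes * per_volume_depth) / hba_queue_depth


# Hypothetical machine: 20 volumes behind an HBA that can queue 256 requests.
print(oversubscription(20, 64, 256))  # per-volume depth 64
print(oversubscription(20, 16, 256))  # per-volume depth 16
```

With these made-up figures, the default depth lets the volume queues hold several times what the HBA can accept, while a depth of 16 brings the two much closer together.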

Posted in Technical. Comments Off

New POP/IMAP server version

Over the last 24 hours, we’ve rolled out a new POP/IMAP server version for all users. This new server is the result of months of great work by Bron and includes many improvements and fixes. Few of the fixes are user-visible right now, but they are significant internal improvements that help reliability, conformance and performance, and will allow us to build some future features we’re looking at.

  • Email replication improvements

    Email replication has been made much more efficient and reliable. The format includes CRC auto-integrity checking features, so that any unexpected mismatches between both ends are automatically detected and fixed. It can also recover automatically from unclean shutdowns or machine crashes where “split brain” has occurred, automatically fixing up mailboxes and messages. The format has also been made future extensible, allowing more features to be added without compatibility problems.

  • Performance and integrity improvements

    The internal mailbox format used to store emails has been significantly reworked. The new format has reliable locking semantics to remove all race conditions. It also stores and checks CRCs on all record data and cache data, and SHA1 checks on all message files. This ensures that any corruption in any data is detected early and can be dealt with. By moving around some of the data (such as the user seen state), and only lazily opening files as needed, the new format also improves performance in many common cases.

  • Strict MODSEQ, QRESYNC support and full IMAP test suite conformance

    Recent extensions to IMAP allow clients to more quickly synchronise data between the server and the client (eg. CONDSTORE/MODSEQ and QRESYNC). While the server has supported CONDSTORE/MODSEQ for a while, unfortunately it was a bit buggy in some situations, causing message seen state to get out of sync. The server now correctly and accurately supports CONDSTORE/MODSEQ, and also supports the current QRESYNC standard that will allow clients that support it to sync even faster. We also now correctly pass detailed IMAP stress tests.

  • Major code cleanups

    All of these improvements have also been done with major internal code cleanups. This will allow us to continue building additional functionality and features more easily in the future, and to more easily fix and debug any other issues that are encountered.
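To illustrate the kind of integrity checking mentioned in the list above (CRCs on record data, SHA1 on message files), here is a small Python sketch. It is not the actual on-disk format; the record layout (payload followed by a 4-byte big-endian CRC32) and the function names are assumptions made up for this example.

```python
import hashlib
import zlib


def record_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 to a record (illustrative layout)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def check_record(record: bytes) -> bool:
    """Recompute the CRC32 over the payload and compare with the stored one."""
    if len(record) < 4:
        return False
    payload = record[:-4]
    stored = int.from_bytes(record[-4:], "big")
    return (zlib.crc32(payload) & 0xFFFFFFFF) == stored


def message_digest(message: bytes) -> str:
    """SHA1 of a whole message file, as a hex string."""
    return hashlib.sha1(message).hexdigest()
```

Storing the digest when a message is written and comparing it on later reads means a flipped bit shows up as a mismatch rather than silently corrupt mail.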

Unfortunately no good deed goes unpunished. Even though we’ve been testing this code ourselves and on a sub-set of users for weeks with continuous improvements, some bugs did get through when we finally rolled out to all users. Then in the attempt to fix these issues as quickly as possible, we also introduced some other issues. The net result was that for about 12 hours, there was a sequence of small but potentially annoying bugs that would have affected different sets of users.

  • On first access, we upgrade a mailbox to the new format. During the upgrade, we found some existing caches had allowed invalid data to enter them, causing corruptions on upgrading which caused problems when accessing these mailboxes. These cases are now caught and new cache data is built from the underlying message files
  • While reconstructing the mailboxes that had been incorrectly upgraded by the above code, a quota error caused some people’s quota to temporarily be double their actual used amount. This has been fixed now. If this bug sent a user over quota temporarily, it shouldn’t be a problem. When a user is over quota, we return a temporary 4xx error, which means no messages should have been lost; the other side should just have re-delivered when they were back under quota.
  • IMAP IDLE wasn’t returning new messages, only updating existing messages, causing pushing of new messages to most email clients to not work
  • Mail App has a bug with parsing IMAP IDLE unsolicited fetch responses that contain more than flags information. We’ve added a workaround for this Mail App bug
  • The IMAP COPYUID response was producing a non-conformant result, which caused some programs to report an error (Outlook 2010)
  • POP3 was using an optimised mode if a mailbox was empty. Unfortunately the code to mark a mailbox as “non empty” wasn’t working properly when messages were delivered, but was working for IMAP logins. This meant that messages delivered wouldn’t be downloaded by POP until you did an IMAP or web login
  • The POP3 TOP command wasn’t working, causing some programs (Outlook in POP mode) that download email headers to fail
  • The POP3 UIDL command with a message ID was producing a non-conformant result, which was parsed incorrectly by some programs. This caused some POP programs to download the same message more than once, or to delete off the server before it should have
  • Update: An update to UID sequence handling caused the mailbox status command to report unread messages as read and vice-versa, causing the unread count on folders to actually be the read count for a short while.
  • Update: The XLIST extension wasn’t working. This has been added back, so clients that support it will automatically pick the right Sent Items, Drafts, Trash, Junk Mail folders when setting up a new account
  • Update: NOOP on Mac Mail. Like the bug above with Mac Mail and IDLE, this was affecting the NOOP command as well
  • Update: Storing the \Seen flag + another flag on a message that already had the \Seen flag would cause \Seen to actually disappear. This mostly manifested as when deleting a message, it would cause it to become marked as “unread” again

All these issues have now been fixed, and we’re closely monitoring all the server logs to see if there are any other issues, but at this stage we believe that the new server and code is working correctly for all cases we’re aware of and for all clients, IMAP and POP.

All this new code is part of the open source project cyrus, and we’ll be pushing this code back to the main cyrus code base, which will eventually form the basis for a new cyrus version 2.4. For those interested in technical details, Bron will post to the cyrus mailing lists when he’s had a bit of time to compile all the documentation and technical details.

    Posted in News, Technical. Comments Off

    Updated “Migrate IMAP” on beta server

    We’ve made some updates to the Options –> Migrate IMAP feature that are currently on the beta server. Certain edge cases that were causing problems have been fixed up, and there’s now a “No duplicates” option as well. When enabled, as each folder is migrated, the migration code will first check if the folder exists locally, and if so, retrieve a list of Message-Id headers. It will then not download any remote message with the same Message-Id. This can be useful for avoiding large numbers of duplicate emails being downloaded if for some reason a migration only partially completes, or you already have some messages downloaded from a remote server.

    Please email me (robm@fastmail.fm) with feedback if you’re able to test this feature. Assuming there are no issues, this will be rolled out to production soon.
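The core of the “No duplicates” check can be sketched in a few lines of Python. This is an illustration of the idea only, not the migration code itself: the function name and the (uid, message_id) input shape are invented, and the choice to always download messages that have no Message-Id header is an assumption.

```python
def messages_to_download(local_message_ids, remote_messages):
    """Pick the remote messages whose Message-Id is not already present locally.

    local_message_ids: iterable of Message-Id strings found in the local folder.
    remote_messages:   iterable of (uid, message_id) pairs from the remote
                       folder; message_id may be None if the header is missing.
    Returns the list of remote UIDs to fetch, in their original order.
    """
    seen = set(local_message_ids)
    wanted = []
    for uid, message_id in remote_messages:
        # Assumption: a message without a Message-Id cannot be matched,
        # so it is always downloaded.
        if message_id is not None and message_id in seen:
            continue
        wanted.append(uid)
    return wanted
```

Using a set for the local Message-Ids keeps each lookup cheap even for folders with a very large number of messages.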

    Posted in Technical. Comments Off

    New custom sieve mechanism on beta server

    We’re trialling a new system on our beta server that allows advanced users to more finely customise their email filtering rules.

    The problem with the current approach is that it’s "all" or "nothing". If you start editing the sieve script to make it custom, then you have to keep editing it as a custom script. You can’t have most of it generated by the Define rules screen, with just a few custom rules inserted at the points you want.

    Also, over time we’ve added certain features that have required inserting rules into your generated script. For instance, when you use Distribution Lists and use the "Archive into folder" option, that works by bcc’ing you on every email to a special address. We insert a rule into your sieve rules to match those emails, and file it into the folder you selected. If you’re using a custom script, then that rule isn’t generated, and thus you might end up with emails in your Inbox that you don’t expect to see.

    The way the new system works is that instead of having a "custom script" option, there’s now an "Advanced" tab on the Define Rules screen. On that tab, there’s a single textarea editor. The content of that area is treated specially as "blocks". Each block has a special format, and it allows you to insert the content of the block into various parts of the sieve generation process.

    For more information on how this works, we’ve documented the process on the wiki:

    http://wiki.fastmail.fm/index.php?title=CustomSieve

    Remember all of this is currently very experimental, and subject to significant change based on feedback.

    Posted in Technical. Comments Off

    FastMail and sessions

    FastMail handles sessions slightly differently to most other web services, and coincidentally I’ve had a few questions about this recently, so I thought I’d write a blog post explaining what we do.

    For most web services, the concept of a session is something like this:

    1. A user goes to site http://example.com for the first time in their browser
    2. The server detects that the user has no existing session cookie, so it creates a new session on the server, and sends back a cookie to the browser (simply a string) that uniquely identifies that session
    3. After that, for every page within the example.com domain that the user goes to, the browser sends the cookie back to the server with every request, so the application can get the session data stored on the server

    This process is very important, because it allows the server to keep track of things for each user separately, even though they’re going to the same hostname or url path.

    Because each cookie is usually associated with a domain, that usually limits you to one session per domain. So for instance when you use Google or Yahoo, you can only ever be logged into one account at a time. There are certain things that work around this, like Mozilla Prism, or Chrome Incognito Window, but these work by creating entirely separate cookie jars, so you can log into different accounts at the same time.

    FastMail does things a bit differently. It starts off the same, so when you go to http://www.fastmail.fm, it creates a new session and associated cookie. However when you login to your account, it does two things:

    1. It renames the cookie so the cookie itself has a unique name with a random value (we call this the cookie “salt”)
    2. It redirects you to a new URL that includes the salt value in the URL, and all further URLs as you move around the interface include this salt value

    What this means is that if you go to http://www.fastmail.fm and login, and then open a new tab, and go to http://www.fastmail.fm again, you’ll see the login screen again, and be able to login to the same account or a completely different account, and happily access both accounts in the separate tabs with separate sessions at the same time without them affecting each other!
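To make the idea concrete, here is a toy Python sketch of salted session cookies. It is not our server code: the class, the cookie naming scheme (session_<salt>) and the URL layout (/mail/<salt>/) are all invented for the example. The point is only that the salt in the URL selects which cookie, and therefore which session, a request belongs to.

```python
import secrets


class SessionStore:
    """Toy model: one cookie per login, named after a random salt."""

    def __init__(self):
        self._sessions = {}  # salt -> {"user": ..., "token": ...}

    def login(self, username):
        """Create a session; return (cookie_name, cookie_value, first_url)."""
        salt = secrets.token_hex(8)
        token = secrets.token_hex(16)
        self._sessions[salt] = {"user": username, "token": token}
        # Hypothetical naming: the salt appears in both the cookie name and URL.
        return f"session_{salt}", token, f"/mail/{salt}/"

    def lookup(self, url_salt, cookies):
        """Return the user for the salt in the URL, or None if it doesn't match."""
        session = self._sessions.get(url_salt)
        if session is None:
            return None
        supplied = cookies.get(f"session_{url_salt}")
        if supplied is None:
            return None
        if not secrets.compare_digest(supplied, session["token"]):
            return None
        return session["user"]
```

Because each login gets its own cookie name, one browser cookie jar can hold several sessions at once, and two tabs with different salts in their URLs never collide.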

    On top of that, FastMail tries to store most “state” information (such as which folder you’re in, what message you’re viewing, etc) in each URL rather than on the server in the session. What this means is that in many places, you can happily right-click a link and select “Open in new tab” or “Open in new window”, and although the newly opened tab/window will share the same session with the original tab/window, they won’t interact with each other.

    For instance, if you go to your Inbox and read a message, and then right-click on the “Trash” folder and select “Open in new tab”, you’ll end up with a new tab looking at messages in the Trash folder. You can then switch between the first tab and second tab, moving between folders, reading messages, and neither action will affect what happens in the other tab. (There are a few small situations where this isn’t the case, but these involve rarely used functionality, and most people won’t see them)

    Session oddities

    This difference in approach does result in some things being different to most other websites

    1. If you use the “Long term” login option on the login screen, that doesn’t actually mean you get a long term session (well, you do get an 8 hour session instead of a 2 hour session, but the fact it logs you in automatically forever is separate). Instead a separate cookie is stored on your machine with your username and encrypted password, and each time you go to http://www.fastmail.fm, it creates a new login and new session
    2. Because of browser limits on the number of cookies you can have per site, if you create many logins and many sessions (eg. open many separate tabs and login separately with each one), then we will start to expire old sessions to avoid you running out of cookies and new sessions failing. Once you have more than 10 simultaneous session cookies, we start to expire the oldest ones. This can manifest itself as tabs where you’ve definitely been idle < 2 hours suddenly reporting your session as “expired” on the next link/button you click
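The expiry behaviour in the second point above can be modelled roughly like this. It is a sketch only: the limit of 10 comes from the text, while the function name, the use of an OrderedDict, and treating “oldest” as “created first” are assumptions for the example.

```python
from collections import OrderedDict

MAX_SESSIONS = 10  # limit mentioned in the post


def add_session(sessions, salt, data, limit=MAX_SESSIONS):
    """Add a session and expire the oldest ones once the limit is exceeded.

    `sessions` is an OrderedDict keyed by salt, in creation order.
    Returns the list of salts that were expired by this call.
    """
    sessions[salt] = data
    expired = []
    while len(sessions) > limit:
        # popitem(last=False) removes the entry that was inserted first.
        old_salt, _ = sessions.popitem(last=False)
        expired.append(old_salt)
    return expired
```

In this model, a tab still holding one of the expired salts in its URL would find no matching session on its next request, which is the “expired” message described above.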

    Most of the time users don’t need to know about or worry about any of this, things should “just work”, but there are some edge cases where knowing the underlying cause of things can help explain some odd cases.

    Posted in Technical. Comments Off

    ns2.messagingengine.com IP address change

    A few days ago, we changed the IP address of ns2.messagingengine.com from 64.34.176.20 to 74.52.187.89. In general, this shouldn’t have affected any users because we tell users to point their domains to our name server records ns1.messagingengine.com and ns2.messagingengine.com.

    However we’ve noticed that a few users have slightly esoteric DNS setups where they’ve used the explicit IP address of the old ns2.messagingengine.com server. For those users, you need to update your DNS setup. We highly recommend that if possible, you don’t ever use the explicit IP address, but use appropriate server names or CNAMEs. This allows us to change the IP addresses of ns1 or ns2 without any user changes being required.

    Posted in Technical. Comments Off

    Opportunistic SSL/TLS encryption on outgoing emails

    Early last year, we implemented opportunistic encryption of incoming emails. We’ve now done the same thing for outgoing emails. This means that for email sent via FastMail (either the web interface or SMTP), if the receiving server advertises STARTTLS support in response to an EHLO command, we will try and negotiate a secure connection and send the email over a secure connection. However if the negotiation fails for any reason, we’ll fall back to a standard connection.

    Some extra notes on this:

    • This doesn’t change what happens when sending via SMTP from your email client/software (eg. Outlook, Apple Mail, Thunderbird, etc) to the FastMail servers. Whether that connection is encrypted is controlled by your email client/software and what you configured when you set it up
    • We can’t force a remote server to use encryption. It’s only if the remote server advertises support for encryption that we try and set up a secure channel
    • This is not full end-to-end encryption of the email. The email is only encrypted during transit from our server to the other server, once it’s at their side, it’s stored in whatever form the other side stores emails. For full end-to-end encryption, you need something like PGP
    • The only way to tell if an email has gone over an encrypted connection or not is to look at the headers of the email on the received side. In other words, as a sender, it’s unfortunately not easy to tell, only the receiver can easily tell

    Since this feature is only enabled if the other side advertises support for it, it shouldn’t affect sending to servers that don’t support STARTTLS, and thus shouldn’t affect the deliverability of email in any way.
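The opportunistic logic described in this post can be sketched with Python's standard smtplib. This is not our mail server's implementation, just an illustration of the same steps: EHLO, check for STARTTLS, try to upgrade, and fall back to a fresh plain connection if that fails. The function name and the smtp_factory parameter are invented for the example, and certificate checking is left at the library default since the post doesn't go into it.

```python
import smtplib


def open_opportunistic(host, port=25, smtp_factory=smtplib.SMTP):
    """Connect to a remote MX, using TLS only if it is offered and works.

    Returns (connection, encrypted). `smtp_factory` exists so the function
    can be exercised with a fake class instead of a real network connection.
    """
    conn = smtp_factory(host, port)
    conn.ehlo()
    if not conn.has_extn("starttls"):
        # Remote side doesn't advertise STARTTLS: plain connection.
        return conn, False
    try:
        conn.starttls()
        conn.ehlo()  # Re-issue EHLO over the now encrypted channel.
        return conn, True
    except (smtplib.SMTPException, OSError):
        # Negotiation failed: the old connection may be unusable, so
        # drop it and reconnect without attempting TLS.
        try:
            conn.close()
        except OSError:
            pass
        conn = smtp_factory(host, port)
        conn.ehlo()
        return conn, False
```

Reconnecting on failure, rather than carrying on with the same socket, avoids depending on what state a half-finished TLS handshake left the connection in.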

    Posted in Technical. Comments Off