A story of leaping seconds

I’ve been promising to blog more about some technical things about the FastMail/Opera infrastructure, and the recent leap second fiasco is a good place to point out not only some failures, but also the great things about our system which helped limit the damage.

Being on duty

First of all, it was somewhat random that I was on duty. We’ve recently switched to a more explicit “one person on primary duty” from the previous timezone and time-of-day based rules. Partially so other people knew who to contact, and partially to bring some predictability to when you had to be responsible.

Friday 29th June was my last working day for nearly 3 weeks. My wife is on the other side of the world, and the kids are off school. I was supposed to be handing duty off to one of the Robs in Australia (we have two Robs on the roster – one Rob M and the other Rob N. This doesn’t cause any confusion because we’re really good at spelling). I had made some last minute changes (as you always do before going away) and wanted to keep an eye on them, so I decided to stay on duty.

Still, it was the goodbye party for a colleague that night, so I took the kids along to Friday Beer (one of the awesome things about working for Opera) and had quite a few drinks. Not really drunk, but not 100% sober either. Luckily, our systems take this into account… in theory.

The first crash

Fast forward to 3:30am, and I was sound asleep when two SMSes came in quick succession. Bleary eyed, I stumbled out to the laptop to discover that one of our compute servers had frozen. Nothing on the console, no network connectivity, no response to keystrokes via the KVM. These machines are in New York, and I’m in Oslo, so kicking it in person is out of the question – and techs onsite aren’t going to get anything I don’t.

These are Dell M610 blades. Pretty chunky boxes. We have 6, split between two bladecentres. We can easily handle running on 3, mail delivery just slows by a few milliseconds – so we have each one paired with one in another bladecentre and two IP addresses shared between them. In normal mode, there’s one IP address per compute server. Amongst the benefits of this setup: we can take down an entire bladecentre for maintenance in about 20 minutes with no user-visible outage.

Email delivery is hashed between those 6 machines so that the email for the same user goes to the same machine, giving much better caching of bayes databases and other per-user state. Everything gets pushed back to central servers on change as well, so we can afford to lose a compute server, no big deal.
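
As a rough illustration of the hashing and pairing described above, here is a minimal Python sketch. The server names, the pairing table (apart from compute4 and compute1, which are mentioned as a pair later in this post) and the use of MD5 are assumptions for illustration, not the actual FastMail code.

```python
import hashlib

# Hypothetical names: six compute servers, each paired with one in the
# other bladecentre. Only the compute1/compute4 pair comes from the post.
SERVERS = ["compute1", "compute2", "compute3",
           "compute4", "compute5", "compute6"]
PAIR = {"compute1": "compute4", "compute4": "compute1",
        "compute2": "compute5", "compute5": "compute2",
        "compute3": "compute6", "compute6": "compute3"}


def home_server(user):
    """Map a user to a fixed server so per-user caches stay warm."""
    digest = hashlib.md5(user.encode("utf-8")).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]


def delivery_server(user, dead=frozenset()):
    """Return the server currently handling this user's mail."""
    server = home_server(user)
    if server in dead:
        # The pair machine takes over the dead server's IP address.
        server = PAIR[server]
    return server
```

The same user always lands on the same machine, and losing one machine only moves the users hashed to it onto its pair.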

But – the tool for moving those IP addresses, “utils/FailOver.pl”, didn’t support the ‘-K servername’ parameter, which means “assume that server is dead, and start up the services as if they had been moved off there”. With -K, the tool still tries to do a clean shutdown, but on failure it just keeps going. The idea is to have a tool which can be used while asleep, or drunk, or both…
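
The “try a clean shutdown, but keep going on failure” behaviour can be sketched like this. It is a minimal Python illustration only: stop_service and start_service are hypothetical callables, and the real tool is a Perl script driven by a database.

```python
def fail_over(stop_service, start_service, assume_dead=False):
    """Move a service: stop it on the old host, then start it on the new one.

    With assume_dead=True (the -K behaviour described above) a failed
    shutdown is logged and the move carries on anyway. Without it, the
    failure is raised and nothing is started.
    """
    log = []
    try:
        stop_service()
        log.append("clean shutdown ok")
    except Exception as exc:
        if not assume_dead:
            raise
        log.append("shutdown failed (%s), continuing" % exc)
    start_service()
    log.append("started on new host")
    return log
```

The point of the flag is that the operator states the assumption once, up front, and the tool makes no further demands on a sleepy brain.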

The older “utils/FailOverCyrus.pl”, which is still used just for the backend mail servers, does support -K. It’s been needed before. The eventual goal is to merge these two tools into one, which supports everything. But other things keep being higher priority.

So – no tool. I read the code enough to remind myself how it actually works, and hand wrote SQL statements to perform the necessary magic on the database so that the IP could be started on the correct host.

I could have just bound the IP by hand as well. But I wanted to not confuse anything. Meanwhile I had rebooted the host, and everything looked fine, so I switched back to normal mode.

Then (still at 3:something am) I wrote -K into FailOver.pl. Took a couple of attempts and some testing to get it right, but it was working well before I wrote up the IncidentReport and went back to bed some time after 4.

The IncidentReport is very important – it tells all the OTHER admins what went wrong and what you did to fix it, plus anything interesting you saw along the way. It’s super useful if it’s not you dealing with this issue next time. It included the usage instructions for the new ‘-K’ option.

Incident 2 – another compute server and an imap server

But it was me again – 5:30am, the next blade over failed. Same problem, same no way to figure out what was wrong. At least this time I had a nice tool.

Being 5:30am, I just rebooted that blade and the other one with the same role in the same datacentre, figuring that the thing they had in common was bladecentre2 compute servers.

But while this was happening, I also lost an imap server. That’s a lot more serious. It causes user visible downtime – so it gets blogged.

From the imap server crash, I also got a screen capture, because blanking wasn’t enabled for the VGA attached KVM unit. Old server. This capture potentially implicated ntpd, but at the time I figured it was still well in advance of the leap second, so probably just a coincidence. I wasn’t even sure it was related, since the imap server was an IBM x3550, and the other incidents had been blades. Plus it was still the middle of the night and I was dopey.

Incident 3

Fast forward to 8:30am in the middle of breakfast with my very forgiving kids who had let me sleep in. Another blade, this time in the other blade centre. I took a bit longer over this one – decided the lack of kernel dumps was stupid, and I would get kdump set up. I already had all the bits in place from earlier experiments, so I configured up kdump and rolled it onto all our compute servers. Rob M (see, told you I could spell) came online during this, and we chatted about it and agreed that kdump was a good next step.

I also made sure every machine was running the latest 3.2.21 kernel build.

Then we sat, and waited, and waited. Watched pots never boil.  I also told our discussion forum what was going on, because some of our more technical users are interested.

Incident 4

One of the other IMAP servers, this time in Iceland!

We have a complete live-spare datacentre in Iceland. Eventually it will be a fully operational centre in its own right, but for now it’s running almost 100% in replica mode. There are a handful of internal users with email hosted there to test things. (Indeed, back at the start, the “urgent” change which kept me on call was to how we replicate filestorage backends, so that adding a second copy in Iceland as production rather than backup didn’t lead to vast amounts of data being copied through the VPN to New York and back while I was away.)

So this was a Dell R510 – one of our latest “IMAP toasters”. 12 x 2TB SATA drives for data (in 5 x RAID1 and 2 hot spares) for 20 x 500GB Cyrus data stores, 2 x SSD in RAID1 for super-hot Cyrus metadata, 2 x quad-core processors, 48GB RAM. For about US $12k, these babies can run hundreds of thousands of users comfortably. They’re our current sweet spot for price/performance.
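
For anyone checking the arithmetic on that disk layout, here is the sum written out in Python. The figures are the ones quoted above. The even spread of stores across mirrors is an assumption.

```python
DRIVE_TB = 2.0      # each SATA drive
MIRRORS = 5         # RAID1 pairs
HOT_SPARES = 2
STORES = 20         # Cyrus data stores
STORE_TB = 0.5      # 500GB each

drives = MIRRORS * 2 + HOT_SPARES       # 10 mirrored drives + 2 spares
usable_tb = MIRRORS * DRIVE_TB          # a RAID1 pair holds one drive's worth
allocated_tb = STORES * STORE_TB        # space handed out to data stores
stores_per_mirror = STORES // MIRRORS   # assuming an even spread

print(drives, usable_tb, allocated_tb, stores_per_mirror)
```

So the 12 drives give 10TB of mirrored space, which the 20 stores fill exactly.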

No console of course, and no kdump either. I hadn’t set up the Iceland servers with kdump support yet. Doh.

One thing I did do was create an alias. Most of our really commonly used commands have super short aliases. utils/FailOver.pl is now “fo”. There are plenty of good reasons for this.

Incident 5

One reason is phones. This was Saturday June 30th, and I’m taking the kids to a friend’s cabin for a few days during the holidays. They needed new bathers, and I had promised to take them shopping. Knowing it could be hours until the next crash, I set up the short alias, checked I could use it from my phone, and off we went. I use Port Knocker + Connect Bot from my Android phone to get a secure connection into our production systems while I’m on the move.

So incident 5 was another blade failing – about 5:30pm Oslo time, while I was in the middle of a game shop with the kids. Great I thought, kdump will get it. I ran the “fo” command from my phone, 20 seconds of hunt and peck on a touch keyboard, and waited for the reboot. No reboot. Couldn’t ping it.

Came home and checked – it was frozen. Kdump hadn’t triggered. Great.

As I cooked dinner, I chatted with the sysadmins from other Opera departments on our internal IRC channel. They confirmed similar failures all around the world in our other datacentres. Everything from unmodified Squeeze kernels through to our 3.2.21 kernel. I would have been running something more recent, but the bnx2 driver has been broken with incorrect firmware since then – and guess what the Dell blades contain. Other brands of server crashed too, so it wasn’t just Dell.

Finding a solution

The first thing was to disable console blanking on all our machines. I’ve been wanting to do it for a while, but I down-prioritised it after getting kdump working (so I thought). By now we were really suspicious of ntp and the leap second, but didn’t have “proof”. A screen capture of another crash with ntp listed would be enough for that. I did that everywhere – and then didn’t get another crash!

Another thing I did was post a question to superuser.com. Wrong site – thankfully someone moved it to serverfault.com where it belonged. My fault, I have been aware of these sites for a while, but not really participated. The discussion took a while to start up, but by the time the kids were asleep, it had exploded. Lots of people confirming the issue, and looking for workarounds.  I updated it multiple times as I had more information from other sysadmins and from my own investigations.

Our own Marco had blogged about solutions to the leap second, including smearing, similar to what Google have done.  I admit I didn’t give his warnings the level of attention I should have – not that any of us expected what happened, but he was at least aware of the risk more than some of us cowboys.
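
For readers who haven’t met smearing: the idea is to absorb the extra second gradually rather than in one jump. Below is a minimal Python sketch of a linear smear. The 24 hour window and the linear shape are assumptions for illustration. The scheme in Marco’s post and the one Google used may well differ.

```python
def smear_offset(now, leap_at, window=86400.0):
    """Seconds of the leap second absorbed so far at time `now`.

    now and leap_at are seconds on an unsmeared timeline. The smear runs
    over the `window` seconds leading up to leap_at.
    """
    start = leap_at - window
    if now <= start:
        return 0.0
    if now >= leap_at:
        return 1.0
    return (now - start) / window


def smeared_time(now, leap_at, window=86400.0):
    """Clock reading served to clients: raw time minus the absorbed part."""
    return now - smear_offset(now, leap_at, window)
```

By the time the leap second arrives the served clock is already a full second behind the raw count, so nothing ever has to step.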

Marco was also on IRC, and talked me through using the Perl code from his blog to clear the leapsecond flag from the kernel using adjtimex. I did that, and also prepared the script a bit for more general use and uploaded it to my filestorage space to link into the serverfault question where it could be useful to others.
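
This is not Marco’s script, just a minimal Python sketch of the bit manipulation involved. The constants are the values from the Linux <sys/timex.h> header. The adjtimex call itself (made with modes set to ADJ_STATUS and the new status word, as root) is deliberately left out here.

```python
# Constants from the Linux <sys/timex.h> header.
ADJ_STATUS = 0x0010   # modes bit: "set the status field"
STA_INS = 0x0010      # insert a leap second at the end of the day
STA_DEL = 0x0020      # delete a leap second at the end of the day


def leap_pending(status):
    """True if the kernel status word has a leap second armed."""
    return bool(status & (STA_INS | STA_DEL))


def clear_leap_flags(status):
    """Return the status word with any pending leap second cancelled."""
    return status & ~(STA_INS | STA_DEL)
```

Note that a running ntpd would simply set the flag again, which is why NTP had to be stopped as well.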

By now I was the top trending piece of tech news, and scoring badges and reputation points at a crazy rate. I had killed NTP everywhere on our servers and cleared the flag, so they were marching on oblivious to the leap second – no worries there. So I joined various IRC channels and forums and talked about the issue.

I did stay up until 2am Oslo time to watch the leap second in.  In the end the only thing to die was my laptop’s VPN connection, because of course I hadn’t actually fixed the local end (also running Linux).  There was a moment of excitement before I reconnected and confirmed that our servers were all fine.  10 minutes later, I restarted NTP and went to bed.

The aftermath: corruption

One of the compute server crashes had corrupted a spamassassin data file, enough that spam scanning was broken. It took user reports for us to be aware of it. We have now added a run of ‘spamassassin --lint’ in the startup scripts of our compute servers, so we can’t operate in this broken state again.
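
The shape of that startup check is roughly as follows. This is a Python sketch rather than our actual startup script: the start hook is hypothetical, and it assumes the lint command signals a problem with a non-zero exit status.

```python
import subprocess


def config_is_sane(command=("spamassassin", "--lint")):
    """Run the lint command; True only if it exits with status 0."""
    try:
        result = subprocess.run(list(command),
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except OSError:
        # Binary missing or not executable: treat as a failed check.
        return False
    return result.returncode == 0


def start_scanner(start, command=("spamassassin", "--lint")):
    """Call the hypothetical start() hook only if the lint check passes."""
    if not config_is_sane(command):
        return False
    start()
    return True
```

Refusing to start is noisy, which is the point: a dead scanner gets noticed by monitoring, a silently broken one gets noticed by users.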

We also reinstalled the server.  We reinstall at almost any excuse.  The whole process is totally automated.  The entire set of commands was

# fo -a    (remember the alias from earlier)
# srv all stop
# utils/ReinstallGrub2.pl -r     (this one has no alias)
# reboot

and about 30 minutes later, when I remembered to go check

# c4 (alias for ssh compute4)
# srv all start
# c1 (the pair machine for compute4 is compute1)
# fo -m (-a is “all”, -m is “mastered elsewhere”)

And we were fully back in production with that server.  The ability to fail services off quickly, and reinstall back to production state from bare metal, is a cornerstone of good administration in my opinion.  It’s why we’ve managed to run such a successful service with less than one person’s time dedicated to sysadmin.  Until 2 months ago, the primary sysadmin has been me – and I’ve also rewritten large parts of Cyrus during that time, and worked on other parts of our site too.

The other issue was imap3 – the imap server which crashed right back during incident 2.  After it was up, I failed all Cyrus instances off, so it’s only running replicas right now.  But the backup script goes to the primary location by default.

I saw two backup error messages go by today (while eating my breakfast – so much for being on holiday – errors get sent to our IRC channel and I was still logged in).  They were missing file errors, which never happen.  So I investigated.

Again, we have done a LOT of work on Cyrus over the years (mostly, I have), and one thing has been adding full checksum support to all the file formats in Cyrus.  With the final transition to the twoskip internal database format I wrote earlier this year, the only remaining file that doesn’t have any checksum is the quota file – and we can regenerate that from our billing database (for limit) and the other index files on disk (for usage).

So it was just a matter of running scripts/audit_slots.pl with the right options, and waiting most of the day.  The output contains things like this:

# user.censored1.Junk Mail cyrus.index missing file 87040
# sucessfully fetched missing file from sloti20d3p2

# user.censored2 cyrus.index missing file 18630
# sucessfully fetched missing file from slots3a2p2

The script has enough smarts to detect that files are missing, or even corrupted. The cyrus.index file contains a sha1 of each message file as well as its size (and a crc32 of the record containing THAT data as well), so we can confirm that the file content is undamaged. It can connect to one of the replicas using data from our “production.dat” cluster configuration file, and find the matching copy of the file – check that the sha1 on THAT end is correct, and copy it into place.
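
The core of that check-and-repair logic looks something like the sketch below. It is Python rather than the real audit_slots.pl, and replica_path is a hypothetical local path standing in for the copy fetched from a replica server.

```python
import hashlib
import os
import shutil


def sha1_of(path):
    """Hex sha1 of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def is_intact(path, size, sha1):
    """True if the file exists and matches the recorded size and sha1."""
    return (os.path.isfile(path)
            and os.path.getsize(path) == size
            and sha1_of(path) == sha1)


def audit_file(path, size, sha1, replica_path):
    """Check one message file and repair it from a replica if needed.

    Returns "ok", "repaired" or "unrecoverable".
    """
    if is_intact(path, size, sha1):
        return "ok"
    # Only trust the replica if ITS copy matches the recorded checksum.
    if is_intact(replica_path, size, sha1):
        shutil.copyfile(replica_path, path)
        return "repaired"
    return "unrecoverable"
```

Checking the replica’s copy before using it matters: otherwise a repair could quietly spread the corruption instead of fixing it.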

The backup system knows the internals of Cyrus well. The new Cyrus replication protocol was built after making our backup tool, and similarly checks the checksums of data it receives over the wire. At every level, our systems are designed to detect corruption and complain loudly rather than blindly replicating the corruption to other locations. We know that RAID1 is no protection, not with the amount of data we have. It’s very rare, but with enough disks, very rare means a few times a month. So paranoia is good.

Summary

All these layers of protection, redundancy, and tooling mean that with very little work, even while not totally awake, the entire impact on our users was an ~15 minute outage for the few users who were primary on imap3 (less than 5% of our userbase) – plus delays of 5-10 minutes on incoming email to 16.7% of users each time a compute server crashed.  While we didn’t do quite as well as Google, we did OK!  The spam scanning issue lasted a lot longer, but at least it’s fully fixed for the future.

We had corruption, but we have the tools to detect and recover, and replicas of all the data to repair from.

Posted in Technical. Comments Off

Changes to web interface Address Book rolled out

We’ve just rolled out a few changes to the address book available in the web interface. These changes are based on some analysis we did of how people are using the address book.

  • Remove the "Description" field on address locations

    We found that very few people used this field (the vast majority were blank), and for those that did put something in here, it was usually duplicate information (either just the string "Home", "Work" or "Other", and just duplicating the selected address type) or confused information (duplicating the first line of the address itself). So we’ve removed this, and in the few cases it appeared to be used, we’ve moved the information into the first line of the address itself.

  • Remove all custom fields

    Few people were using custom fields, and in the majority of cases people were actually putting data in here that should have been in another location. Most custom fields were some form of phone number and those should clearly be in the phone contacts section. The likely reason for this happening is because the previous interface didn’t make it obvious where to put phone numbers, which we’ve now also made clearer (see below). The other use of custom fields was for new services like Skype and Twitter. We’ve added new contact types for those services.

    Any existing custom fields have been moved to the appropriate phone/email/online contact type, or where we couldn’t identify an appropriate type, we’ve moved the data into the Notes section.

  • Add new contact types for Skype and Twitter

    Apart from the phone types, these were clearly the most used custom field types, so we’ve added these as explicit online contact types.

  • Split the old Contacts section into 3 separate sections: Email, Phone and Online contacts

    Because we’ve always allowed an arbitrary number of "contacts", there was a single Contacts section where you could select the contact type you wanted to add: Email, Phone, Web, Instant Messenger, etc. However, because the type was selected via a pop up menu which defaulted to "Email", it wasn’t actually obvious that you used the same section to add phone numbers, web addresses, etc.

    So we’ve now split this into three separate sections for "Email", "Phone" and "Online" contact types.

Based on our analysis, we believe these changes make the address book easier to use and better match the data people actually want to see and store, while removing unneeded, rarely used, complex or difficult to understand features.

Posted in News. Comments Off

Changes to webmail login

TL;DR: We’re now making all connections to the Fastmail web interface immediately redirect to a secure (https) connection.

As part of our commitment to making all connections between users’ computers and our servers secure and encrypted, we’ve just made some changes to our webmail login page. In most cases, users won’t notice any change because we made Secure Login the default almost a year ago. The new changes will only affect the small number of users that have special login requirements.

The main change we’re making is that where previously we would redirect from an insecure (http) to secure (https) connection during login, or on returning to Fastmail on a computer you’d logged in via before, we will now redirect to the secure login screen immediately when you connect to Fastmail. That is, as soon as you go to http://www.fastmail.fm (insecure) or http://www.sent.com (insecure), we’ll always redirect to https://www.fastmail.fm (secure).

Going to a secure (https) address on a domain that isn’t supported (e.g. https://www.sent.com, which is a secure connection but will report a certificate error) will redirect to https://www.fastmail.fm as well.

Businesses and families that use their own domain for logging in (e.g. http://mail.digitalintegrity.com) will also be redirected to https://www.fastmail.fm, but we will continue to correctly show the family/business login screen.

There are a couple of additional exceptions to this.

The mobile UI domains that start with the http://m. prefix like http://m.fastmail.fm (insecure) and http://m.sent.com (insecure) will redirect to https://m.fastmail.fm (secure). This will always show the mobile login screen and mobile interface when you log in.

The special "sticky ssl" domains that start with the https://ssl. prefix like https://ssl.fastmail.fm (secure) and https://ssl.sent.com (secure, but certificate warning) will "stick" to that domain. This may be useful as a workaround for some proxies that block hostnames with the word "mail" in them.

If for some reason you need to use an insecure login (which we highly recommend you do not do), you will explicitly need to go to the URL http://insecure.fastmail.fm. If you use this to log in, data sent between your computer and our server will travel unencrypted over the Internet. This service is only provided for dire circumstances, is highly discouraged, and may be removed in the future.

For the curious, here’s a list of all the transitions that should happen. The "(W)" means you’ll see a certificate warning about mismatched hostnames.

https://www.fastmail.fm               -> stays at https://www.fastmail.fm
http://fastmail.fm                    -> https://www.fastmail.fm
http://sent.com                       -> https://www.fastmail.fm/?domain=sent.com
http://www.fastmail.fm                -> https://www.fastmail.fm
http://www.sent.com                   -> https://www.fastmail.fm/?domain=sent.com
https://fastmail.fm                   -> https://www.fastmail.fm
https://sent.com (W)                  -> https://www.fastmail.fm/?domain=sent.com

http://mail.digitalintegrity.com      -> https://www.fastmail.fm/?domain=digitalintegrity.com
https://mail.digitalintegrity.com (W) -> https://www.fastmail.fm/?domain=digitalintegrity.com

http://m.fastmail.fm                  -> https://m.fastmail.fm
http://m.sent.com                     -> https://m.fastmail.fm/?domain=sent.com
https://m.fastmail.fm                 -> stays at https://m.fastmail.fm
https://m.sent.com (W)                -> https://m.fastmail.fm/?domain=sent.com

http://ssl.fastmail.fm                -> https://ssl.fastmail.fm
http://ssl.sent.com                   -> https://ssl.sent.com/ (W)
https://ssl.fastmail.fm               -> stays at https://ssl.fastmail.fm
https://ssl.sent.com (W)              -> stays at https://ssl.sent.com/ (W)

http://insecure.fastmail.fm           -> stays at http://insecure.fastmail.fm
http://insecure.sent.com              -> stays at http://insecure.sent.com
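
The table above can also be read as a small set of rules. Here is a minimal Python sketch of them, inferred purely from the listed transitions. It is not our actual server code: the treatment of leading www. and mail. labels is an assumption, and trailing slashes are ignored.

```python
from urllib.parse import urlsplit

PRIMARY = "fastmail.fm"


def redirect_target(url):
    """Return the URL a login request ends up at, per the table above."""
    host = urlsplit(url).hostname or ""
    label, _, rest = host.partition(".")

    if label == "insecure":
        return "http://" + host      # explicit opt-out, stays insecure
    if label == "ssl":
        return "https://" + host     # "sticky ssl" domains keep their host
    if label == "m":
        base, domain = "https://m." + PRIMARY, rest
    else:
        base = "https://www." + PRIMARY
        # Assumption: a leading www. or mail. label is not part of the domain.
        domain = rest if label in ("www", "mail") else host
    if domain == PRIMARY:
        return base
    return base + "/?domain=" + domain
```

For example, redirect_target("http://m.sent.com") gives https://m.fastmail.fm/?domain=sent.com, matching the table.
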
Posted in News, Uncategorized. Comments Off

New domain: fastmail.nl

We’ve just added fastmail.nl (.nl is the TLD for the Netherlands) to the list of our available domains. That means you can now sign up for an account or create an alias on the Options –> Aliases screen (subject to your account service level) in this domain.

Along with our primary domain fastmail.fm, this adds to our existing list of available “fastmail” TLDs.

fastmail.cn
fastmail.co.uk
fastmail.com.au
fastmail.es
fastmail.in
fastmail.jp
fastmail.net
fastmail.to
fastmail.us

Posted in News. Comments Off

Enforcing SSL/TLS encryption of all connections

Users regularly tell us how important the security and privacy of their email account is. Sometimes because of how their email software was initially configured, users don’t realise that their username and password are being sent over the Internet unencrypted, which is often a genuine surprise and concern.

Because of this, we have decided to enforce that all communication between a user’s email software and our servers is encrypted, ensuring that no one can eavesdrop on your username or password to steal your login credentials.

If we detect that you are currently using an insecure (non-SSL/TLS) connection to send or receive email, we will send you a notification directing you to this page which explains how to fix your email software. You will keep receiving this message until you have successfully fixed your configuration.

We will be rolling out these changes over the next few weeks and will give people until the end of June to change their software. We believe these changes are in the best interests of all users and reflect current best practice on the Internet.

Posted in News. Comments Off

Understanding SSL vs TLS vs STARTTLS

There’s often quite a bit of confusion about the different terms SSL vs TLS vs STARTTLS. To help explain the differences and a bit of the history behind these terms (especially with regard to email protocols), I’ve put together a help page that I hope is useful for people.

http://www.fastmail.fm/help/technology_ssl_vs_tls_starttls.html

Posted in News. Comments Off

Singapore proxy server discontinued

Some years ago, when connectivity within the Pacific region was less reliable, we added a small proxy server in Singapore which forwarded sessions down a VPN connection to our datacentre in New York.

The world has moved on, and this service is barely used. Reading the logs it’s almost all search engines scanning our help pages, which is just going to direct people to the slow proxy copies rather than the originals.

So as of today, the sg.* hostnames point directly to our main New York addresses, and the Singapore proxy will be shut down.

Posted in News. Comments Off