Diary of an outage

As some of you are no doubt aware, yesterday we had a fairly serious outage. It only affected a small number of users, but for them it meant some 4-6 hours with partial or no access to mail. In this technical post I’ll be explaining exactly what happened and what we are doing to fix the problem and ensure it doesn’t happen again in the future.

For the non-technical readers, you can skip to the last paragraph.

Early morning page

At around 4:30am Melbourne time (around 17:30 UTC) I was paged by our automated monitoring tools. We have extensive monitoring of most aspects of our infrastructure, and the system has permission to wake people up if it notices a problem. In this case, it noticed that one of our many backend mail servers, imap21, was no longer reachable. Unlike many other service providers, we don’t do automatic failovers, because if the software makes the wrong decision it can lead to worse problems than might otherwise occur. We prefer to put a human into the loop to make sure things look sane before taking any action.

A machine failure of some sort is not as uncommon as you might think. Usually it’s a single disk failure, but we’ve variously had disk controllers fail, power supplies fail, and kernel or other bugs cause machines to crash. In the event of a crash, the machine usually reboots. The normal procedure for a night-time failure is to failover all mail stores on the machine to replicas on other machines. That’s anywhere between 15 and 40 stores depending on hardware configuration. We prefer to return service to the original machine if possible so we can maintain the replicas and thus keep our redundancy level, but that’s usually not possible in the case of a hardware fault – the machine is no longer fit for service.

The failover process involves shutting down both the current “master” slot and the target replica slot, reconfiguring them such that the master becomes the replica and the replica the master, and then starting them both up. The system configuration database is also updated so that all other services (web client, IMAP/POP frontends, mail delivery) know where to find the stores now. This is all done by a scripted process that can be initiated with just one command. In the event that the current master slot is unavailable, that part is skipped.
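The steps above can be sketched in a few lines. This is an illustration only, not the actual failover script: the `Slot` class, the `failover` function, the dictionary standing in for the configuration database and the server name `imap-b` are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    server: str
    role: str              # "master" or "replica"
    running: bool = True
    reachable: bool = True


def failover(store, master, replica, config_db):
    """Swap roles for one mail store and repoint other services at the new master."""
    # Shut down both slots; an unreachable machine is simply skipped.
    for slot in (master, replica):
        if slot.reachable:
            slot.running = False
    # Reconfigure: the master becomes the replica and the replica the master.
    master.role, replica.role = "replica", "master"
    # Web client, IMAP/POP frontends and mail delivery look the store up here.
    config_db[store] = replica.server
    # Start both slots again (again skipping anything unreachable).
    for slot in (master, replica):
        if slot.reachable:
            slot.running = True
    return replica


config_db = {"store1": "imap21"}
dead = Slot("imap21", "master", running=False, reachable=False)
spare = Slot("imap-b", "replica")
failover("store1", dead, spare, config_db)
```

Note that nothing in this sketch checks whether the replica is up to date before promoting it, which is the gap the rest of this post is about.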

This morning imap21 failed in a fairly serious way. It was not available on the network. The management console showed that it tried to reboot, found corruption on the operating system disk, tried to repair it and couldn’t and was waiting for manual intervention. That immediately told me this was a serious problem and the best course of action was not to try and repair it, but instead failover its 15 slots to replicas. I did this, checked our monitoring to confirm all systems were operating normally, and went back to bed. I knew that full repairs would potentially be a big job, and I wanted enough sleep to be able to do it without messing things up further. As far as I was able to tell service had been fully restored to users, which is the single most important thing.

Replication logs

Fast forward to 9am. I arrived at the office after a bit more sleep at home and on the train (you take what you can get!). On arrival I was told that a number of users had been reporting either missing mail or errors on login. The first thing I thought of turned out to be correct: at least one of the replicas I had failed over to was not up to date. But first we had to prove it!

The Cyrus replication model is an “eventually consistent” one, fairly common among database server software (which is what Cyrus is, though optimised for email storage). When some action is performed (delivering a message, creating a folder, moving a message, etc), the operation is performed on the local data store, and a record of the operation is written to a replication log. A separate pair of programs (sync_client and sync_server) use that log to perform the same actions on the replicas. Typically that happens in near realtime, but if replication was in fact behind, then we should see lots of unprocessed lines in the replication log.
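As a rough mental model (a toy sketch, not Cyrus code; `Store`, `perform` and `sync` are invented names, and the real sync_client and sync_server do far more), the write path and the replay path look something like this:

```python
class Store:
    """A toy mail store: folder name -> list of messages."""

    def __init__(self):
        self.folders = {}

    def apply(self, event):
        kind, folder, payload = event
        if kind == "create":
            self.folders.setdefault(folder, [])
        elif kind == "deliver":
            self.folders.setdefault(folder, []).append(payload)


def perform(master, log, event):
    master.apply(event)   # 1. change the local store
    log.append(event)     # 2. record the operation for later replay


def sync(log, replica):
    """Replay the log on a replica, removing each event once it is applied."""
    applied = 0
    while log:
        replica.apply(log[0])
        log.pop(0)
        applied += 1
    return applied


master, replica, log = Store(), Store(), []
perform(master, log, ("create", "INBOX", None))
perform(master, log, ("deliver", "INBOX", "hello"))
backlog = len(log)        # unprocessed lines: how far behind the replica is
sync(log, replica)
```

Until `sync` runs, the replica knows nothing about the new folder or message, and the length of the log is the only sign of how far behind it is.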

Getting to the replication log was something of a problem because as noted above, imap21 refused to boot. A filesystem check was not pretty – literally thousands of errors. That’s quite worrisome as this is a root disk which generally shouldn’t have any corruption on it (it almost never receives writes), but figuring out why the crash happened is something of a secondary concern at this point. So I reinstalled the OS into the root partition. That’s something we do all the time and it’s fast and accurate. Twenty minutes later it was fully reinstalled and the machine was up and running, and we got to inspect the replication logs.

The good news was that 14 of the 15 stores on this machine had fully up-to-date replicas, so there was only one mail store to deal with. The bad news was that the remaining store had some 2-3 days’ worth of unreplicated events in its log. To add more pain to the situation, that store was coincidentally the current “new user” store, where new users are created. That means that any users that had signed up in the last 2-3 days did not exist at all on the replica, thus the reported login errors. A horrible first experience!

Examining the log, we discovered that right at the top there was a complex series of folder renames within a single replication event. This is not a particularly unusual operation. This time it tripped a known, rare bug in the way renames are replicated that caused the replication process (sync_client) to abort. The Cyrus master daemon starts it up again, but then it hits the same point and dies again, over and over. Replication stops.
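The shape of that stall can be shown with a toy replay loop. This is only a sketch under assumptions: `sync_once`, `supervise`, `flaky_apply` and the event names are hypothetical, and the real sync_client is considerably more involved.

```python
class ReplicationAbort(Exception):
    pass


def sync_once(log, apply):
    """One run of a replay process: work from the head of the log until something fails."""
    while log:
        apply(log[0])      # raises on the problem event, before it is removed
        log.pop(0)


def supervise(log, apply, restarts):
    """Restart the replay process a fixed number of times, as a master daemon might."""
    crashes = 0
    for _ in range(restarts):
        try:
            sync_once(log, apply)
            break
        except ReplicationAbort:
            crashes += 1   # same head event, same failure, every time
    return crashes


def flaky_apply(event):
    # Stand-in for the rename bug: one particular event always aborts.
    if event == "complex-rename":
        raise ReplicationAbort(event)


log = ["complex-rename", "deliver-1", "deliver-2"]
crashes = supervise(log, flaky_apply, restarts=5)
```

Because the failing event is never removed from the head of the log, every restart dies at the same place and everything queued behind it waits.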

Fixing this is on our list of things to do. Because it’s a fairly rare thing to happen and usually gets dealt with quickly, it hasn’t quite made it far enough up our list to deal with. Obviously that has now changed, but let’s talk about why it wasn’t dealt with this time.

Monitoring the monitors

As noted, we have a lot of system monitoring running, and we place a huge amount of faith in it. It’s almost by definition that if there are no problems being reported, then there are no problems, at least in overall system health. Of course individual per-user problems can appear from time to time, which is unfortunate. A support ticket is the usual way to get to the bottom of individual problems like that.

Something we monitor is replication lag, which is almost entirely based on the size of the replication logs. If they grow “too large”, a low-level warning is produced. A warning of this type results in a message being posted to our IRC channel and an email being sent, but it will never generate a page. It’s the kind of thing that we look at and action every now and again, and as noted it’s a fairly rare event under normal operation.

There is however one time where that warning can occur yet not be a problem, and that’s when users are being moved around a lot. User moves are done via the replication system, and when you’re doing a lot of them at once it can generate a lot of replication log traffic, sometimes causing replication to lag significantly for short periods of time. Something we’ve been doing for the last few months is redesigning our disk layout into something much easier to reason about and work with, and that has required a lot of moves. This particular warning has not been as uncommon as it should have been recently, leading to the situation where we’ve started ignoring it.

Obviously this is a dangerous place to be. I recognise how bad that sounds – “system warnings were ignored” – but this is what happens when you have a warning that doesn’t quite match up with the importance of the situation. Think of it like the fuel light on your car. It needs to be calibrated to come on at just the right time. Too late, and you run out of fuel before you can refill. Too early, and you start to learn that it isn’t really a problem; you won’t run out of fuel for ages. At that point you might as well not have the light at all. It’s not a perfect analogy, but it’s instructive.

So far we haven’t figured out the original cause of the filesystem corruption (we suspect a hardware failure that isn’t visible to our normal tools) that led to the crash. But that’s not quite the point. Had appropriate action been taken when the replication lag was first noticed, we would have had fully up-to-date replicas at failover, and this entire situation would have been little more than a ten-minute outage and me being a little tired for the day.

Now what?

Obviously, we have a few things to do out of this!

  • The replication lag monitor needs to be able to send a page when things are getting bad. Our current thinking is that it should page if it doesn’t have at least one replica less than five minutes behind.
  • It also needs to understand that there are other things that can cause lag and compensate, which means it needs to know what user move operations are in progress. That said, replication falling behind is always a little dangerous, so we might be better to somehow change the way moves happen so that we always have at least one viable replica at all times. We’ll need to consider this in more detail.
  • We need to make the replication system able to cope with hard replication failures. One idea we’re currently considering is to put failed replication events into another logfile, and come back to them once the main log is empty. This needs a bit of thought, particularly around the consequences of operations being applied out-of-order.
  • We need to fix the renaming bug. Having a last-ditch protection against replication problems is great, but even better than handling problems is avoiding them in the first place.
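Two of the ideas in this list, the paging rule and the second logfile for failed events, can be sketched briefly. Neither is an implementation of anything that exists yet; `should_page`, `replay_with_deferral` and the event names are hypothetical, and the five-minute figure is just the current thinking mentioned above.

```python
def should_page(replica_lag_seconds, max_lag=300):
    """Page when no replica is less than `max_lag` seconds behind."""
    return not any(lag < max_lag for lag in replica_lag_seconds)


def replay_with_deferral(log, apply):
    """Apply what we can; park failing events in a second list for a later retry."""
    deferred = []
    for event in log:
        try:
            apply(event)
        except Exception:
            deferred.append(event)   # every later event now runs ahead of this one
    return deferred


applied = []


def apply(event):
    # Stand-in for a hard replication failure on one event.
    if event == "rename":
        raise ValueError(event)
    applied.append(event)


deferred = replay_with_deferral(["rename", "deliver-1", "deliver-2"], apply)
```

The out-of-order concern is visible even here: if `deliver-1` targets the folder that the deferred rename was supposed to create, applying it first may fail or put the message somewhere unexpected.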

Conclusion

It’s now 13 hours since the problem was fully understood and recovery began. We believe that most if not all missing mail and broken accounts have been repaired. If you were unfortunate enough to be affected and you’re still seeing problems, please contact support.

We are very sorry for this outage. We understand how important your email is and how much it affects you when it’s unavailable. We’re proud of our track record on reliability but we know we’ve dropped the ball on this one. To our new users, who are the ones most likely affected by this outage, we understand very well how this makes us look. We’re working hard to get the situation resolved and to restore your trust. Thank you all for your patience and understanding.

Posted in Technical. Comments Off

Content Security Policy now on Beta

At FastMail, we’re always looking to increase security for our users. Cross-site scripting (XSS) attacks are one of the dangers that all websites must take care to mitigate against. HTML email is the highest risk for all webmail providers. Before embedding it into a page, it must be carefully checked and any potentially malicious content removed. In particular, all scripting content must be removed otherwise an attacker could gain access to your account and email.

Due to the complex nature of HTML parsing and encoding, there are many ways that a malicious email might try to sneak through scripting content. That’s why we fully parse the HTML first on the server and sanitise it against a white-list of known-good tags and attributes. This ensures that any scripting content is stripped, and other ambiguous content is properly escaped and encoded.

We’re very careful, and we have lots of tests to ensure we protect against all known techniques for trying to embed scripts. However, there’s always a possibility of bugs in any software, and Content Security Policy, also known as CSP, is a new HTTP header that provides an extra layer of defence against these types of attacks.

With CSP, we can instruct all modern browsers to only ever load scripts from our own website. Any references to remote scripts or "inline" scripts will be blocked. This means if a malicious email somehow slips through our filters, the browser still stops it from doing anything dangerous.
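For readers who haven’t seen CSP before, here is roughly what sending such a header looks like in a tiny WSGI application. It is a sketch only: the policy string `script-src 'self'` is an assumed minimal policy chosen for the example, not necessarily the header FastMail sends, and `app` is a made-up name.

```python
# Assumed minimal policy: scripts may only be loaded from the page's own origin.
# Inline scripts are also blocked because 'unsafe-inline' is not listed.
CSP = "script-src 'self'"


def app(environ, start_response):
    """A minimal WSGI app that attaches the policy to every response."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Security-Policy", CSP),
    ]
    start_response("200 OK", headers)
    return [b"<p>message body goes here</p>"]
```

The browser enforces the policy, so a script tag that slipped past server-side sanitising would still be refused at load time.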

We’ve just rolled this out on our beta server (https://beta.fastmail.fm) for testing. We hope to roll it out everywhere soon. If you use our beta server, please let us know of any new issues you notice by emailing betafeedback@fastmail.fm. Some browsers may have issues with extensions. These should be allowed to run according to the spec, but some browsers may prevent them from doing so as a violation of the content security policy. If you have a problem with an extension at FastMail, please first try updating to the latest version. If the issue still persists, please let us know so we can contact the extension authors.

Posted in Technical. Comments Off

Secure SSL/TLS access to LDAP and DAV now mandatory

Over the last few years we’ve been phasing in mandatory SSL/TLS encryption on all connections between user machines and our servers, ensuring that no one can eavesdrop on your username or password to steal your login credentials.

We’re continuing with that process today by disabling non-SSL/TLS access to all LDAP and DAV services. A few weeks ago we emailed everyone we believe was using these services to inform them of the upcoming change.

This means if you use LDAP to access your address book, you must use port 636 with SSL/TLS enabled.

If you use DAV to access files in your file storage, you must use https://dav.messagingengine.com, not http://dav.messagingengine.com (note the additional “s” in https://).

Posted in News. Comments Off

Calendar now available on beta.fastmail.fm for testing

We’re very excited to have released the web UI for our new calendar on to our beta server for public testing at https://beta.fastmail.fm.

To access, simply log in to your FastMail account at https://beta.fastmail.fm and select "Calendar" from the menu in the top left.

Note: The calendar is only available when you log in to the new user interface (the default). If you use the “Classic” user interface (by explicitly selecting it at login time or because you have an older browser like IE6/7 which doesn’t support the new interface) the calendar will not be available. We currently have no plans to port the calendar to the classic user interface.

Tips and tricks

  • Open the settings and select the new Calendar panel to enable a few advanced features, create new calendars and make sure your time zone is set correctly.
  • You can drag and drop events to move them (or hold down alt whilst dragging to copy).
  • There are keyboard shortcuts for navigating (try j/k, or hit g), and also for the buttons in the action bar at the top (hover over the button to get a tooltip with the shortcut).

Sync to mobile/calendar software (CalDAV)

You can also sync your calendar with your mobile device as long as your device supports the CalDAV protocol (iOS supports it natively; Android requires a separate program, and CalDAV-Sync works well at around US$2). The required details are:

Most clients should correctly auto-discover your calendars. If that fails, you might need to set up the full CalDAV path in your client, which is:

Make sure you replace fullfastmailusername@domain with your full FastMail username and domain.

Access restrictions

Not all service levels have access to the Calendar features.

Web UI

Available to all levels except legacy Guest and Member accounts (at the moment, the link is shown but events will not save; we will improve the UI here to make it clear that the calendar is not available to these service levels)

CalDAV

Available to:

  • Enhanced and Premier level accounts (Personal/Family)
  • Standard, Professional and Enterprise level accounts (Business)

That is, these levels do not have CalDAV access: Guest and Member (legacy); Lite and Full (Personal/Family); Basic (Business).

Current known issues and missing features

We’re actively working on all these issues.

  • There are currently some layout issues in older browsers. Supported browsers for now are: Chrome 21+, Firefox 22+, Opera 12.1+, Safari 6.1+, IE9+.
  • Alerts do not work yet (you can set them in the event editor, but no alert will be shown when they are triggered).
  • Email reminders do not work yet (you will not be sent an email).
  • Emails are not yet sent out to people invited to an event.
  • An email is not currently sent when you respond to an invitation in the calendar view.
  • Integration with the mail part of our web UI (save attached event to calendar etc.)
  • Support for files attached to events.
  • Support for calendar sharing between users in a family/business
  • Support for subscribing to public iCal files and showing them in your calendar.
  • An easy way to import/export calendar data.

Please post bug reports, feature suggestions or other comments either in the forums or email us at betafeedback@fastmail.fm. We can’t respond to everything, but we do read it all.

Posted in News. Comments Off

Increased storage quotas and other service level changes

We’ve made some changes to our service levels. These changes simplify pricing, unify personal and family service levels, make it easier to migrate from other services to FastMail and give the vast majority of users an increase in storage at no extra cost.

All accounts

All prices have been rounded to the nearest dollar. The existing $x.95 pricing on all accounts made comparing prices more difficult for users, so we’ve changed to just using whole dollar pricing on all accounts to make comparisons simpler and clearer.

Personal accounts

  • Ad Free has been renamed to Lite. Now that we don’t have free Guest accounts, the name is an anachronism since no accounts have advertising.
  • All Lite (previously Ad Free) and Enhanced accounts have increased email and file storage quotas. Full accounts have increased file storage quotas.
  • The price of the Lite account is now $10/year. We will email all existing Ad Free users about this change shortly with information about how to lock in the existing pricing for some years.

The complete list of new quotas and names is:

Level               Storage  Old quota  New quota
Lite (was Ad free)  Email    100 MB     250 MB
                    Files    2 MB       100 MB
Full                Email    1 GB       1 GB
                    Files    100 MB     1 GB
Enhanced            Email    10 GB      15 GB
                    Files    2 GB       5 GB
Premier             Email    60 GB      60 GB
                    Files    30 GB      30 GB

Family accounts

We’ve renamed all the family service levels to be the same as the personal service levels, and also have the same quotas as the corresponding personal service level.

This means that if you manage more than one account, it’s much easier to switch to a family account. There’s no concern about having slightly different quotas or having to deal with all the different service level names; it’s a straightforward conversion from a personal account to the corresponding family account.

So now the main differences with family accounts compared to personal accounts are:

  • A single billing cycle and credit card for all accounts in the family
  • Add/change/delete accounts in the family from your management accounts at any time
  • Ability to have account names in your own domain (e.g. john@yourdomain.com)
  • Ability to have your own login screen at http://mail.yourdomain.com
  • It’s an extra $5/year for the family "container"

The complete list of new quotas and names is:

Level                    Storage  Old quota  New quota
Lite                     Email    200 MB     250 MB
                         Files    6 MB       100 MB
Full (was Everyday)      Email    800 MB     1 GB
                         Files    600 MB     1 GB
Enhanced (was Superior)  Email    8 GB       15 GB
                         Files    6 GB       5 GB
Premier (new)            Email    N/A        60 GB
                         Files    N/A        30 GB

Business accounts

All Basic, Standard and Professional accounts have increased email and file storage quotas. The complete list of new quotas is:

Level         Storage  Old quota  New quota
Basic         Email    250 MB     500 MB
              Files    2 MB       100 MB
Standard      Email    1.5 GB     2 GB
              Files    100 MB     1 GB
Professional  Email    15 GB      25 GB
              Files    6 GB       10 GB
Enterprise    Email    150 GB     150 GB
              Files    60 GB      60 GB

We’ll be emailing these details to all users shortly.

Posted in News. Comments Off

Apple mail “bug” turns out to be user script after all

This is a technical blog post which gives updates on the previous post. There is no need to read this unless you’re particularly interested in IMAP protocol issues and/or gossip.

Of course it’s hard to give updates quite the same level of press that the original post got. The internet does outrage far better than it does corrections.

The OS X Mail team have been really helpful in tracking this down. I put them in direct contact with the user to do further debugging. A very embarrassed user discovered an AppleScript he wrote years ago to move mail from OS X Mail’s “semantic junk” folder to the real Junk folder at FastMail where our Bayes trainer could learn from it.

What has changed is that OS X Mail now correctly detects the \Junk special-use on the folder at our server, and sets the semantic junk to be that folder – meaning he was moving messages from that folder to the same folder. Which raises another interesting question:

Could we stop this happening again?

There are three main places I can see that it could have been stopped:

  1. OS X Mail when it detected a “move” from a folder to the same folder and not performing it
  2. FastMail detecting a copy to the same folder and rejecting it
  3. FastMail detecting that the same message had occurred “too many times” in the past and rejecting the copy

Interestingly, none of these would have stopped the infinite loopyness. The script was pretty basic, and would have found messages in “Junk Mail” every time it ran.

There’s no way to avoid infinite loops in a sufficiently powerful language. You solve the “halting problem” when the user’s computer catches fire due to excess CPU usage, but otherwise it will just sit there eating power and making the computer feel slow.

With option (1) the reject would have been local, and minimal cost.

Option (2) costs more – because it causes a network round-trip each time, and it breaks theoretically useful usage (duplicating an email… not entirely sure how that’s useful in the case where you KNOW that it’s the same message – but the standard doesn’t say to reject it).

Option (3) is less breakage than option (2), because it only blocks the copy when it detects multiple past copies – but it’s expensive for the server – Cyrus doesn’t keep an index by GUID on the mailbox, so it would have to detect “copy to self” and do a full pass keeping a counter for each message and seeing if there were already “too many copies”. It’s more code and more complexity. It does bring a level of robustness against weird behaviour though.
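A minimal sketch of what option (3) might look like, assuming the server can list the GUID of every record in the mailbox (including recently expunged ones). The function name `allow_copy`, its parameters and the threshold of 10 are all invented for illustration; this is not code from Cyrus.

```python
from collections import Counter


def allow_copy(source, dest, guids_to_copy, mailbox_guids, max_copies=10):
    """Reject a copy-to-self only when a message already has too many copies."""
    if source != dest:
        return True                      # ordinary copies are never affected
    # Full pass over the mailbox: there is no per-GUID index to consult.
    counts = Counter(mailbox_guids)
    return all(counts[guid] < max_copies for guid in guids_to_copy)


# Twelve records of message "a" (mostly expunged duplicates) and one of "b".
guids = ["a"] * 12 + ["b"]
ok_single = allow_copy("INBOX.Junk Mail", "INBOX.Junk Mail", ["b"], guids)
ok_loop = allow_copy("INBOX.Junk Mail", "INBOX.Junk Mail", ["a", "b"], guids)
```

A deliberate one-off duplicate still works, while a client stuck in a loop is refused once the count builds up. The price is the full pass on every copy-to-self.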

Who should fix it?

Obviously, I think that client authors should be careful in what they send. They have the most information about the users’ intent, and the most ability to give useful feedback to the user about why the action is being rejected.

But I know it’s tilting at windmills. The OS X Mail team have been fantastic now that I’m in contact with them, but they are one of many clients, and it only takes one.

For now, I’ve put in a syslog statement that tells me how often clients are copying/moving messages to the same folder. The answer is – quite a lot. Some are OS X Mail (including older versions). Some don’t identify themselves, but I’ve found at least Outlook 14 in the “X-Mailer” fields of Sent Messages from the same IP address… so that’s a clue.

I’m guessing I really have to implement some variant of (3) if I want to not break any existing users. Hopefully the OS X Mail team will implement (1).

An aside on “detecting what’s happening at the server”

OS X Mail isn’t using the recently standardised “MOVE” command, it’s using the “COPY” command followed by “EXPUNGE” on the source messages – which is part of the original standard, and works on every server.

If we answered success to the COPY but didn’t actually create new copies of the message, it would mean that the messages would be silently deleted a moment later. Handy in this case, but hardly good behaviour in general! A user could accidentally wipe their entire INBOX.
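To make the danger concrete, here is a toy model of a client doing a “move” as COPY followed by deleting the originals. It is a simplified illustration (the function `move_via_copy` and its flag are hypothetical, and flagging plus expunging are collapsed into one step), not how any real client or server is written.

```python
def move_via_copy(mailbox, uids, next_uid, server_really_copies=True):
    """Model a client-side move within one folder: COPY, then remove the originals."""
    mailbox = dict(mailbox)                  # uid -> message; work on a copy
    if server_really_copies:
        for uid in uids:
            mailbox[next_uid] = mailbox[uid]
            next_uid += 1
    # The server answered OK either way, so the client flags the originals
    # as deleted and expunges them.
    for uid in uids:
        del mailbox[uid]
    return mailbox


junk = {1: "spam-a", 2: "spam-b"}
honest = move_via_copy(junk, [1, 2], next_uid=3)
lying = move_via_copy(junk, [1, 2], next_uid=3, server_really_copies=False)
```

With an honest server the messages survive under new UIDs; with a server that only pretends to copy, the folder ends up empty.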

This is a general problem that I have with IMAP. It’s a very imperative language. Rather than describing what you want done, you send explicit instructions against a very rigid model. Not only that, but there have been tons of extensions. You can use them if they exist – but a server can advertise itself as “IMAP” and support none of them, not even really useful basic things like UIDPLUS. Check this out:

http://imapwiki.org/Specs

Which means that clients either have to have two codepaths (one for servers with the extension, and one for servers without), or just use the “without” codepath for every server.

If clients do have two different codepaths, the chance of bugs is more than doubled, because they will interact in interesting ways. Another issue I debugged this week was an issue with Thunderbird 24.0.1 not noticing that messages had been deleted from the server. The server was sending the right data, but Thunderbird wasn’t recognising it. Turning off support for CONDSTORE fixed it, which is why it “worked” on other servers which didn’t have CONDSTORE support, or didn’t have it switched on.

So the clients try to convert the intentions of the user into changes on the server, and they have to do so through a set of very small commands which each do one thing. There’s no way for the client to say “I need all the following information”, or “here’s the transformation to the data model that the user requested”. And then the server, if it wants to optimise at all, has to guess the intent from the commands as they come in. Some servers try to read ahead to see if the client has pipelined (made multiple requests before waiting for the answers). Our server doesn’t – it just takes advantage of the fact that at least the files are “hot” in memory, so accessing them will be faster next time.

Some examples:

Unread counts? Have to ask for them separately for every folder. If you want to display all the folders with unread messages, you need to first get the full listing of folders, then for each one request the unread count in a separate command.

Want the “first text part of every message”? Good luck – you need to fetch the bodystructure for each, and either combine those for which the text part is the same number into separate batches, or just do one fetch per message.
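The unread-count case looks roughly like this from a client’s point of view. This is a sketch written against the interface of Python’s imaplib (`list()` and `status()` on a connection object); the parsing is deliberately simplified and, for instance, ignores folder names sent as IMAP literals. `unread_counts` is an invented helper name.

```python
import re

LIST_RE = re.compile(rb'\((?P<flags>[^)]*)\) (?P<delim>"[^"]*"|NIL) (?P<name>.+)')
UNSEEN_RE = re.compile(rb"UNSEEN (\d+)")


def unread_counts(conn):
    """One LIST, then one STATUS per folder. `conn` is an imaplib.IMAP4-style object."""
    _, lines = conn.list()
    counts = {}
    for line in lines:
        match = LIST_RE.match(line)
        if not match or rb"\Noselect" in match.group("flags"):
            continue                              # unparseable or not selectable
        name = match.group("name").decode()       # still quoted if the server quoted it
        _, data = conn.status(name, "(UNSEEN)")   # a separate command for every folder
        found = UNSEEN_RE.search(data[0])
        counts[name.strip('"')] = int(found.group(1)) if found else 0
    return counts
```

With a few hundred folders that is a few hundred commands just to draw the folder list, which is exactly the kind of chatter the protocol forces on clients.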

So it’s not surprising that IMAP clients have bugs. There is some necessary complexity to the problem of accessing email remotely – but plenty of the complexity has been caused by standards which have the client doing too much telling the server “how to do it” rather than “what needs to be done”.

Posted in Technical. Comments Off

Mac OS 10.9 – Infinity times your spam

UPDATE: the cause of the “infinite copy” bug was tracked down to a user-side AppleScript. I have written a separate blog post about it.

This is a technical blog/rant. Users of FastMail who don’t use Apple’s email clients can safely skip it.

I’ve spent quite a lot of the past few days dealing with bugs caused by the Mail app in Apple’s new Mavericks update.

Apple’s mail client has been buggy with IMAP connections forever. It was infamous a couple of years ago for creating folders called INBOX.INBOX.INBOX.INBOX.INBOX.INBOX (until it hit the mailbox limit). We now block those at create time because they were causing “interesting” problems as well as being confusing.

This release has other interesting bugs that I’ve looked into in the past couple of days. When you uncheck the “keep a copy of my sent email on the server” checkbox, it auto-rechecks itself. I confirmed that report myself on our test laptop.

I also confirmed that the ‘Archive’ folder (special-use \Archive) doesn’t appear in the folder listing, and neither do any subfolders if you have any (one of my accounts does – one per year for the past *mumble* years).

So we know it’s not the breakfast of champions. That’s not a giant surprise, it’s never been the strong point of that OS. The previous revision has problems too:

229.12 UID STORE 588201 +FLAGS.SILENT ($Junk Junk JunkRecorded)
229.12 OK Completed 
230.12 UID STORE 588201 -FLAGS.SILENT ($NotJunk NotJunk)
230.12 OK Completed 
231.12 UID EXPUNGE 588201
231.12 OK Completed 
232.12 UID STORE 588201 +FLAGS.SILENT (\Deleted)
232.12 OK Completed 

Anyone who can read IMAP can see that it tries to expunge the message BEFORE the \Deleted flag is set, which is pointless. UID EXPUNGE only deletes messages with the \Deleted flag set.

http://tools.ietf.org/html/rfc4315#section-2.1
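A toy model shows why the order matters. The mailbox here is just a dictionary of UID to flag set, and `uid_expunge` and `store_add` are hypothetical helpers that mimic the two commands; the only behaviour taken from the RFC is that UID EXPUNGE removes a listed message only if it has the \Deleted flag.

```python
def uid_expunge(mailbox, uids):
    """Remove only those listed UIDs that carry the \\Deleted flag."""
    for uid in uids:
        if "\\Deleted" in mailbox.get(uid, set()):
            del mailbox[uid]


def store_add(mailbox, uid, flag):
    """Add a flag to a message, like STORE +FLAGS.SILENT."""
    mailbox[uid].add(flag)


# Order seen in the transcript above: expunge first, flag afterwards.
as_sent = {588201: {"$Junk"}}
uid_expunge(as_sent, [588201])
store_add(as_sent, 588201, "\\Deleted")

# Order that actually removes the message.
as_intended = {588201: {"$Junk"}}
store_add(as_intended, 588201, "\\Deleted")
uid_expunge(as_intended, [588201])
```

In the first case the message is left behind, flagged \Deleted, until some other client happens to expunge it.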

I found _that_ one because our web interface expunges all \Deleted items, so users noticed they only got the expected behaviour across multiple clients if they left the web interface open at the same time.

Millions of messages

But this doozy takes the cake. I found it nearly a week ago when we had an IO error because a user’s cache file was overflowing the 32 bit offset counter that still exists in Cyrus (I have an experimental branch with 64 bit offsets, but it’s not ready for production yet).

They had 71 unique messages in their Junk Mail folder, but including expunged messages (we keep them for a week for backup purposes) there were over 1 MILLION separate entries in the index file. We de-duplicate on store, so the fact that there were over 100,000 copies of the most prolific message in the index didn’t totally flood the disk.

I noticed then that they were using 10.9’s mail app:

3.19 ID ("name" "Mac OS X Mail" "version" "7.0 (1816)" "os" "Mac OS X" "os-version" "10.9 (13A603)" "vendor" "Apple Inc.")

I wiped the expunged messages from the cache, emailed the user, and left it at that.

This morning I checked again, there were nearly a million messages again, so I enabled telemetry on the account and emailed the user a second time.

Here’s what I saw in the telemetry (one of many):

44.18 SELECT "INBOX.Junk Mail" (CONDSTORE)
[...]
45.18 FETCH 1:* (FLAGS UID MODSEQ) (CHANGEDSINCE 213634)
45.18 OK Completed (0.000 sec) 
46.18 IDLE
+ idling 
DONE
46.18 OK Completed 
47.18 CHECK
47.18 OK Completed 
48.18 UID COPY 3360991:3361069 "INBOX.Junk Mail"
* 158 EXISTS 
* 79 RECENT
48.18 OK [COPYUID 1318456612 3360991:3361069 3361070:3361148] Completed

Yes, you read that right. It’s copying all the email from the Junk folder back into the Junk folder again! This is legal IMAP, so our server proceeds to create a new copy of each message in the folder.

It then expunges the old copies of the messages, but it’s happening so often that the current UID on that folder is up to over 3 million. It was just over 2 million a few days ago when I first emailed the user to alert them to the situation, so it’s grown by another million since.

The only way I can think this escaped QA was that they used a server which (like gmail) automatically suppresses duplicates for all their testing, because this is a massively bad problem.

I discovered my second attempt to contact the user about this problem in their Junk folder tonight. 10 times already!

Given that my colleague had just been paged by high disk usage on that user’s server – a usage which was growing fast, and which got reduced massively by expunging old deleted messages… it was time to lock the account until the faulty email client is disabled. We don’t lock user accounts lightly, but running a malfunctioning piece of software which is affecting other users and resisting attempts to contact qualifies, and a promise to disable the faulty software will be enough to resume service.

Yes, Mail.app was cleaning up after itself, but we keep deleted emails for a week, and even though it wasn’t using disk space, over a million emails were using enough meta database space that a disk had filled up. There are only 79 actual emails in this folder with a total usage of about 2MB, yet the meta files:

91M cyrus.index
906M cyrus.annotations
1.2G cyrus.cache

Over 2GB of disk usage.

Solving problems

The sad thing is – there are about 600 copies of Cyrus on our production farm, and I can roll out a new copy in about 5 minutes. There are umpteen million copies of Mail.app out there, and I can’t get them fixed on any particular schedule – so if this happens with more than one user the only real solution that I have is to code a workaround directly into our server to protect our other users.

We already wrote a special case for another one of Apple’s brilliant ideas – make the search box default to a full text search of every mailbox. The most expensive possible option for the server.

We have rate limits for other things, but we’ve never considered needing a rate limit for the COPY command – it would usually hit quota, but because these messages are expunged as fast as they are created, the quota doesn’t catch this issue.
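If a COPY rate limit were ever needed, a token bucket is one common way to build it. The sketch below is purely hypothetical: nothing like `CopyRateLimit` exists in our server, and the capacity and refill rate are made-up numbers chosen so that the 79-message copy from the telemetry above can be replayed against it.

```python
class CopyRateLimit:
    """Token bucket: allow bursts up to `capacity`, refilling `rate` tokens per second."""

    def __init__(self, capacity, rate, now=0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = now

    def allow(self, n_messages, now):
        # Top the bucket up for the time elapsed since the last request.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if n_messages > self.tokens:
            return False               # over budget: reject this COPY
        self.tokens -= n_messages
        return True


limit = CopyRateLimit(capacity=200, rate=1.0)
results = [limit.allow(79, now=t) for t in (0, 1, 2, 60)]
```

Unlike quota, this counts messages copied per unit of time, so copies that are expunged straight away still use up the budget.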

The 4 million message 32 bit limit of the UID field will become interesting on that folder soon too, which is another thing we’ve never hit in production before. The workaround here is at least known – create a new folder, copy the messages across, delete the old folder, and rename the folder into place with a new UIDVALIDITY and new messages.

Correction: as many people have pointed out to me, the limit is 4 billion, not 4 million – so much for last minute ideas when writing late at night. Sorry for the confusion. It would take a lot longer than a few days to hit the limit!
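For the curious, the corrected arithmetic is easy to check. The growth rate below is an assumption: the post only says the UID grew by about a million in “a few days”, so 250,000 per day is used as a stand-in.

```python
UID_LIMIT = 2**32 - 1          # largest value of a 32-bit unsigned UID


def days_until_limit(current_uid, uids_per_day):
    """How long until the folder runs out of UIDs at a constant rate."""
    return (UID_LIMIT - current_uid) / uids_per_day


# Assumed rate: roughly a million new UIDs every four days.
days = days_until_limit(current_uid=3_000_000, uids_per_day=250_000)
years = days / 365
```

At that rate the folder would take decades, not days, to exhaust its UIDs.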

Posted in Technical. Comments Off