More servers installed to deal with spam load

We’ve now set up and installed two new servers to deal with the increased mail delivery load of the last two days. These new servers are powerful dual-core Xeon 5130 based machines, so they should be able to handle the processing load. We’ve also lowered the queue warning level, so we’ll be paged earlier if excess email does start building up and can deal with it more quickly.
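
To give an idea of what that warning level means in practice, here’s a minimal sketch of the sort of queue-depth check involved (not our actual monitoring code; the queue directory, threshold and paging hook are placeholder assumptions):

    #!/usr/bin/env python
    # Hypothetical queue-depth monitor, a sketch only. It assumes the MTA keeps
    # one file per queued message under QUEUE_DIR; the directory, warning level
    # and paging hook are all placeholder assumptions.
    import os

    QUEUE_DIR = "/var/spool/mqueue"   # assumption: adjust for the actual MTA
    WARN_LEVEL = 500                  # example value for the lowered warning level

    def queue_depth(path=QUEUE_DIR):
        # Count queued message files; each file is assumed to be one message.
        return sum(len(files) for _, _, files in os.walk(path))

    if __name__ == "__main__":
        depth = queue_depth()
        if depth > WARN_LEVEL:
            # In reality this would page the on-call admin rather than print.
            print("PAGE: mail queue depth %d exceeds warning level %d"
                  % (depth, WARN_LEVEL))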

With the new servers, we’ve also taken some time to update our SpamAssassin installation to the latest version and to install the FuzzyOcr plugin, to try to deal with the large number of stock “pump-and-dump” image spams that are plaguing Internet email users at the moment.
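
For the curious, the idea behind OCR-based image spam filtering is straightforward: pull the image attachments out of a message, run them through an OCR engine, and score the recovered text against a list of stock-spam phrases. FuzzyOcr itself is a Perl SpamAssassin plugin; the Python sketch below is only an illustration of the concept, not what we actually run:

    # Illustration of the idea only: FuzzyOcr itself is a Perl SpamAssassin plugin
    # (normally driving an OCR tool such as gocr), not this Python/pytesseract
    # sketch. The keyword list and threshold below are made-up examples.
    import email
    import io

    from PIL import Image          # Pillow
    import pytesseract             # Tesseract OCR wrapper

    PUMP_AND_DUMP_WORDS = {"stock", "shares", "price", "target", "symbol", "invest"}
    HIT_THRESHOLD = 3              # arbitrary example score

    def image_spam_hits(raw_message):
        """OCR every image attachment and count stock-spam keywords in the text."""
        msg = email.message_from_bytes(raw_message)
        hits = 0
        for part in msg.walk():
            if part.get_content_maintype() != "image":
                continue
            payload = part.get_payload(decode=True)
            if not payload:
                continue
            text = pytesseract.image_to_string(Image.open(io.BytesIO(payload))).lower()
            hits += sum(1 for word in PUMP_AND_DUMP_WORDS if word in text)
        return hits

    def looks_like_image_spam(raw_message):
        return image_spam_hits(raw_message) >= HIT_THRESHOLD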


Delayed inbound mail again

It seems all the measures we put in place yesterday weren’t enough. We’ve just been paged by our servers that an inbound delayed mail queue has built up again. We’re seeing what we can do to bring it back down.

Update (2:30pm EST): The queues are going down. We’re getting NYI to install 2 new servers ASAP (machines we had previously planned to use for IMAP expansion) and will try to bring those up to help with the processing.

Update (3:05pm EST): Over half the mail from the queues has now been delivered. Any new mail arriving should be delivered within about 1-2 minutes; old mail is being delivered as the queue is cleared.

Update (4:05pm EST): Almost all of the queues have cleared now and all new mail should be delivered immediately.


Email delays yesterday

It seems the spam attack we’ve been experiencing for the last week intensified even further today, causing email to back up on some of our servers and delaying some email by up to several hours. We’ve analysed what happened and believe we now have procedures in place to stop this occurring again in the future.


Massive increase in email connections to our servers

About 4 days ago, something went crazy with the spam zombie machines out there. Previously, the spam sending software spammers were using acted like reasonably well behaved email sending software. It would connect to us, try to send its spam, then disconnect, just like any other email sending system on the Internet. They’d do that every now and then, maybe with an hour or two between attempts. Still, with hundreds of thousands of machines, that’s millions of attempts a day to send spam.

Now, however, the zombie machines and software have just gone insane and are connecting over and over every few minutes, but mostly doing nothing during the connection. While that might in theory seem fine since they do nothing, it’s not. When you have 200,000+ machines connecting to you every few minutes, even if they do nothing you still have the connections to deal with, the RBL DNS lookups, the rate limiting lookups, etc.
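
To make that per-connection cost concrete, here’s a minimal sketch of just one of those steps, an RBL (DNSBL) lookup. The zone queried below is the well-known public Spamhaus example; which lists we actually query is not something this sketch documents:

    # A sketch of a single DNSBL (RBL) lookup, the kind of work every connection
    # forces even if the client then does nothing. The zone is the well-known
    # public Spamhaus example; the lists we actually query are an assumption.
    import socket

    def dnsbl_listed(ip, zone="zen.spamhaus.org"):
        # DNSBLs are queried by reversing the IP's octets and appending the zone:
        # 203.0.113.7 -> 7.113.0.203.zen.spamhaus.org
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)   # any A record means "listed"
            return True
        except socket.gaierror:           # NXDOMAIN or lookup failure: not listed
            return False

    if __name__ == "__main__":
        # 127.0.0.2 is the conventional "always listed" test address for most DNSBLs
        print(dnsbl_listed("127.0.0.2"))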

The result was a significant jump in load on the incoming servers, well above even the load increase we’d already seen over the last couple of months.

To combat this, we’ve had to invoke some old code from a previous “bombing” attempt we had a while back. This code continuously scans the logs looking for particular aberrant behavior and then puts those IPs on a special “early” block list, which means that as soon as the machine connects, it’s sent a response of:

454 Service temporarily unavailable; Client host [x.y.z.a] blocked using internal list; Access denied

And disconnected. Over the course of hours and days (as infected computers out there were turned on and off), we’ve built up a list of over 200,000 IPs that are now being “early blocked” like this. To give you an idea of how big the surge is, almost three-quarters of all connections are now being “early blocked” by this list. That means incoming connections have probably almost tripled in the last 4 days.
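
The real scanner is part of our internal tooling, but the basic idea is easy to show. Here’s a rough sketch, with a placeholder log format and made-up thresholds:

    # Illustrative sketch only; the real scanner isn't published. The log format,
    # regular expression and threshold here are placeholder assumptions.
    import re
    from collections import defaultdict

    CONNECT_THRESHOLD = 50   # assumed connects per scan window before blocking

    def scan_log(lines):
        """Return IPs that keep connecting but never issue any SMTP commands."""
        connects = defaultdict(int)
        commands = defaultdict(int)
        for line in lines:
            m = re.search(r"\[(\d+\.\d+\.\d+\.\d+)\]\s+(\w+)", line)
            if not m:
                continue
            ip, event = m.group(1), m.group(2)
            if event == "connect":
                connects[ip] += 1
            elif event in ("mail", "rcpt", "data"):
                commands[ip] += 1
        # Aberrant pattern: lots of connections, zero actual mail commands.
        return {ip for ip, n in connects.items()
                if n >= CONNECT_THRESHOLD and commands[ip] == 0}

    if __name__ == "__main__":
        with open("/var/log/maillog") as fh:
            for ip in sorted(scan_log(fh)):
                # These IPs would be fed to the MTA's "early" block list, which
                # answers them with the 454 response shown above.
                print(ip)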

We can also confirm this just by the size of our log files. Normally our email processing logs are rotated once a day, but we’re now having to rotate them multiple times a day because they’re reaching their 2 gigabyte size limit!

Our only current worry is that we may have incorrectly blocked some other services. We’ve had one report from a user whose scanner has been blocked (it’s an Internet-enabled scanner that you can set up to email you when you scan something; unfortunately it seems to be designed for LAN use, and polls the SMTP server you’ve configured every 60 seconds to see if it’s alive, much like the spam zombies do *sigh*).

Some more information about the current spam wave is available at Extreme Tech.

Update: It seems some badly run sites were being blocked. Some sites with incorrect DNS setups were being identified as “dialup/dsl” machines. Others showed the same signature as the spam zombies, namely “connect, do nothing, disconnect”. Others again were sending rapidly to many unknown recipients, also a sign of a spam zombie trying to enumerate usernames. We’ve tightened up the blocking criteria some more, removed a number of existing blocks, and put some common hosts on an IP whitelist so they’re not blocked again in the future.
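
As a rough illustration of the kind of extra checks involved (the reverse-DNS patterns, thresholds and whitelist entries below are invented examples, not our actual criteria):

    # A rough illustration of the extra checks described above; the reverse-DNS
    # patterns, thresholds and whitelist entries are invented examples, not our
    # actual criteria.
    import re
    import socket

    WHITELIST = {"198.51.100.25"}   # known-good hosts, never auto-blocked
    DYNAMIC_PATTERN = re.compile(r"(dsl|dialup|dyn|pool|ppp)", re.IGNORECASE)

    def looks_dynamic(ip):
        """Hosts with no reverse DNS, or with dialup/dsl-style names, are suspect."""
        try:
            name = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return True                 # no reverse DNS at all
        return bool(DYNAMIC_PATTERN.search(name))

    def should_early_block(ip, idle_connects, unknown_rcpts):
        if ip in WHITELIST:
            return False
        # Tightened criteria: suspicious DNS *plus* zombie-like behaviour,
        # not just one signal on its own.
        return looks_dynamic(ip) and (idle_connects > 20 or unknown_rcpts > 50)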


Use Ctrl-Shift-Enter to Send message on the Compose screen in Firefox

For several years, FastMail has allowed you to use Ctrl-Enter on the Compose screen as a shortcut to send a message. Unfortunately, Firefox 1.5 and 2.0 both have a bug that means this doesn’t work as expected: as well as attempting to send the message, they show the “Download manager” dialog. This bug in Firefox is documented and being tracked on the Mozilla Bugzilla site.

Someone recently noted that using Ctrl-Shift-Enter doesn’t trigger the bug, but because the Ctrl key is still pressed, it triggers the correct JavaScript code to send the message. So for Firefox 1.5 and 2.0 users this is a good workaround until the bug is fixed: just use Ctrl-Shift-Enter instead.


A number of small updates

A number of small updates have just been rolled out.

  1. When forwarding a number of messages as attachments, those messages would previously have been marked as read if they were unread. This no longer occurs; the messages stay unread
  2. When sending an email, on the next screen you’re presented with a list of addresses the email was sent to. You’re now also told which folder the sent message was saved into, if one was specified explicitly or via the personality you used
  3. If you try to set up a Pop Link to a Hotmail account that has DAV access disabled, you’re now given a more informative message
  4. Using Internet Explorer on Windows Mobile based PDAs should now be more reliable. In particular, certain web operations (most commonly the Reply or Forward buttons on the message read screen) would cause system error messages to be returned. These should be fixed.

Domain split

We’ve now rolled out the ‘domain split’.

This means that you can use a username that someone else is using, as long as you choose a different domain. So now, a lot of previously unavailable username/domain combinations can be used for aliases or new users.


Why replication took time to set up

This is a copy of a post I put in our forum explaining why it took us some time to get replication set up.

http://www.emailaddresses.com/forum/showthread.php?s=&threadid=46269

The initial issue that made us realise we had to implement some form of replication occurred in November 2005 (http://blog.fastmail.fm/?p=521), when corruption on one of our major volumes caused 3 days of downtime. After that, we started working on how we were going to get replication set up. On the whole, the process went slower than expected. I’d put this down to a couple of things:

1. The cyrus replication code wasn’t really production ready.

We knew this when we started, and thought about our options, which really were:

  • Use cyrus replication and help bring it up to production readiness
  • Use some other replication method (e.g. block based replication via DRBD – http://www.drbd.org/)

We decided to go with cyrus replication because with block level replication, you’re still not protected from kernel filesystem bugs. If the kernel screws up and writes garbage to a filesystem, both the master and replica are corrupted. Protection against filesystem corruption was one of our major goals with replication.

This wasn’t really that crazy, because we knew the main replication code itself came from David Carter at Cambridge (http://www-uxsup.csx.cam.ac.uk/~dpc22/), so the original code was already in use in a university environment. The problems were really to do with integrating those changes into the main cyrus branch and accommodating other new cyrus 2.3 features, so we thought it wouldn’t be that much work.

Unfortunately, it seemed that not that many people were actually using cyrus 2.3 replication, so ironing out the bugs took longer than expected. Additional problems included CMU adding largish new features (modsequence support) to the cyrus 2.3 branch itself, which totally broke replication.

Still, we spent quite a bit of time setting up small test environments for replication and, along with a few others, ironing out the bugs. Unfortunately, even after rolling out there were still other bugs present, and the CMU change that broke replication was particularly annoying, since it wasn’t immediately obvious and caused some downtime when we had to switch to the replica (basically replication appeared to work fine, but it turned out that when you actually tried to fetch a message from the replica, it was empty). After that disaster we implemented some code that allows us to run replication “tests” on users, to check that what the master IMAP server presents to the world is exactly the same as what the replica IMAP server presents to the world. We now run that on a regular basis.
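
The test code itself is internal, but the shape of such a check is simple. Here’s a minimal sketch using Python’s imaplib; the hostnames and credentials are placeholders, and a real test would cover every folder and spot-check message contents too, not just UIDs and flags:

    # A sketch of this kind of master-vs-replica consistency check, not the code
    # we actually run. Hostnames and credentials are placeholders.
    import imaplib
    import re

    def snapshot(host, user, password, folder="INBOX"):
        """Return {uid: flags} for one folder, exactly as that server presents it."""
        conn = imaplib.IMAP4_SSL(host)
        conn.login(user, password)
        conn.select(folder, readonly=True)
        typ, data = conn.uid("FETCH", "1:*", "(FLAGS)")
        state = {}
        if typ == "OK":
            for item in data or []:
                if not item:
                    continue
                line = item.decode() if isinstance(item, bytes) else str(item)
                m = re.search(r"UID (\d+) FLAGS \(([^)]*)\)", line)
                if m:
                    state[int(m.group(1))] = frozenset(m.group(2).split())
        conn.logout()
        return state

    def compare(master_host, replica_host, user, password):
        a = snapshot(master_host, user, password)
        b = snapshot(replica_host, user, password)
        return {
            "missing_on_replica": sorted(set(a) - set(b)),
            "extra_on_replica": sorted(set(b) - set(a)),
            "flag_mismatches": sorted(u for u in set(a) & set(b) if a[u] != b[u]),
        }

    if __name__ == "__main__":
        print(compare("imap-master.example.com", "imap-replica.example.com",
                      "testuser", "secret"))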

A few example postings to the cyrus mailing list with some details:

http://lists.andrew.cmu.edu/pipermail/info-cyrus/2006-August/023331.html
http://lists.andrew.cmu.edu/pipermail/info-cyrus/2006-July/022595.html
http://lists.andrew.cmu.edu/pipermail/info-cyrus/2006-August/023336.html
http://lists.andrew.cmu.edu/pipermail/info-cyrus/2006-May/021919.html

2. Our original replication setup was flawed

There are a number of ways to do replication. The most obvious is to have one machine as the master and a separate one as the replica. That’s a waste, however, because the replica doesn’t need as many resources as the master (one writer, no readers). So our plan was to have replica pairs, with half the masters on one machine replicating to replicas on the other, and vice versa. This would provide better performance in the general case when both machines were up.

The problem with this is that it turned out to be a bit inflexible, and when one machine goes down, the “master” load on the other machine doubles. It also means the second machine then becomes a single point of failure until the other machine is restored. Neither of these is nice.

After a bit of rethinking, we came up with the new slots + stores architecture (see Bron’s posts elsewhere). Basically, everything is now broken into 300G “slots”, and a pair of these slots on 2 different machines makes a replicated “store”. The nice things about this approach are:

  • Each machine runs multiple cyrus instances. Each instance is smaller, can be stopped & started independently, can be moved more easily, restored more quickly from backup if needed, volume checked more quickly, etc. Smaller units are just easier to deal with
  • By spreading out each store pair to different machines, when one machine dies, the load is spread out to all the other servers evenly
  • Even after one machine dies, a second machine dying would only affect maybe one or two slots, rather than a whole machine’s worth of users

The downside to this solution is management. There are now many, many slots/stores to deal with, which means we had to write management tools.
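
As a toy illustration of why spreading the pairs around helps (the machine names, slot counts and pairing scheme below are made up, and this is nothing like our real management tools):

    # A toy illustration of the layout, not our management tooling: machine names,
    # store counts and the pairing scheme are all made up.
    from itertools import combinations

    MACHINES = ["imap1", "imap2", "imap3", "imap4"]

    def build_stores(machines, stores_per_pair=2):
        """Each unordered pair of machines hosts some replicated stores: a ~300G
        master slot on one machine and its replica slot on the other."""
        stores = []
        store_id = 1
        for a, b in combinations(machines, 2):
            for _ in range(stores_per_pair):
                # Alternate which side is master so load is balanced while both are up.
                master, replica = (a, b) if store_id % 2 else (b, a)
                stores.append({"store": store_id, "master": master, "replica": replica})
                store_id += 1
        return stores

    def failover_targets(stores, dead_machine):
        """When one machine dies, each of its master slots fails over to its own
        replica, and those replicas sit on different machines, spreading the load."""
        return {s["store"]: s["replica"] for s in stores if s["master"] == dead_machine}

    if __name__ == "__main__":
        print(failover_targets(build_stores(MACHINES), "imap1"))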

Had we gone with this from the start, it would have saved time. On the other hand, it was only really clear that this was a better solution after we went down the first road and saw the effects. Hindsight is a wonderful thing.

3. The original servers we bought proved to be less reliable than expected.

Because we knew we had replication, and because we knew we had a very specific setup we wanted (2U server, 12 drives, 8 x high capacity SATA, 4 x high speed SATA, RAID controller with battery backup, etc) that IBM couldn’t deliver, we went with a third-party supplier. (http://blog.fastmail.fm/?p=524)

Suffice to say, this was a mistake. There is a big difference between hardware that runs stable for years and hardware that runs stable for months. Replication should be more of a “disaster recovery” or “controlled failover” scenario; it shouldn’t replace very reliable hardware.

We went back to equipment we trusted (IBM servers + external SATA-to-SCSI storage units). It’s a pity IBM are now 2.5 months late on delivering the servers they promised us. Trust me, we’ve already complained to them pretty severely about this. It’s lucky we were able to re-purpose some existing servers for new replicated roles.

So, all up, how would I summarise it?

Had we followed the “perfect” path straight up, things would have gotten to the fully replicated stage faster, though not enormously so; the debugging and software stage still took quite some time, and it was more the hardware that slowed us down. On the other hand, the “perfect” path is often only visible with the benefit of hindsight. Additionally, by following some dud paths now, you learn not to take them again in the future.

Additional:

I’ve mentioned this in other posts now, but I should reiterate that 85% of users were on replicated stores when this failure occurred. As Bron has mentioned, had it happened 1-2 weeks later, no-one would have noticed because that machine would have been out of service. This is actually part of the reason that, as soon as the restore was done, we could say “everyone was replicated”. So it’s not like 11 months had passed and nothing had happened.

  1. We’d chosen, tested and helped debug a replication system
  2. We’d built 2 actual replication setups, scrapping the first after we realised there was a better arrangement
  3. We’d bought and organised 2 sets of extra hardware
  4. We’d already moved 85% of our user base to completely new servers

All users now on replicated servers

All users’ email is now on replicated servers. This means that every email delivered or deleted and every email action performed is replicated within a second to a completely separate server with a completely separate copy of all users’ emails.

We now have at least three levels of redundancy, with three copies of every email, and all of those copies are themselves on redundant RAID storage.

  1. All users now have their email stored on a system with RAID disks, and all servers and RAID arrays have dual power supplies.

    This means a single drive or power supply failure should cause no interruption to service at all; we just replace the drive/power supply while the system is live and online. Hard drives and power supplies are the most commonly failing hardware components in computer systems.

  2. All users now have their email replicated to an identical replica system (RAID drives, dual power supplies, etc). Each system is completely separate: its own operating system, filesystem, drives, power, connections, etc. The replication is performed at the semantic email level, not at the filesystem level, so a filesystem corruption on the source server will not be replicated. This means that if there is disk or filesystem corruption on a single machine, we can just switch to the replica (fail over) and it won’t cause a multi-day outage.

    The failover is not automatic; it is manual. Thus, depending on the actual problem that occurs and our ability to analyse and respond, it should take on the order of minutes to an hour to fail over to a replica if we decide it’s needed. In some cases, we may decide it’s easier and safer to reboot a frozen or crashed machine than to fail over to the replica, so outages of up to an hour are still possible. If we believe the outage is going to go over that time, we will most likely fail over to the replica.

    We can also use the failover ability to do maintenance on machines more easily. If we decide a machine needs servicing (kernel upgrade, hardware change, etc), we can safely fail over to the replica machine, do the work, start the machine up again and wait for replication to catch up, then fail back to the machine (there’s a sketch of this flow after the list below). For users, the only visible downtime will be the controlled failover portion, which is usually on the order of 1 minute or so.

  3. All users have their email store backed up incrementally each night to a separate system and RAID array. The backups of email are kept for 1 week after the email is deleted, to allow restoring in case of accident. In an emergency situation, if both a master and a replica server should fail catastrophically, we can still perform a restore from this backup.
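
For the curious, the controlled-failover flow mentioned in point 2 looks roughly like the outline below; every function is a stand-in for internal tooling that isn’t published, so treat it as a sequence in code form rather than something we actually run:

    # Outline of the controlled-failover flow from point 2 above. Every function
    # is a stand-in for unpublished internal tooling.
    import time

    def replication_lag(store):
        # Placeholder: a real check would compare master and replica sync state.
        return 0

    def step(store, action):
        print("%s: %s" % (store, action))

    def controlled_failover(store, maintenance):
        step(store, "stop master, drain writes")
        while replication_lag(store) > 0:      # let the replica catch up fully
            time.sleep(1)
        step(store, "promote replica (the ~1 minute of visible downtime)")
        step(store, "maintenance on old master: " + maintenance)
        step(store, "restart old master as a replica")
        while replication_lag(store) > 0:      # let it catch up before failing back
            time.sleep(1)
        step(store, "fail back to the original master")

    if __name__ == "__main__":
        controlled_failover("store23", "kernel upgrade")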

We believe that this will provide us the highest possible reliability while still allowing us to continue to grow our user base.
