Push events, NAT TCP connection timeouts, and device sleep

This is a technical post. Regular FastMail users subscribed to receive email updates from the FastMail blog can just ignore this post.

When we released the new user interface last year, one of the improvements included was push updates when new emails arrived.

In theory, push events are quite easy to do. We open a connection from your web browser to the server (see this blog post for details); when a new email arrives, we send a message down the open connection to let the browser know. The browser then fetches the details of the new email(s) and refreshes the display.
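The flow above can be sketched roughly as follows. This is only an illustration, not our actual client code: the `PushSource` interface, the `newmail` event name and the `subscribeToPush` helper are all made up for this example, and the interface is just the small part of the browser's EventSource API that the sketch needs.

```typescript
// Minimal shape of a push connection that this sketch relies on.
// (Hypothetical interface, modelled on the browser's EventSource.)
interface PushSource {
  addEventListener(type: string, listener: (event: { data: string }) => void): void;
}

// Hypothetical event name; the real event names are not given in this post.
const NEW_MAIL_EVENT = "newmail";

// The push message is only a notification. On receiving it, the client
// fetches the details itself and refreshes the display.
function subscribeToPush(
  source: PushSource,
  fetchAndRefresh: (payload: string) => void,
): void {
  source.addEventListener(NEW_MAIL_EVENT, (event) => {
    fetchAndRefresh(event.data);
  });
}
```

In a real page, the `source` argument would be something like an `EventSource` opened against the push URL, and `fetchAndRefresh` would make an ordinary request for the new message list.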

Unfortunately, in the real world, it’s not quite that easy. The biggest problem is that when a mailbox is mostly idle (no new mail arriving), the connection from the browser to the server will be idle. While this shouldn’t be a problem, it turns out it often is.

As we have noted before, some of our users are behind NAT gateways/stateful firewalls that have short state timeouts. If you leave a TCP connection idle for too long (anywhere from 2 to 30 minutes, depending on the device), these devices forget the connection and start dropping any further packets sent on it.

In the case of a push connection from the server to the client, this is particularly bad. When a new email arrives, the server tries to send data to the client and only finds out at that point that the connection is dead. That’s fine for the server, which can then clean up the connection. The client, however, never sees that data, and nothing tells it that the gateway/firewall has broken the connection, so it goes on believing it is still connected. This is purely a consequence of the way TCP works: an idle connection exchanges no packets, so a silently dropped connection looks exactly like a healthy one. The only way for the client to tell that the connection is broken is for it to send some data down the connection, and there are only two ways that can happen.

  1. The client has enabled TCP keepalive on the socket. Currently only Chrome on Windows does this.
  2. The client sends some data down the connection to the server. Unfortunately the EventSource specification doesn’t provide any way to do this; it basically assumes the underlying TCP connection is always reliable and that only the server ever sends to the client.

One way to try to work around this issue is for the server to send regular “ping” events to the client, often enough that the gateway/firewall knows the connection is still alive. This is relatively straightforward to do, but causes other problems.
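As a rough illustration of what such a ping looks like on the wire, here is a small sketch that formats an event in the text format used by EventSource streams. The `ping` event name, the timestamp payload and both function names are assumptions made for this example, not a description of our server.

```typescript
// Hypothetical event name for the keep-alive message.
const PING_EVENT = "ping";

// Format one event in the EventSource wire format: an "event:" line,
// one "data:" line per line of payload, then a blank line to end the event.
function formatSseEvent(type: string, data: string): string {
  const dataLines = data.split("\n").map((line) => `data: ${line}`);
  return [`event: ${type}`, ...dataLines, "", ""].join("\n");
}

// Build a ping carrying the send time (an assumption for this sketch).
// A non-empty data field is used because a client does not dispatch
// an event that arrives with no data field at all.
function buildPing(nowMs: number): string {
  return formatSseEvent(PING_EVENT, String(nowMs));
}
```

A server would write the result of `buildPing` to every open push connection from a repeating timer. The payload itself doesn't matter; the point is simply that some packets cross the gateway/firewall before its idle timer expires.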

If the ping events come too often, they can stop some clients from ever going into sleep mode. For instance, we used to send ping events every 60 seconds. With that interval, if you left the FastMail webpage open in Safari on an iPad and put the iPad down, the iPad would never actually go to sleep. The screen would stay on, draining the battery very quickly.

Because of that, we went the other way and disabled the ping events entirely, but that put us back at the other extreme, where push sometimes just seems to stop working at random.

As there’s no perfect solution to this problem, we’re now changing again, to a new trade-off.

  1. The server will send regular “ping” events to the client at 5 minute intervals. This should be enough for most gateways/firewalls to keep the connection open, but long enough apart to allow devices to go to sleep.
  2. If the client hasn’t seen a ping event for 6 minutes, it assumes the connection has died, drops the existing connection and creates a new one. This should at least allow push events to work to some extent on connections that go through gateways/firewalls with low timeouts.
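The client side of this trade-off can be sketched as a small watchdog. This is a minimal sketch under stated assumptions rather than our actual client code: the `PushWatchdog` class and `checkConnection` helper are hypothetical names, and the clock is passed in explicitly so the logic can be followed (and tested) without real timers.

```typescript
const PING_INTERVAL_MS = 5 * 60 * 1000; // server sends a ping every 5 minutes
const PING_TIMEOUT_MS = 6 * 60 * 1000;  // client gives up after 6 minutes of silence

// Tracks when the last ping was seen on the current connection.
class PushWatchdog {
  private readonly timeoutMs: number;
  private lastSeenMs: number;

  constructor(timeoutMs: number, nowMs: number) {
    this.timeoutMs = timeoutMs;
    this.lastSeenMs = nowMs;
  }

  // Call this each time a ping event arrives.
  notePing(nowMs: number): void {
    this.lastSeenMs = nowMs;
  }

  // True once the connection has been silent for longer than the timeout.
  isStale(nowMs: number): boolean {
    return nowMs - this.lastSeenMs > this.timeoutMs;
  }
}

// Run periodically on the client. If the connection looks dead, call the
// supplied reconnect function and restart the clock for the new connection.
function checkConnection(
  watchdog: PushWatchdog,
  nowMs: number,
  reconnect: () => void,
): boolean {
  if (!watchdog.isStale(nowMs)) {
    return false;
  }
  reconnect();
  watchdog.notePing(nowMs);
  return true;
}
```

The one-minute gap between the two constants gives a ping some slack to arrive late before the client decides the connection is gone. On a gateway/firewall with a timeout shorter than 5 minutes the connection will still be dropped, but the client now notices and reconnects instead of waiting forever.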

This change has now been rolled out everywhere. Based on initial testing, we think that this time we’ve got the balance between theory and reality right.

Posted in Technical. Comments Off