Thursday, April 30, 2009

Why would you want gateway boxes?

An important aspect of an MMO shard is where and how clients connect. My preference is to have a set of Gateway boxes that are responsible for managing those connections. Their responsibilities include:
  • Authentication handshakes with the client and account system
  • Rapid filtering of malicious traffic (IP filtering or more), denial of service attacks, smurfing (billions of identical, apparently innocuous/legal requests), and so on
  • Separation of responsibility. This is valuable in a memory and cache efficiency sense, in that a single machine is focusing on just one thing. Simulators don't have to keep track of clients. They only connect to GW boxes.
  • Instead of N clients each connecting to up to K simulators (N*K connections), you have only N client connections, each to just one of J gateways, and the K simulators connect only to the J gateways (N + J*K connections in total). Since N is much bigger than J, this is a huge advantage in connection count, connection management processing overhead, and memory on both client and server hosts. Normally K is quite a bit bigger than J (say 4 times).
  • Message exploding and the majority of connections happen inside the data center over high-speed switches, in a secure environment on the backend switch.
  • By focusing only on GW functions, and not running game logic, they can be made much more reliable than simulator boxes/processes. This can help a lot with player experience during fault recovery.
  • The Gateway boxes are the only ones with public IP addresses, so it allows a large fraction of your shard to be secure by having no direct route from the WAN. The idea here is GW boxes have (at least) two NICs, one on the switch with the main firewall, the other on the backend network.
  • This also has physical network topology benefits, since the back end hosts can be on their own switch.
  • Message header overhead is reduced when sending to a client, since all data to one client comes from a single shard host, which can bundle messages before sending them over the WAN (where it matters most).
  • Gateways can also "be" the lobby or character selector prior to entering the game world.
  • Non-simulation messages like chat or game service stuff (auctions, guild management, email, patching/streaming of content, ...) don't bother the simulators.
  • The sizing and configuration needed to handle your peak player connection count are now independent of those for the simulators and their load balancing.
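
To make the connection-count argument concrete, here is a small sketch of the arithmetic. The shard sizes (20,000 clients, 10 gateways, 40 simulators) are made-up numbers for illustration only, and the direct-connect figure is the upper bound where every client may talk to every simulator.

```python
def direct_connections(n_clients: int, k_simulators: int) -> int:
    """Upper bound when every client may connect to every simulator (N*K)."""
    return n_clients * k_simulators


def gateway_connections(n_clients: int, j_gateways: int, k_simulators: int) -> int:
    """One gateway connection per client, plus every simulator to every gateway (N + J*K)."""
    return n_clients + j_gateways * k_simulators


# Hypothetical shard sizes, chosen only to show the scale of the difference.
N, J, K = 20_000, 10, 40
direct = direct_connections(N, K)
via_gateways = gateway_connections(N, J, K)
print(f"direct: {direct}, via gateways: {via_gateways}")
```

With these assumed numbers, almost all of the remaining connections are the N client connections themselves; the J*K backend links are a rounding error.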
I also subscribe to the philosophy of persistent client connections (or with connectionless protocols, staying with one assigned gateway). The major benefit of this is that a client does not have to reauthenticate and renegotiate their connection with another host in the shard when their character moves around in the world, or some other load balancing activity changes the simulators they need to interact with.

To do this, the GW is also responsible for routing messages between the client and the simulators that are "of interest". This gets back to category based routing and channel managers discussed earlier. Data from multiple simulators is sent to the GW box and forwarded to each interested client.

Data from the client tends to be routed to the one simulator that client is currently using, i.e. the one that owns its "controller" entity, where client requests are validated and (normally) where the player character is owned/simulated.
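
The two routing directions can be sketched roughly as follows. This is a toy in-memory model, not an actual gateway implementation: the `Gateway` class, its method names, and the use of plain lists as stand-ins for network sends are all assumptions made for illustration.

```python
from collections import defaultdict


class Gateway:
    """Toy router; a real gateway would do this over sockets."""

    def __init__(self):
        # category -> set of client ids interested in that category
        self.subscribers = defaultdict(set)
        # client id -> id of the simulator owning that client's controller entity
        self.controller_sim = {}
        # Outboxes stand in for network sends (hypothetical).
        self.to_client = defaultdict(list)
        self.to_simulator = defaultdict(list)

    def subscribe(self, client_id, category):
        self.subscribers[category].add(client_id)

    def assign_controller(self, client_id, sim_id):
        # Called when movement or load balancing changes the owning simulator.
        # The client's connection to this gateway is untouched.
        self.controller_sim[client_id] = sim_id

    def on_simulator_message(self, category, payload):
        # Fan one simulator message out to every interested client.
        for client_id in sorted(self.subscribers.get(category, ())):
            self.to_client[client_id].append(payload)

    def on_client_message(self, client_id, payload):
        # Client requests go only to the simulator that owns the controller.
        sim_id = self.controller_sim[client_id]
        self.to_simulator[sim_id].append((client_id, payload))


gw = Gateway()
gw.subscribe("alice", "zone-7")
gw.subscribe("bob", "zone-7")
gw.assign_controller("alice", "sim-1")
gw.on_simulator_message("zone-7", "npc moved")
gw.on_client_message("alice", "cast fireball")
gw.assign_controller("alice", "sim-2")  # handoff; alice stays on the same gateway
gw.on_client_message("alice", "jump")
```

Note that the handoff from `sim-1` to `sim-2` only updates a table entry on the gateway, which is the point of keeping the client connection persistent.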

You want multiple gateway processes (not one multi-threaded process) per gateway box, so that a crash takes down fewer player connections (each of which then has to reauthenticate, etc.). This also helps with per-process file descriptor limits on connections, if your OS configuration restricts you.
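
As a rough sizing sketch for that trade-off: the function names, the `reserved_fds` allowance for logs and backend sockets, and the example numbers below are all assumptions, and the per-crash figure assumes connections are spread evenly across processes.

```python
import math


def gateway_processes_needed(peak_connections: int, fd_limit: int, reserved_fds: int = 64) -> int:
    """Minimum processes on one box so no process exceeds its descriptor limit."""
    usable = fd_limit - reserved_fds  # descriptors left for client connections
    if usable <= 0:
        raise ValueError("fd_limit must exceed reserved_fds")
    return math.ceil(peak_connections / usable)


def connections_lost_per_crash(peak_connections: int, processes: int) -> int:
    """Worst-case connections dropped when one process dies, assuming an even spread."""
    return math.ceil(peak_connections / processes)


# Hypothetical box: 5000 peak connections, 1024 descriptors per process.
procs = gateway_processes_needed(peak_connections=5000, fd_limit=1024)
lost = connections_lost_per_crash(5000, procs)
print(f"{procs} processes, at most {lost} connections lost per crash")
```

Running more processes than the minimum shrinks the number of players affected by any single crash further, at the cost of more backend connections per box.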

There are downsides, but they are not overwhelming:
  • An extra hop for most messages. This hop is on a datacenter switch, and will be very fast.
  • There are extra machines to buy. Well, not really: the same message handling work is being done, just not directly by the simulators, so each simulator can get more done (and that has other, more subtle communication benefits). We just partition the work and use the same number of machines.
  • An extra switch and extra NICs. You can use two VLANs on any decent switch, if you have to.
In summary, you are just moving some work from one place to another in the same sized shard, but getting a lot of system simplicity, security, and communication benefits.
