Monday, January 18, 2010

Dynamic Load Balancing a Large Scale Online Game

Why bother designing your server to support dynamic load balancing? You can load test, measure and come up with a static load balance at some point before going live, or periodically when live. But...
  • Your measurements and estimates will be wrong. Be honest, the load you simulated was at best an educated guess. Even a live Beta is not going to fully represent a real live situation.
  • Hardware specs change. It takes time to finish development, and who knows what hardware is going to be the most cost effective by the time you are done. You definitely don't want to have to change code to decompose your system a different way just because of that.
  • Your operations or data center may impose something unexpected, or may not have everything available that you asked for. You might think "throw more hardware at the problem". But if they are doing their jobs, they won't let you. And if you are being honest with yourself, you know that probably wouldn't have worked anyway.
  • Hardware fails. You may lose a couple of machines and not be able to replace them immediately. Even if you shut down, reconfigure, and restart a shard, the change to the load balance must be trivial and quick. The easiest way is to have the system itself adjust.
  • Your players are going to do many unexpected things. Like all rush toward one interesting location in the game. Maybe the designers do this on purpose with a holiday event. They would really appreciate it if your system could stand up to such a thing, so they *could* please the players that way.
  • The load changes in interesting sine waves. Late at night and during weekdays, the load will be substantially less than at peak times. That is a lot of hardware just idling. If your system can automatically migrate load to fewer machines, and give back leased machines (e.g. emergency overload hardware), you might be able to cut a deal with your hosting service to save some money. Anybody know whether the "cloud" services support this? What if you are supporting multiple titles whose load profiles are offset? You could reallocate machines from one to another dynamically.
  • Early shards tend to have much higher populations than newly opened ones. As incentives to transfer to new shards start having an effect, hardware could be transferred so that responsiveness remains constant while population and density change.
  • The design is going to change. Both during development and as a result of tuning, patches and expansions. If you want to let the designers make the game as fun (and successful) as possible, you don't want to give them too many restrictions.
  • It may be painful to think about, but your game is going to eventually wind down. There is a long tail of committed players that will stay, but the population will drop. If you can jettison unneeded hardware, you can save money and make that time more profitable. (And you should encourage your designers to support merging of shards.) I am convinced that there is a lot of money left on the table by games that prematurely close their doors.
So what can you actually "balance"? You can't decompose a process and reallocate objects, computation, and data structures between processes. Not without reprogramming, or programming it in from the beginning. So load balancing entire processes is not likely to cut it.

The best way to do that is to design for parallelism from the beginning. Functional parallelism only goes so far. E.g. if you have only one process that deals with all banking, you can't split it when there is a run on the bank.

So what kinds of things are heavy users of resources? What things are relatively easy to decompose into lots of small bits (then recombine into sensibly sized chunks using load balancing)?

Here are some ideas:
  • Entities. Using interest management (discussed in other blog entries), an Entity can be located in any simulator process on any host. There are communication overheads to consider, but those are within the data center. If you are creative, many of the features that compose a server can be represented as an Entity, even though we often limit our thinking to them as game Entities. E.g. a quest, a quest party/group, a guild, an email, a zone, the weather, game and system metrics, ... And of course, characters, monsters, and loot. The benefit of making more things an Entity is that you can use the same development tools, DB representation, execution environment, scripting/behavior system and SDK, ... And of course, load balancing. There are usually a very large number of Entities in a game, making it pretty easy to find a good balance (e.g. bin-packing). Picking up an Entity is often a simple matter of grabbing a copy of its Properties (assuming you've designed your Entities with load balancing in mind to begin with). This can be fast because Entities tend to be small. Another thing I like about Entity migration is that there are lots of times when an Entity goes idle, making it easy to migrate without a player being affected at all. Larger "units" of decomposition are likely never to be dormant, so when a migration occurs, players feel a lag. (See the sketches after this list for what the migration and the bin-packing might look like.)
  • Zones. This is a pretty common approach, often with a number of zones allocated to a single process on a host. As load passes a threshold, a zone is restarted on another simulator on another machine. This is a bigger chunk of migration than an Entity, and it doesn't help when a single zone is overloaded. The designers have to add gameplay mechanisms to discourage too much crowding together. The zone size has to be chosen appropriately ahead of time. Hopefully the load-balancing zone is not the same as the game-play zone, or the content team will really hate you. Can you imagine asking them to redesign and lay out a zone because there was a server overload?
  • Modules. You will decompose your system design into modules, systems, or functional units. Making the computation of each of these mappable to any host requires little extra work. Although there are usually a limited number of systems (functional parallelism), and there is almost always a "hog" (see Amdahl's law). Extracting a Module and moving it requires quite a bit more unwiring than an Entity. Not my first choice. But you might rely on your fault tolerance system and just shut something down in one place, and have it restart elsewhere.
  • Processes. You may be in a position where your system cannot easily have chunks broken off and compiled into another process. In this case, only whole processes can be migrated (assuming they do not share memory or files). Process migration is pretty complicated and slow, given how much memory is involved. Again, your fault tolerance mechanism might help you. If you have enough processes that you can load balance by moving them around, you may also have a lot of overhead from things like messages crossing process boundaries (usually via system calls).
  • Virtual Machines. Modern data centers provide (for a price) the ability to re-host a virtual machine, even on the fly. Has anyone tested what the latency of this is? Seems like a lot of data to transmit. The benefit of this kind of thinking is that you can configure your shard in the lab without knowing how many machines you are going to have, and run multiple VMs on a single box. But you can't run a single VM on multiple boxes. So you have the tradeoff of too many VMs giving high overhead, and too few giving poor balancing options.
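To make the Entity approach concrete, here is a minimal sketch in Python of picking up an idle Entity by grabbing a copy of its Properties and handing it to another simulator. The names (Entity, Simulator, migrate_entity) are mine, not any particular engine's SDK, and a real system would also re-register interest subscriptions and re-route in-flight messages, which is where most of the actual work lives.

```python
# Hypothetical sketch: migrating an idle Entity by copying its Properties.
# Entity, Simulator and migrate_entity are illustrative names, not a real SDK.

class Entity:
    def __init__(self, entity_id, properties):
        self.entity_id = entity_id
        self.properties = dict(properties)   # small bag of key/value Properties
        self.busy = False                    # e.g. mid-combat, mid-trade

class Simulator:
    """Stands in for one simulator process; in reality these live on different hosts."""
    def __init__(self, name):
        self.name = name
        self.entities = {}

    def add(self, entity):
        self.entities[entity.entity_id] = entity

def migrate_entity(entity_id, source, target):
    """Move an Entity by copying its Properties, but only while it is idle."""
    entity = source.entities[entity_id]
    if entity.busy:
        return False                         # wait for a dormant moment instead
    snapshot = dict(entity.properties)       # cheap: Entities tend to be small
    del source.entities[entity_id]
    target.add(Entity(entity_id, snapshot))  # interest management re-routes traffic
    return True

# Usage: quest_42 sits idle on sim_a, so it can be picked up and dropped on sim_b.
sim_a, sim_b = Simulator("sim_a"), Simulator("sim_b")
sim_a.add(Entity("quest_42", {"stage": 3, "owner": "player_7"}))
migrate_entity("quest_42", sim_a, sim_b)     # returns True; quest_42 now lives on sim_b
```

The point is that the unit being moved is just a small bag of Properties, so the copy itself is cheap and can happen in a dormant moment without the player noticing.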
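And here is a sketch of the bin-packing side, assuming you have a rough per-Entity cost estimate (CPU time, message rate, whatever you can measure): a greedy pass that places the most expensive Entities first, each onto the currently least loaded simulator. The costs and simulator names below are placeholders; a production balancer would also try to keep chatty Entities together to limit cross-process messaging.

```python
# Hypothetical greedy bin-packing: place the most expensive Entities first,
# each onto the currently least loaded simulator. Costs are rough estimates.

def balance(entity_costs, simulator_names):
    """entity_costs: {entity_id: estimated load}; returns {simulator: [entity ids]}."""
    load = {name: 0.0 for name in simulator_names}
    assignment = {name: [] for name in simulator_names}
    for entity_id, cost in sorted(entity_costs.items(),
                                  key=lambda item: item[1], reverse=True):
        target = min(load, key=load.get)     # least loaded simulator so far
        assignment[target].append(entity_id)
        load[target] += cost
    return assignment

# Example with made-up costs: the heavy pieces end up spread across both simulators.
print(balance({"npc_17": 5.0, "player_42": 4.0, "zone_weather": 2.5, "guild_7": 1.0},
              ["sim_a", "sim_b"]))
# {'sim_a': ['npc_17', 'guild_7'], 'sim_b': ['player_42', 'zone_weather']}
```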
Remember, these things are different from one another:
  • Decomposition for good software engineering.
  • Decomposition for good parallel performance.
  • Initial static load balancing.
  • Creating new work on the currently least loaded machine.
  • Dynamically migrating work.
  • Dynamically migrating work from busy to less loaded machines (see the sketch after this list).
  • Doing it efficiently and quickly (without lags).
  • And having it be effective.
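The difference between those last few items is worth a sketch of its own. Creating new work on the least loaded machine is just the min() call from the bin-packing example above; dynamically migrating work from busy to less loaded machines needs a trigger and a choice of what is safe to move. The sketch below (again with a made-up threshold and a toy load metric) sheds the largest idle Entities from any simulator over a high-water mark to the currently least loaded one.

```python
# Hypothetical rebalance pass: shed idle work from an overloaded simulator
# to the least loaded one. The threshold and load metric are placeholders.

HIGH_WATER = 0.80   # fraction of capacity that triggers shedding

def rebalance(sims):
    """sims: {name: {"capacity": float, "entities": {eid: (cost, is_idle)}}}."""
    def load(name):
        total = sum(cost for cost, _ in sims[name]["entities"].values())
        return total / sims[name]["capacity"]

    for name in list(sims):
        while load(name) > HIGH_WATER:
            # Prefer idle Entities so players never feel the move.
            idle = [(cost, eid) for eid, (cost, is_idle)
                    in sims[name]["entities"].items() if is_idle]
            if not idle:
                break                         # nothing safe to move right now
            cost, eid = max(idle)             # move the biggest idle piece first
            target = min(sims, key=load)      # least loaded simulator overall
            if target == name:
                break                         # everyone else is just as busy
            sims[target]["entities"][eid] = sims[name]["entities"].pop(eid)

# Usage with toy numbers: the idle NPC moves, the busy player stays put.
sims = {
    "sim_a": {"capacity": 10.0, "entities": {"npc_1": (6.0, True), "player_9": (4.0, False)}},
    "sim_b": {"capacity": 10.0, "entities": {"guild_3": (1.0, True)}},
}
rebalance(sims)   # npc_1 (idle) moves to sim_b; player_9 (busy) is left alone
```

Preferring idle Entities is what keeps the migration invisible to players; if nothing is idle, the pass waits rather than causing a lag.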
I think balancing load is a hard enough problem that it can't really be predicted and "solved" ahead of time. So I like to give myself as much flexibility as possible ahead of time, and good tools. Even if you don't realize full dynamic migration at first, at least don't box yourself into a corner that requires rearchitecting.