Friday, March 27, 2009

Entity Migration Mechanism

From earlier posts, we see that to achieve scalability, we need:
  • large numbers of simulators
  • an ability to load balance based on load (not just geography)
  • an authoritative simulator to avoid DB bottlenecks
  • a single-write paradigm to avoid overly complex synchronization
And since our world scales with the number of Entities, not the number of functions in the game, load balancing, and thus scalability, is realized through Entity migration.

Setting aside the policy and impetus for initiating an Entity migration, let's talk about the mechanics. By separating policy from mechanism, we can experiment with or customize the policy to use application-specific information, resulting in a closer-to-optimal solution. And we won't have to reimplement the mechanism each time.

We know we can run an Entity on any host by using interest management as discussed earlier to feed an Entity everything it needs to operate correctly. So all we really need to realize a migration is:
  • getting the Entity state onto the new host
  • getting the data flowing to that host that is needed by that Entity
  • doing this quickly enough that there are no hiccups visible to the players
  • avoiding all ordering problems and race conditions so there is no game logic difference compared to not migrating (no side effects)
  • surviving crashes of any component at the worst possible moment (i.e. preserving important transactionality) without significant impact to the players
First, we use a data-driven means to identify which state variables need to be transferred to the new host. There is no reason to transfer truly temporary variables, but there are reasons to transmit variables that are not needed in the persistent database, e.g. the current target. There are many mechanisms to serialize an entity and reconstitute it. One challenging aspect is whether to transfer the execution context (e.g. the stack and program counter) if your simulator uses coroutines to support blocking and waiting in a Behavior script. For example, Stackless Python is well known for being able to pickle its coroutines (tasklets) and reconstitute them.
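
To make the "data-driven" idea concrete, here is a minimal Python sketch. The `trait` helper, the `persist`/`migrate` flags, and the `Monster` entity are all hypothetical names invented for illustration; the point is only that each state variable carries metadata saying whether it goes to the database, across a migration, both, or neither. It serializes plain data only and does not attempt to move an execution context.

```python
import json
from dataclasses import dataclass, field, fields


def trait(default, *, persist=False, migrate=False):
    """Declare a state variable along with its (hypothetical) traits."""
    return field(default=default, metadata={"persist": persist, "migrate": migrate})


@dataclass
class Monster:
    # Saved to the database and carried across a migration.
    health: int = trait(100, persist=True, migrate=True)
    # Not worth persisting, but the new host needs it to carry on.
    current_target: str = trait("", migrate=True)
    # Truly temporary: neither persisted nor migrated, rebuilt on demand.
    path_cache: str = trait("")


def select(entity, trait_name):
    """Return only the variables whose metadata sets the given trait."""
    return {
        f.name: getattr(entity, f.name)
        for f in fields(entity)
        if f.metadata.get(trait_name)
    }


def pack_for_migration(entity) -> bytes:
    """Serialize the migratable subset of the entity's state."""
    return json.dumps(select(entity, "migrate")).encode("utf-8")


def unpack(cls, payload: bytes):
    """Reconstitute an entity; unmigrated variables take their defaults."""
    return cls(**json.loads(payload.decode("utf-8")))
```

With this in place, `unpack(Monster, pack_for_migration(m))` yields a copy that keeps `health` and `current_target` but starts with an empty `path_cache`, while `select(m, "persist")` gives the smaller set that would go to the database.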

There is a handshake needed to get this state across:
  • suspend further execution of the entity so things don't change during the migration
  • transmit the state
  • recreate the entity on the target host
  • resume execution of the entity.
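
As a rough illustration, the four steps above might look like this in Python. The `Simulator` class and `migrate` function are hypothetical, both hosts live in one process, and a dictionary copy stands in for the network transfer, so none of the distributed failure modes show up here.

```python
class Simulator:
    """Toy stand-in for a simulator host (hypothetical, in-process only)."""

    def __init__(self, name):
        self.name = name
        self.entities = {}      # entity_id -> state dict
        self.suspended = set()  # entity ids that must not execute

    def tick(self, entity_id):
        """Run one step of the entity; return False if it cannot run here."""
        if entity_id in self.suspended or entity_id not in self.entities:
            return False
        self.entities[entity_id]["ticks"] += 1
        return True


def migrate(entity_id, source, target):
    # 1. Suspend further execution so state cannot change mid-transfer.
    source.suspended.add(entity_id)
    # 2. Transmit the state (a copy stands in for the wire format).
    payload = dict(source.entities[entity_id])
    # 3. Recreate the entity on the target host.
    target.entities[entity_id] = payload
    # The source gives up its copy, preserving the single-writer rule.
    del source.entities[entity_id]
    source.suspended.discard(entity_id)
    # 4. Resume: the target was never suspended, so it may tick right away.
```

After `migrate("e1", a, b)`, calling `a.tick("e1")` returns `False` and `b.tick("e1")` continues from the tick count the entity had on `a`.
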
Seems simple, but consider what can go wrong: update messages to players from the original and destination hosts might arrive out of order; the DB request queue might be backed up on the source (after all, you are migrating away from a busy simulator) and save requests to the DB might get out of order; and replicated state on the source and target might be at different versions (the entity may see a neighbor jump backward or forward in time).

So we need to add some steps:
  • Get the target subscriptions set up and acknowledged before the transfer, so that when the Entity arrives, all the data it had in its original location is available there
  • Have the original simulator "flush" its DB queue so the DB never sees out-of-order persistence requests, and then stop persisting that entity until after the migration
  • Increment an "epoch" counter to allow us to discard any replication messages or requests from the past
  • Given the increase in time and complexity, it may be worth optimizing the process by pre-loading the target host without actually pausing the original entity. Then once everything is set up, resend states that may have changed during the preload. Of course, you might also make use of any state that was previously replicated to the target.
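
The epoch counter in particular is easy to sketch. The Python below is one possible reading, not a definitive design: it assumes every replication message or request is stamped with the sender's epoch for that entity, and that the epoch is bumped once per migration. `Update` and `EpochFilter` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Update:
    entity_id: str
    epoch: int   # bumped each time the entity migrates (assumption)
    data: dict


class EpochFilter:
    """Receiver-side check that drops messages from before a migration."""

    def __init__(self):
        self.current = {}  # entity_id -> highest epoch seen so far

    def accept(self, msg):
        known = self.current.get(msg.entity_id, 0)
        if msg.epoch < known:
            return False  # stale: sent by the pre-migration host
        self.current[msg.entity_id] = msg.epoch
        return True
```

Once a receiver has seen an epoch 2 message for an entity, a late epoch 1 message from the old host is rejected, while messages for other entities are unaffected. Ordering within a single epoch would still need something like a sequence number, which this sketch leaves out.
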
There are quite a few tricks available and needed to get this to come out right and be efficient. But distributed transactions like this happen reliably in a lot of "critical" systems in other industries, so it is quite solvable. You can see how the need for migration, and the requirement to do it quickly without player-visible hitches, requires us to adopt many of the design principles already accepted: authoritative simulator, interest management, data-driven persistence and replication traits describing entity state variables, ... All of these key features are intertwined, so if your system goes off track somewhere there, you may be buying a lot of trouble elsewhere.

One of the coolest features of interest management is that you can choose to not migrate and the game still runs the same (but may use more datacenter-only networking). So if you can't migrate an entity until it finishes a behavior (because you can't migrate your stack), no problem, just wait. Program that into your policy. If you find that hitches are visible but only when a player is in a heavy combat situation, your policy can delay initiating the migration until the participants have been quiescent for a while.
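
A policy check along these lines could be as small as the following Python sketch. `EntityStatus`, `may_migrate`, and the 30-second default are hypothetical; a real policy would presumably draw on whatever application-specific information the game has.

```python
from dataclasses import dataclass


@dataclass
class EntityStatus:
    running_behavior: bool    # a script is mid-execution, so the stack cannot move
    last_combat_time: float   # seconds, on the same clock as `now`


def may_migrate(status, now, quiet_seconds=30.0):
    """Return True only when a migration is unlikely to be noticed."""
    if status.running_behavior:
        return False  # just wait until the behavior finishes
    # Require the entity to have been out of combat for a while.
    return now - status.last_combat_time >= quiet_seconds
```

Because not migrating is always safe, returning `False` here costs nothing in correctness; the load balancer simply asks again later.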
