Tuesday, March 31, 2009

Networking Resource

Here's an article with lots of links to classic networking techniques:

Here are some research papers leaning toward large scale issues:

Friday, March 27, 2009

Entity Migration Mechanism

From earlier posts, we see that to achieve scalability, we need:
  • large numbers of simulators
  • an ability to load balance based on load (not just geography)
  • an authoritative simulator to avoid DB bottlenecks
  • a single-write paradigm to avoid overly complex synchronization
And since our world scales with the number of Entities, not the number of functions in the game, then load balancing, and thus, scalability is realized using Entity migration.

Setting aside the policy and impetus for initiating an Entity migration, lets talk about the mechanics. By separating policy and mechanism, we can experiment or customize the policy to use application-specific information, resulting in a closer to optimal solution. And we won't have to reimplement the mechanism each time.

We know we can run an Entity on any host by using interest management as discussed earlier to feed an Entity everything it needs to operate correctly. So all we really need to realize a migration is:
  • getting the Entity state onto the new host
  • getting the data flowing to that host that is needed by that Entity
  • doing this quickly enough that there are no hiccups visible to the players
  • avoid all ordering and race conditions so there is no game logic difference compared to not migrating (no side effects)
  • survive crashes of any component at the worst possible moment (i.e. preserve important transactionality) without significant impact to the players
First, we use a data-driven means to identify which state variables need to be transferred to the new host. There is no reason to transfer truly temporary variables, but there are reasons to transmit variables that are not needed in the persistent database. E.g. current target. There are many mechanisms to serialize an entity and reconstitute it. One challenging aspect is whether to transfer the execution context (e.g. the stack and program counter) if your simulator uses coroutines to support blocking and waiting in a Behavior script. For example, Stackless Python is famous for pickling coroutines and reconstituting them.

There is a handshake needed to get this state across:
  • suspend further execution of the entity so things don't change during the migration
  • transmit the state
  • recreate the entity on the target host
  • resume execution of the entity.
Seems simple, but consider that update messages to players from the original and destination host might get out of order, the DB request queue might get backed up on the source (after all you are migrating away from a busy simulator) and save requests to the DB might get out of order, replicated state on the source and target might be at different versions (the entity may see a neighbor jump backward or forward in time).

So we need to add some steps:
  • Get the target subscriptions set up and acknowledged before the transfer so when the Entity arrives, all data is available there that it had in its original location
  • Have the original simulator "flush" its DB queue so the DB never sees out of order persistence requests, and then stop persisting that entity until after the migration.
  • increment an "epoch" counter to allow us to discard any replication messages or requests from the past
  • Given the increase in time and complexity, it may be worth optimizing the process by pre-loading the target host without actually pausing the original entity. Then once everything is set up, resend states that may have changed during the preload. Of course, you might also make use of any state that was previously replicated to the target.
There a quite a few trick available and needed to get this to come out right and be efficient. But distributed transactions like this happen reliably in a lot of "critical" systems in other industries so it is quite solvable. You can see how the need for migration and the requirement to do it quickly without player-visible hitches requires us to adopt many of the design principles already accepted: authoritative simulator, interest management, data-driven persistence and replication traits describing entity state variables, ... All of these key features are intertwined, so if your systems goes off track somewhere there, you may be buying a lot trouble elsewhere.

One of the coolest features of interest management is that you can choose to not migrate and the game still runs the same (but may use more datacenter-only networking). So if you can't migrate an entity until it finishes a behavior (because you can't migrate your stack), no problem, just wait. Program that into your policy. If you find that hitches are visible but only when a player is in a heavy combat situation, your policy can delay initiating the migration until the participants have been quiescent for a while.

Monday, March 16, 2009

Debugging leaked references

Everyone likes using smart pointers and reference counting to manage the lifetime of objects that are passed through modular interfaces. They can avoid a lot of copying and callers don't have to read a lot of documentation to figure out if they are responsible for deleting the object.

One big problem with using ref counted objects is when they "leak". It is extremely difficult to debug what code has taken an extra reference to an object that you think should have a single remaining reference. Why didn't this object destroy itself when I cleared this "last" smart pointer? When an object is passed through many layers and stored in various containers, there can be quite a few increment/decrement references occurring, so putting in breakpoints is not very useful. I have found this to be one of the most difficult and tedious problems to fix in a large application like a game.

Regular memory leak detectors are not too useful, since they just say "it leaked", not why. What you really want to know is the location of every outstanding reference to a block that you think should have already been deleted. You can ask this question at the end of a run on the unexpectedly outstanding blocks, or in the middle of a run when you have found something strange and want to backtrack.

It is "easy" but not much use getting the memory address of every smart pointer that points at the problem object. Make multi-map keyed on object addresses. Each time you assign or clear a smart pointer (increment or decrement a reference on an object), modify that map to include or exclude the address of that smart pointer. Done.

Making this "useful" is a matter of recording information about each smart pointer for later logging. Ideally, you would record a partial stack trace so you can see where the smart pointer was affected and what caused that. This requires another chunk of data beyond the smart pointer's memory location.

If you are only concerned about extra references, you could probably get away with recording the stack trace only on addRef. Note that for each smart pointer, there is only a single outstanding reference, so you only need the one stack trace per smart pointer.

If you have a situation where you have too many decRefs, you may need to record the stack trace of each addRef and decRef for the broken object. Garbage collecting that data is going to be interesting, because you can never be sure when an extra decRef might happen.

Putting this together, you can use regular memory leak detection to identying unexpectedly outstanding ref counted objects. Then ask this ref count debugging system for the stack trace(s) of the where the smart pointers where initialized that are still holding a reference to that object.

Friday, March 13, 2009

What is a memory leak and why do we care?

I am a big fan of Purify. It is a tool that helps programmers deal with the challenging memory systems of C and C++ . In those languages, we can "leak" memory.

We try to get rid of all the leaks (usually a few days before we ship). I've seen this take quite a lot of effort. Superficially, it seems like wasted effort since modern operating systems keep track of your memory and recover it when your processes exits. So, really, we could just exit and kill the process. The app would shut down a whole lot faster.

Why does that sound like such a bad idea? Why do we even care about leaks?
  • We don't want to have our process use more resources than it needs. It could crash or make other apps unhappy.
  • It makes for higher quality software. We are strictly disciplined about who owns what.
  • Managing the scope/lifetime of an object is occasionally very important. For example, we may be holding some I/O that needs to be flushed.
There are two competing definitions of leak. Obviously I like the Purify definition better. I'm going to argue that the other (more common) definition is an oversimplification and a compromise:
  • Precise: a leak is an allocated block of memory that you don't have a pointer to anymore. So you can't ever delete it. A potential leak is one where you only have a pointer to an address somewhere in the middle of the block. This might happen if you do some funky pointer arithmetic, and intend to undo it to later delete the block. Anything else is referred to as an in-use allocation.
  • Traditional: a leak is a block that is still allocated (outstanding) when the application shuts down.
Most custom memory allocators use the traditional definition, because developers have no good way of providing the more precise definition. They require the application writer to build good quality code that carefully deletes all the memory it allocates such that none is left outstanding at the end of execution. Certainly that definition subsumes the Precise definition. The traditional definition is saying there are no leaks, potential leaks, nor outstanding allocations. Nothing. Period. How can you argue with that?

The tools used are simple. They keep track of outstanding allocations, but don't and can't subtract precise leaks. They have neat ways of showing where the memory was allocated, and so on. The app writer finds all those outstanding allocations and carefully clears them as the application shuts down.

But you don't super-need-to clear the in-use memory. It is not hurting anything. You could delete it any time, if you really wanted to (sounds like an addict). OK. There is one way it can hurt something. You could accidentally have lists or free-lists that bloat up and it will look like you really are using that memory. However, since it is in-use, you should be able to instrument your lists as watch them bloat up.

You do super-need-to clear precise leaks. They are caused by overwriting or clearing your pointers without doing the deallocate first. If you don't plug those leaks, you are going to sink, or crash, or lose oil pressure or air pressure, or some other analogy.

How could Purify possibly differentiate between precise leaks and in-use? Believe it or not, it watches every pointer assignment. And sees when the last reference to a block is lost. It does this with something called Object Code Instrumentation, so it doesn't care what compiler you are using. It inserts assembly code into your executable (e.g. around assignments), fixes up jumps and symbol addresses it changes when making room for the inserted code. It consequently knows what you have done or accidentally done to every pointer (including nasty things with pointer math).

As a result it can focus the coders attention on blocks that are unreferencable. It can even (theoretically) throw a breakpoint at the instruction that overwrites the pointer. I know it can break at the instruction where you read an uninitialized variable. At any point during debugging, you can make a call and have it dump unreferencable blocks of memory and where they were allocated. Of course you can also dump all in-use blocks by calling a function when paused in the debugger. I believe you can make a checkpoint and then later dump all blocks that are in-use that were allocated since the checkpoint.

If you insert some code in your custom allocator, the Purify SDK can even use your allocator to do its magic. You could make calls to it at run time to dump metrics or react to issues.

As you can see, unreferencable blocks are the real leaks. We only have to clear out all in-use blocks because we don't use tools like Purify, and have overreact and compromise. I don't like the busy work of clearing every last legitimate in-use block. I don't think I should have-to-have-to. (As long as I'm careful about my "other" resources like I/O buffers.) It makes for better code if I do, but if I want to trade time for quality or one feature for another, I still have to clear the real leaks, and like the option of ignoring the other blocks swept up by the tradition definition.

It has a bunch of other cool stuff too. Like knowing when you access memory that is uninitialized, or already deallocated. It knows when you run off the end of an array. It can instrument third party object code even if you don't have source, and it doesn't use your allocator.

Of course there is a runtime performance hit, but it isn't super bad. And you can control which modules are instrumented, so you might exclude your rendering for example.

It has a UI that integrates to Visual Studio and gives you a nice report of these various kinds of memory errors at the end of your run, or whenever you ask. It will even take you to the offending line of code.

Don't balk at the price either. Just remember the amount of time the poor guy spent that was stuck with clear every last in-use block. It more than pays for itself very quickly.