Small and large scale online game techniques. Parallel and distributed systems. Technical philosophy. Some software engineering.
Tuesday, March 31, 2009
Here's an article with lots of links to classic networking techniques:
http://gafferongames.com/2009/01/25/game-networking-resources/
Here are some research papers leaning toward large scale issues:
http://maggotranch.com/biblio.html
Friday, March 27, 2009
Entity Migration Mechanism
From earlier posts, we see that to achieve scalability, we need:
- large numbers of simulators
- an ability to load balance based on load (not just geography)
- an authoritative simulator to avoid DB bottlenecks
- a single-write paradigm to avoid overly complex synchronization
Setting aside the policy and impetus for initiating an Entity migration, let's talk about the mechanics. By separating policy and mechanism, we can experiment with or customize the policy to use application-specific information, resulting in a closer-to-optimal solution. And we won't have to reimplement the mechanism each time.
We know we can run an Entity on any host by using interest management as discussed earlier to feed an Entity everything it needs to operate correctly. So all we really need to realize a migration is:
- getting the Entity state onto the new host
- getting the data flowing to that host that is needed by that Entity
- doing this quickly enough that there are no hiccups visible to the players
- avoiding all ordering and race conditions so there is no game-logic difference compared to not migrating (no side effects)
- surviving crashes of any component at the worst possible moment (i.e. preserving important transactionality) without significant impact to the players
There is a handshake needed to get this state across:
- suspend further execution of the entity so things don't change during the migration
- transmit the state
- recreate the entity on the target host
- resume execution of the entity.
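Here is a minimal sketch of that handshake in C++, with both simulators in one process standing in for a network hop. Entity, Simulator, and all the member names are made up for illustration, and the serialization is deliberately crude:

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <vector>

    using EntityId = std::uint64_t;
    using Blob = std::vector<std::uint8_t>;

    // Hypothetical entity: a position stands in for all of its game state.
    struct Entity {
        EntityId id = 0;
        float x = 0, y = 0;
        bool suspended = false;

        Blob serialize() const {
            const float p[2] = {x, y};
            const auto* b = reinterpret_cast<const std::uint8_t*>(p);
            return Blob(b, b + sizeof p);
        }
        void deserialize(const Blob& s) {
            float p[2];
            std::copy(s.begin(), s.end(), reinterpret_cast<std::uint8_t*>(p));
            x = p[0]; y = p[1];
        }
    };

    class Simulator {
    public:
        // Source side of the handshake.
        void migrateOut(EntityId id, Simulator& target) {
            Entity& e = entities.at(id);
            e.suspended = true;            // 1. suspend further execution
            Blob state = e.serialize();    // 2. capture and transmit the state...
            target.migrateIn(id, state);   //    (a direct call stands in for a send)
            entities.erase(id);            //    source copy is no longer authoritative
        }
        // Target side: recreate and resume.
        void migrateIn(EntityId id, const Blob& state) {
            Entity e;
            e.id = id;
            e.deserialize(state);          // 3. recreate the entity here
            e.suspended = false;           // 4. resume execution on this host
            entities[id] = e;
        }
        std::map<EntityId, Entity> entities;
    };

    int main() {
        Simulator a, b;
        a.entities[7] = Entity{7, 1.5f, 2.5f};
        a.migrateOut(7, b);
        std::cout << "entity 7 now on host B at ("
                  << b.entities[7].x << ", " << b.entities[7].y << ")\n";
    }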
So we need to add some steps:
- Get the target subscriptions set up and acknowledged before the transfer so when the Entity arrives, all data is available there that it had in its original location
- Have the original simulator "flush" its DB queue so the DB never sees out-of-order persistence requests, and then stop persisting that entity until after the migration.
- Increment an "epoch" counter to allow us to discard any replication messages or requests from the past
- Given the increase in time and complexity, it may be worth optimizing the process by pre-loading the target host without actually pausing the original entity. Then once everything is set up, resend states that may have changed during the preload. Of course, you might also make use of any state that was previously replicated to the target.
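The epoch counter is easy to sketch. In this hypothetical version (the message layout and class names are mine, not from the post), the entity's epoch is bumped as part of the handshake, and any replication message stamped with an older epoch is silently dropped:

    #include <cstdint>
    #include <unordered_map>

    using EntityId = std::uint64_t;

    // Hypothetical replication message, stamped with the sender's view of
    // the entity's epoch at send time.
    struct ReplicationMsg {
        EntityId entity;
        std::uint32_t epoch;
        // ... payload would follow ...
    };

    class EpochFilter {
    public:
        // Called during the migration handshake, before the entity resumes
        // on the target host.
        void beginNewEpoch(EntityId id) { ++current[id]; }

        // Anything stamped with a stale epoch was sent before the migration
        // and must be discarded, not applied out of order.
        bool accept(const ReplicationMsg& m) const {
            auto it = current.find(m.entity);
            std::uint32_t cur = (it == current.end()) ? 0 : it->second;
            return m.epoch >= cur;
        }
    private:
        std::unordered_map<EntityId, std::uint32_t> current;
    };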
One of the coolest features of interest management is that you can choose to not migrate and the game still runs the same (but may use more datacenter-only networking). So if you can't migrate an entity until it finishes a behavior (because you can't migrate your stack), no problem, just wait. Program that into your policy. If you find that hitches are visible but only when a player is in a heavy combat situation, your policy can delay initiating the migration until the participants have been quiescent for a while.
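Such a policy can be as simple as a predicate the load balancer consults before kicking off the mechanism above. In this sketch (all names and the five-second threshold are made up), migration is deferred until the entity has been quiescent for a while:

    #include <chrono>

    // Hypothetical per-entity stats the policy inspects.
    struct EntityStats {
        bool inCombat = false;
        std::chrono::steady_clock::time_point lastActivity;
    };

    // Defer migration while the entity is "hot"; interest management makes
    // waiting safe, it just costs extra datacenter-only networking.
    bool okToMigrate(const EntityStats& s) {
        using namespace std::chrono;
        const auto quiet = steady_clock::now() - s.lastActivity;
        return !s.inCombat && quiet > seconds(5);   // threshold is made up
    }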
Labels:
Interest Management,
Load balancing,
Scalability
Monday, March 16, 2009
Debugging leaked references
Everyone likes using smart pointers and reference counting to manage the lifetime of objects that are passed through modular interfaces. They can avoid a lot of copying and callers don't have to read a lot of documentation to figure out if they are responsible for deleting the object.
One big problem with using ref-counted objects is when they "leak". It is extremely difficult to debug what code has taken an extra reference to an object that you think should have a single remaining reference. Why didn't this object destroy itself when I cleared this "last" smart pointer? When an object is passed through many layers and stored in various containers, there can be quite a few reference increments and decrements occurring, so putting in breakpoints is not very useful. I have found this to be one of the most difficult and tedious problems to fix in a large application like a game.
Regular memory leak detectors are not too useful, since they just say "it leaked", not why. What you really want to know is the location of every outstanding reference to a block that you think should have already been deleted. You can ask this question at the end of a run on the unexpectedly outstanding blocks, or in the middle of a run when you have found something strange and want to backtrack.
It is "easy" but not much use getting the memory address of every smart pointer that points at the problem object. Make multi-map keyed on object addresses. Each time you assign or clear a smart pointer (increment or decrement a reference on an object), modify that map to include or exclude the address of that smart pointer. Done.
Making this "useful" is a matter of recording information about each smart pointer for later logging. Ideally, you would record a partial stack trace so you can see where the smart pointer was affected and what caused that. This requires another chunk of data beyond the smart pointer's memory location.
If you are only concerned about extra references, you could probably get away with recording the stack trace only on addRef. Note that for each smart pointer, there is only a single outstanding reference, so you only need the one stack trace per smart pointer.
If you have a situation where you have too many decRefs, you may need to record the stack trace of each addRef and decRef for the broken object. Garbage collecting that data is going to be interesting, because you can never be sure when an extra decRef might happen.
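Putting the multi-map idea into code might look like the following sketch. Capturing a real stack trace is platform-specific (CaptureStackBackTrace on Windows, backtrace on POSIX), so this hypothetical version tags each reference with a caller-supplied label instead; the class and method names are mine:

    #include <cstdio>
    #include <map>
    #include <string>

    class RefTracker {
    public:
        static RefTracker& instance() { static RefTracker t; return t; }

        // Call from the smart pointer's addRef path.
        void onAddRef(const void* obj, const void* holder, std::string where) {
            refs.insert({obj, Record{holder, std::move(where)}});
        }
        // Call from the smart pointer's release/clear path.
        void onRelease(const void* obj, const void* holder) {
            auto range = refs.equal_range(obj);
            for (auto it = range.first; it != range.second; ++it) {
                if (it->second.holder == holder) { refs.erase(it); return; }
            }
        }
        // The payoff: list every smart pointer still referencing obj and
        // where it took its reference.
        void dumpOutstanding(const void* obj) const {
            auto range = refs.equal_range(obj);
            for (auto it = range.first; it != range.second; ++it)
                std::printf("holder %p acquired at %s\n",
                            it->second.holder, it->second.where.c_str());
        }
    private:
        struct Record { const void* holder; std::string where; };
        std::multimap<const void*, Record> refs;   // object -> live references
    };

A smart pointer's assignment operator would call onAddRef(object, this, capture-site) and its destructor or clear path onRelease(object, this); dumpOutstanding is what you call on the block your leak detector flagged.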
Putting this together, you can use regular memory leak detection to identify unexpectedly outstanding ref-counted objects. Then ask this ref-count debugging system for the stack trace(s) showing where the smart pointers that still hold a reference to that object were initialized.
Friday, March 13, 2009
What is a memory leak and why do we care?
I am a big fan of Purify. It is a tool that helps programmers deal with the challenging memory systems of C and C++. In those languages, we can "leak" memory.
We try to get rid of all the leaks (usually a few days before we ship). I've seen this take quite a lot of effort. Superficially, it seems like wasted effort since modern operating systems keep track of your memory and recover it when your process exits. So, really, we could just exit and kill the process. The app would shut down a whole lot faster.
Why does that sound like such a bad idea? Why do we even care about leaks?
- We don't want to have our process use more resources than it needs. It could crash or make other apps unhappy.
- It makes for higher quality software. We are strictly disciplined about who owns what.
- Managing the scope/lifetime of an object is occasionally very important. For example, we may be holding some I/O that needs to be flushed.
There are two competing definitions of a leak:
- Precise: a leak is an allocated block of memory that you don't have a pointer to anymore, so you can't ever delete it. A potential leak is one where you only have a pointer to an address somewhere in the middle of the block. This might happen if you do some funky pointer arithmetic and intend to undo it later to delete the block. Anything else is referred to as an in-use allocation.
- Traditional: a leak is a block that is still allocated (outstanding) when the application shuts down.
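A contrived snippet makes the distinction concrete (using the definitions above):

    #include <cstdlib>

    int* g_inUse = nullptr;   // keeps the third block reachable for the whole run

    int main() {
        // Precise leak: the only pointer to the first block is overwritten,
        // so that block can never be freed.
        int* p = static_cast<int*>(std::malloc(64));
        p = static_cast<int*>(std::malloc(64));   // first block now unreachable

        // Traditional "leak": still allocated at exit, but reachable through
        // g_inUse the whole time -- really just an in-use block.
        g_inUse = static_cast<int*>(std::malloc(64));

        std::free(p);   // frees only the second block
        return 0;       // the in-use block is outstanding at exit; harmless
    }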
The tools used are simple. They keep track of outstanding allocations, but can't distinguish precise leaks from blocks that are merely still in use. They have neat ways of showing where the memory was allocated, and so on. The app writer finds all those outstanding allocations and carefully clears them as the application shuts down.
But you don't super-need-to clear the in-use memory. It is not hurting anything. You could delete it any time, if you really wanted to (sounds like an addict). OK, there is one way it can hurt something: you could accidentally have lists or free-lists that bloat up, and it will look like you really are using that memory. However, since it is in-use, you should be able to instrument your lists and watch them bloat up.
You do super-need-to clear precise leaks. They are caused by overwriting or clearing your pointers without doing the deallocate first. If you don't plug those leaks, you are going to sink, or crash, or lose oil pressure or air pressure, or some other analogy.
How could Purify possibly differentiate between precise leaks and in-use blocks? Believe it or not, it watches every pointer assignment and sees when the last reference to a block is lost. It does this with something called Object Code Instrumentation, so it doesn't care what compiler you are using. It inserts assembly code into your executable (e.g. around assignments) and fixes up the jumps and symbol addresses that change when making room for the inserted code. It consequently knows what you have done, or accidentally done, to every pointer (including nasty things with pointer math).
As a result, it can focus the coder's attention on blocks that are unreferencable. It can even (theoretically) throw a breakpoint at the instruction that overwrites the pointer. I know it can break at the instruction where you read an uninitialized variable. At any point during debugging, you can make a call and have it dump unreferencable blocks of memory and where they were allocated. Of course you can also dump all in-use blocks by calling a function when paused in the debugger. I believe you can make a checkpoint and then later dump all blocks that are in-use that were allocated since the checkpoint.
If you insert some code in your custom allocator, the Purify SDK can even use your allocator to do its magic. You could make calls to it at run time to dump metrics or react to issues.
As you can see, unreferencable blocks are the real leaks. We only have to clear out all in-use blocks because we don't use tools like Purify, and have to overreact and compromise. I don't like the busy work of clearing every last legitimate in-use block. I don't think I should have-to-have-to. (As long as I'm careful about my "other" resources like I/O buffers.) It makes for better code if I do, but if I want to trade time for quality or one feature for another, I still have to clear the real leaks, and I'd like the option of ignoring the other blocks swept up by the traditional definition.
It has a bunch of other cool stuff too. Like knowing when you access memory that is uninitialized, or already deallocated. It knows when you run off the end of an array. It can instrument third party object code even if you don't have source, and it doesn't use your allocator.
Of course there is a runtime performance hit, but it isn't super bad. And you can control which modules are instrumented, so you might exclude your rendering for example.
It has a UI that integrates to Visual Studio and gives you a nice report of these various kinds of memory errors at the end of your run, or whenever you ask. It will even take you to the offending line of code.
Don't balk at the price either. Just remember the amount of time the poor guy spent who was stuck with clearing every last in-use block. It more than pays for itself very quickly.