Friday, March 13, 2009

What is a memory leak and why do we care?

I am a big fan of Purify. It is a tool that helps programmers deal with the challenging memory systems of C and C++. In those languages, we can "leak" memory.

We try to get rid of all the leaks (usually a few days before we ship). I've seen this take quite a lot of effort. Superficially, it seems like wasted effort, since modern operating systems keep track of your memory and recover it when your process exits. So, really, we could just exit and kill the process. The app would shut down a whole lot faster.

Why does that sound like such a bad idea? Why do we even care about leaks?
  • We don't want to have our process use more resources than it needs. It could crash or make other apps unhappy.
  • It makes for higher quality software. It keeps us strictly disciplined about who owns what.
  • Managing the scope/lifetime of an object is occasionally very important. For example, we may be holding some I/O that needs to be flushed.
There are two competing definitions of leak. Obviously I like the Purify definition better. I'm going to argue that the other (more common) definition is an oversimplification and a compromise:
  • Precise: a leak is an allocated block of memory that you no longer have a pointer to. So you can't ever delete it. A potential leak is one where you only have a pointer to an address somewhere in the middle of the block. This might happen if you do some funky pointer arithmetic and intend to undo it later so you can delete the block. Anything else is referred to as an in-use allocation. (See the sketch after this list.)
  • Traditional: a leak is a block that is still allocated (outstanding) when the application shuts down.
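
To make those definitions concrete, here is a minimal C++ sketch of what each category looks like in code (the variable names are mine, purely for illustration):

    #include <cstdlib>

    int main() {
        // In-use allocation: we still hold the pointer we got back,
        // so we can free it whenever we like.
        char* inUse = static_cast<char*>(std::malloc(64));

        // Potential leak: the only pointer we keep aims into the middle
        // of the block. We could recover the start by subtracting 32,
        // but a tool cannot know whether we ever will.
        char* middle = static_cast<char*>(std::malloc(64)) + 32;

        // Precise leak: the last pointer to this block gets overwritten
        // before anyone frees it. Those 64 bytes are unreachable for the
        // rest of the run.
        char* lost = static_cast<char*>(std::malloc(64));
        lost = 0;

        std::free(inUse);        // fine: still referenced
        std::free(middle - 32);  // fine: we undid the arithmetic
        (void)lost;
        return 0;
    }
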
Most custom memory allocators use the traditional definition, because developers have no good way of implementing the precise one. They require the application writer to build good quality code that carefully deletes all the memory it allocates, such that none is left outstanding at the end of execution. Certainly that definition subsumes the Precise definition. The traditional definition is saying there are no leaks, no potential leaks, no outstanding allocations. Nothing. Period. How can you argue with that?

The tools used are simple. They keep track of outstanding allocations, but they can't separate out the precise leaks. They have neat ways of showing where the memory was allocated, and so on. The app writer finds all those outstanding allocations and carefully clears them as the application shuts down.
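
For flavor, here is a minimal sketch of that traditional style of tracker, assuming you are willing to route allocations through wrapper functions (the names here are invented for illustration, they are not Purify's):

    #include <cstdio>
    #include <cstddef>
    #include <cstdlib>
    #include <map>

    // Maps each live allocation to the file/line that made it.
    struct AllocInfo { const char* file; int line; size_t size; };
    static std::map<void*, AllocInfo> g_live;

    void* trackedAlloc(size_t size, const char* file, int line) {
        void* p = std::malloc(size);
        if (p) { AllocInfo info = { file, line, size }; g_live[p] = info; }
        return p;
    }

    void trackedFree(void* p) {
        g_live.erase(p);
        std::free(p);
    }

    // Call this at shutdown: anything still in the map is an
    // "outstanding" block under the traditional definition.
    void reportOutstanding() {
        std::map<void*, AllocInfo>::const_iterator it;
        for (it = g_live.begin(); it != g_live.end(); ++it)
            std::printf("still allocated: %u bytes from %s:%d\n",
                        (unsigned)it->second.size,
                        it->second.file, it->second.line);
    }

    #define TRACKED_ALLOC(n) trackedAlloc((n), __FILE__, __LINE__)

    int main() {
        void* kept  = TRACKED_ALLOC(128);  // never freed: shows up in the report
        void* freed = TRACKED_ALLOC(64);
        trackedFree(freed);
        reportOutstanding();
        trackedFree(kept);
        return 0;
    }
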

But you don't super-need-to clear the in-use memory. It is not hurting anything. You could delete it any time, if you really wanted to (sounds like an addict). OK. There is one way it can hurt something. You could accidentally have lists or free-lists that bloat up, and it will look like you really are using that memory. However, since it is in-use, you should be able to instrument your lists and watch them bloat up.
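
By "instrument" I mean nothing fancier than each list knowing how to report its own footprint. A hypothetical sketch, with invented names:

    #include <cstdio>
    #include <cstddef>
    #include <vector>

    // A free-list that can report how much memory it is parking,
    // so bloat shows up in a log instead of looking like real usage.
    class InstrumentedFreeList {
    public:
        explicit InstrumentedFreeList(size_t blockSize) : m_blockSize(blockSize) {}

        void push(void* block) { m_blocks.push_back(block); }

        void* pop() {
            if (m_blocks.empty()) return 0;
            void* p = m_blocks.back();
            m_blocks.pop_back();
            return p;
        }

        size_t bytesParked() const { return m_blocks.size() * m_blockSize; }

        void report(const char* name) const {
            std::printf("%s free-list: %u blocks, ~%u bytes parked\n",
                        name, (unsigned)m_blocks.size(), (unsigned)bytesParked());
        }

    private:
        size_t m_blockSize;
        std::vector<void*> m_blocks;
    };
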

You do super-need-to clear precise leaks. They are caused by overwriting or clearing your pointers without doing the deallocate first. If you don't plug those leaks, you are going to sink, or crash, or lose oil pressure or air pressure, or some other analogy.

How could Purify possibly differentiate between precise leaks and in-use? Believe it or not, it watches every pointer assignment and sees when the last reference to a block is lost. It does this with something called Object Code Instrumentation, so it doesn't care what compiler you are using. It inserts assembly code into your executable (e.g. around assignments) and fixes up the jumps and symbol addresses that shift when it makes room for the inserted code. It consequently knows what you have done, or accidentally done, to every pointer (including nasty things with pointer math).

As a result it can focus the coder's attention on blocks that are unreferenceable. It can even (theoretically) throw a breakpoint at the instruction that overwrites the pointer. I know it can break at the instruction where you read an uninitialized variable. At any point during debugging, you can make a call and have it dump unreferenceable blocks of memory and where they were allocated. Of course you can also dump all in-use blocks by calling a function when paused in the debugger. I believe you can make a checkpoint and then later dump all the in-use blocks that were allocated since the checkpoint.
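
If memory serves, the Purify runtime API (purify.h) exposes exactly these hooks. The calls below are from recollection, so treat the names as assumptions to check against your version's header:

    #include <purify.h>   // header that ships with Purify (assumed install)

    void dumpMemoryStatus() {
        if (!PurifyIsRunning())   // harmless no-op when not run under Purify
            return;

        PurifyNewLeaks();         // report leaks discovered since the last call
        PurifyAllInuse();         // report every block currently in use
    }

    // Checkpoint pattern: call PurifyNewInuse() once to set a baseline, then
    // call it again later to list only the in-use blocks allocated since then.
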

If you insert some code in your custom allocator, the Purify SDK can even use your allocator to do its magic. You could make calls to it at run time to dump metrics or react to issues.

As you can see, unreferenceable blocks are the real leaks. We only have to clear out all in-use blocks because we don't use tools like Purify, and have to overreact and compromise. I don't like the busy work of clearing every last legitimate in-use block. I don't think I should have-to-have-to. (As long as I'm careful about my "other" resources like I/O buffers.) It makes for better code if I do, but if I want to trade time for quality or one feature for another, I still have to clear the real leaks, and I like having the option of ignoring the other blocks swept up by the traditional definition.

It has a bunch of other cool stuff too. Like knowing when you access memory that is uninitialized, or already deallocated. It knows when you run off the end of an array. It can instrument third-party object code even if you don't have the source and it doesn't use your allocator.
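
For reference, here is a tiny (deliberately broken) snippet that trips each of those checks; Purify reports them with short codes like UMR, ABW, and FMR, if I remember the labels right:

    #include <cstdlib>

    int main() {
        int uninitialized;                                // never written
        int copy = uninitialized;                         // uninitialized read (UMR)

        int* block = static_cast<int*>(std::malloc(4 * sizeof(int)));
        block[4] = 1;                                     // write past the end (ABW)

        std::free(block);
        int stale = block[0];                             // read of freed memory (FMR)

        (void)copy; (void)stale;
        return 0;
    }
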

Of course there is a runtime performance hit, but it isn't super bad. And you can control which modules are instrumented, so you might exclude your rendering for example.

It has a UI that integrates to Visual Studio and gives you a nice report of these various kinds of memory errors at the end of your run, or whenever you ask. It will even take you to the offending line of code.

Don't balk at the price either. Just remember the amount of time spent by the poor guy who was stuck clearing every last in-use block. It pays for itself very quickly.

2 comments:

  1. Nice, but in most games and apps I've shipped, the problems weren't usually these "precise" or "traditional" leaks. They were logical faults where systems failed to let go of references when they should have. That could be between level loads, after initialization, etc. To Purify, these would all look valid. And, in hard-to-track cases, code that checks for leaks at app shutdown wouldn't catch them either, since there was logic that cleaned them up.

    The negatives: overuse of memory on consoles that have little, or worse, references to memory that has been reclaimed.

    I've shipped games using the philosophy of "kill the process, it'll clean up the memory". While it worked, there was strong consensus that we'd painted ourselves into that corner, and it wasn't a good place to be. I'd not do that again.

  2. I've lived where you are (be clean) a long time and have pushed for it to the point of irritating people. But I am beginning to question it because of the development cost.

    Here's the thing. The "bloat" that you talk about, where the system doesn't let go of references, can be instrumented. How fat are your caches getting? How much space is taken up by the scene graph or textures? Because those blocks are "controlled", the app is able to do this (given foresight and diligence). And as you say, those bloats might even be missed by traditional approaches because the caches are deleted at shutdown or whatever. So even with traditional leak detection, you need this kind of instrumentation.

    I've shipped with leaks too, and used to swear I'd never do it again. But I bet it will happen because I (now) think it doesn't have great return on investment. Teams eventually give up on finding the last few leaks, fall off the wagon, and shrug.

    Well. Even if your team decides to invest in the 100% solution, you should buy Purify. It will speed things up incredibly.

    Now. Let's talk about debugging "leaked" reference counts. Heh. Another post is burbling.
