One of the things that sucks about XPCOM is that you have to deal with reference counting. It's hard and prone to errors, and Mozilla leaks like a sieve because of it. Unlike the good old-fashioned malloc() and free() model, where memory gets allocated in exactly one place and freed in exactly one other place, reference counting is distributed all over. There may be twenty different spots in the code where a single object is AddRef()-ed. And if just one of those AddRef()-ers forgets to Release(), well, you're screwed.

Traditional leak tracking tools like Purify don't help much either. They'll tell you that you leaked an object, but they won't help you track down the twenty different clients that AddRef()-ed it, let alone the joker that forgot to Release() it.

This crude set of tools attempts to address that problem. It's not a panacea, but it at least gives some insight into who is AddRef()-ing whom.

From 50,000 feet, here's what happens.

You discover that your FooImpl object is leaking: maybe Bruce Mitchener tells you, or maybe you notice on your own because your destructor is never called. You cringe and moan and put the bug off for 3 or 4 milestones. But since you know about this tool, you eventually roll up your sleeves and start working on it.
First you add a #define to the top of your FooImpl.cpp file. Behind the scenes, this changes how the NS_IMPL_ADDREF and NS_IMPL_RELEASE macros are implemented.
You recompile FooImpl.cpp and re-link its DLL (that's it!), set an environment flag, and run. (N.B., you don't need to rebuild the entire tree, just the files that contain the objects you want to track.)
As you run, piles and piles of information start to spew out to the console. Specifically, each time your object is AddRef()-ed or Release()-ed, a stack trace is generated, along with the operation (AddRef or Release), this (i.e., the object that just got operated on), and the current reference count of your object. This mountain of information, although impressive, is useless in its current form. So, you re-run and redirect your output to a file.
You next run Perl script #1 over the resulting log file. This Perl script will pick out the instances of objects that leaked. You choose one of the objects that's particularly interesting to you.
You now run Perl script #2 over the log file. This script is the Fancy Magic. It takes each stack trace and strings it together into a call graph. Each node in the graph represents a call site, and has a "balance factor" which is the total number of AddRef() operations that it has been included in minus the total number of Release() operations that it has been included in. (I told you it was Fancy Magic.)

So what does all that mean? The cool part -- you were waiting for the cool part -- is that you can look at this graph and see what subtrees are "balanced"; i.e., total number of AddRef()s equals total number of Release()-es. You know you don't need to worry about those trees because no evil leakage happened there.

For trees that are out of balance, you need to dig a little bit deeper. Subtrees get out of balance when one code path AddRef()s the object, and a code path somewhere else does the corresponding Release().

Like I said, it's not a panacea, but you can start to play Mah Jongg with the out-of-balance trees, proving to yourself in each case that the AddRef() from one tree matches with the Release() in another. In short, it does a decent job of directing you to the places you need to verify in your code.

details

Instrumentation. The instrumentation is pretty painless. You pick the object that you want to "spy" on, and add the following line before including any header files:

#define MOZ_LOG_REFCNT

For those of you interested in the details, this alters the NS_IMPL_ADDREF and NS_IMPL_RELEASE macro declarations that are #define-d in nsISupportsUtils.h. Specifically, it inserts calls to nsTraceRefcnt::LogAddRef() and nsTraceRefcnt::LogRelease(), whose implementations live in nsTraceRefcnt.cpp. Note that this scheme is similar to, but not the same as, the scheme that Kipp (?) and others used for MOZ_TRACE_XPCOM_REFCNT. Their scheme altered the NS_ADDREF() and NS_RELEASE() macros, which required a full recompile to use, and generated way more information than would ever be useful at this point [1].
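For a feel of what the #define does, here is a minimal sketch of the same trick. It is not the real nsISupportsUtils.h code: DEMO_IMPL_ADDREF, DEMO_IMPL_RELEASE, and DemoTraceRefcnt are made-up stand-ins for NS_IMPL_ADDREF, NS_IMPL_RELEASE, and nsTraceRefcnt, and the sketch records one line per operation instead of walking the stack.

```cpp
// Must come before any #include, as described above.
#define MOZ_LOG_REFCNT

#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for nsTraceRefcnt: collects one line per operation.
struct DemoTraceRefcnt {
    static std::vector<std::string>& Lines() {
        static std::vector<std::string> lines;
        return lines;
    }
    static void Log(const char* op, void* self, unsigned long refcnt) {
        char buf[128];
        std::snprintf(buf, sizeof buf, "%s %p %lu", op, self, refcnt);
        Lines().push_back(buf);
    }
};

// The logging hooks compile to nothing unless MOZ_LOG_REFCNT is defined.
#ifdef MOZ_LOG_REFCNT
#define DEMO_LOG_ADDREF(self, cnt)  DemoTraceRefcnt::Log("AddRef", (self), (cnt))
#define DEMO_LOG_RELEASE(self, cnt) DemoTraceRefcnt::Log("Release", (self), (cnt))
#else
#define DEMO_LOG_ADDREF(self, cnt)  ((void)0)
#define DEMO_LOG_RELEASE(self, cnt) ((void)0)
#endif

// Hypothetical equivalents of the NS_IMPL_* macros.
#define DEMO_IMPL_ADDREF(Class)                                         \
    unsigned long Class::AddRef() {                                     \
        ++mRefCnt;                                                      \
        DEMO_LOG_ADDREF(this, mRefCnt);                                 \
        return mRefCnt;                                                 \
    }

#define DEMO_IMPL_RELEASE(Class)                                        \
    unsigned long Class::Release() {                                    \
        assert(mRefCnt > 0 && "Release() without a matching AddRef()"); \
        unsigned long cnt = --mRefCnt;                                  \
        DEMO_LOG_RELEASE(this, cnt);                                    \
        if (cnt == 0) delete this;                                      \
        return cnt;                                                     \
    }

class FooImpl {
public:
    FooImpl() : mRefCnt(0) {}
    unsigned long AddRef();
    unsigned long Release();
private:
    ~FooImpl() {}  // only Release() may destroy the object
    unsigned long mRefCnt;
};

DEMO_IMPL_ADDREF(FooImpl)
DEMO_IMPL_RELEASE(FooImpl)
```

Because the hooks live inside the method bodies, only the file that expands the macros has to be recompiled; callers of AddRef() and Release() are untouched.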

Enabling Runtime Logging. Even though you've recompiled your file with #define MOZ_LOG_REFCNT, you still need to set a runtime environment variable to produce output. (This is mostly because it produces a boat-load of information, and slows stuff to a crawl if your objects see a lot of action. Being able to turn it off without having to recompile is nice.)

It uses the PR_LOG_TEST() macros to determine when to print, so do whatever is appropriate for your platform to set NSPR_LOG_MODULES to xpcomrefcnt:5. Yes, case is important.
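A rough sketch of that kind of runtime gate, and emphatically not NSPR's own parser: LogLevelFor and RefcntLoggingEnabled are made-up names, the comma-separated name:level layout is assumed from the xpcomrefcnt:5 example, and the match is case-sensitive only because the text above says case matters.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>

// Returns the level requested for `module` in a spec such as
// "othermodule:1,xpcomrefcnt:5", or 0 if the module is not listed.
int LogLevelFor(const std::string& spec, const std::string& module) {
    assert(!module.empty());
    std::istringstream entries(spec);
    std::string entry;
    while (std::getline(entries, entry, ',')) {
        std::string::size_type colon = entry.find(':');
        if (colon == std::string::npos) continue;        // no level given: skipped in this sketch
        if (entry.substr(0, colon) != module) continue;  // exact, case-sensitive comparison
        return std::atoi(entry.c_str() + colon + 1);
    }
    return 0;
}

// True if NSPR_LOG_MODULES asks for at least `wantedLevel` on xpcomrefcnt.
// An unset variable means logging stays off.
bool RefcntLoggingEnabled(int wantedLevel) {
    const char* spec = std::getenv("NSPR_LOG_MODULES");
    return spec != nullptr && LogLevelFor(spec, "xpcomrefcnt") >= wantedLevel;
}
```

The level that the real logging code tests against isn't stated here, so the sketch leaves the threshold to the caller.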

Now when you run, you should see lots of information printed to the console when your objects are AddRef()-ed and Release()-ed. Since you probably don't want to pick through it by hand, redirect it to a log file. Don't worry about the other cruft that apprunner or viewer prints out. The post-processing scripts can deal.

Postprocessing Step 1: Finding the Leakers. First you have to figure out which objects leaked. There's a Perl script that does this [2]. It grovels through the log file and figures out which objects got allocated (it knows they were just allocated because they got AddRef()-ed and their refcount became 1). It adds them to a list. When it finds an object that got freed (it knows because its refcount goes to 0), it removes it from the list. Anything left over is leaked.

The script is called find-leakers.pl. So, depending on your platform, do something like:

% perl -w find-leakers.pl my-leaks.log

(Replace my-leaks.log with your logfile.) This will print out a list of pointers:

0x00253ab0 (1)
0x00253ae0 (2)
0x00253bd0 (4)

The number in parentheses is the order in which it was allocated, if you care. Pick one for use with Step 2.
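The bookkeeping behind Step 1 is small enough to sketch. This is not a port of find-leakers.pl: it assumes the log has already been parsed into records, and the RefEvent type and FindLeakers function are hypothetical names made up for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum RefOp { kAddRef, kRelease };

// One already-parsed log record (hypothetical layout).
struct RefEvent {
    std::string object;  // address as printed in the log, e.g. "0x00253ab0"
    RefOp op;
    int refcnt;          // reference count after the operation
};

typedef std::pair<std::string, int> Leaker;  // (object, allocation order)

// Returns every object that was allocated but never freed,
// sorted by the order in which the objects were allocated.
std::vector<Leaker> FindLeakers(const std::vector<RefEvent>& events) {
    std::map<std::string, int> live;  // object -> allocation serial number
    int serial = 0;
    for (const RefEvent& e : events) {
        assert(e.refcnt >= 0);
        if (e.op == kAddRef && e.refcnt == 1) {
            live[e.object] = ++serial;  // refcount became 1: just allocated
        } else if (e.op == kRelease && e.refcnt == 0) {
            live.erase(e.object);       // refcount went to 0: freed
        }
    }
    std::vector<Leaker> leaked(live.begin(), live.end());
    std::sort(leaked.begin(), leaked.end(),
              [](const Leaker& a, const Leaker& b) { return a.second < b.second; });
    return leaked;
}
```

Printing each result as the address followed by the serial number in parentheses gives a listing in the same shape as the one above. A gap in the serial numbers just means the object allocated in between was freed properly.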

Postprocessing Step 2: Building the Balance Tree. Now that you've picked an object that leaked, you can build a "balance tree" (anyone who can think of a better name feel free to let me know). This process takes all the AddRef() and Release() stack traces and munges them into a call graph. Each node in the graph represents a call site. Each call site has a "balance factor", which is positive if more AddRef()s than Release()-es have happened at the site, zero if the number of AddRef()s and Release()-es are equal, and negative if more Release()-es than AddRef()s have happened at the site.

To build the balance tree, run make-tree.pl; e.g.,

% perl -w make-tree.pl --object 0x00253ab0 my-leaks.log

Note that you specify the object that you want make-tree.pl to examine. This will build an indented tree that looks something like this (except probably a lot larger and leafier):

.root: bal=1
  main: bal=1
    DoSomethingWithFooAndReturnItToo: bal=2
      NS_NewFoo: bal=1

Let's pretend in our toy example that NS_NewFoo() is a factory method that makes a new foo and returns it. DoSomethingWithFooAndReturnItToo() is a method that munges the foo before returning it to main(), the main program.

What this little tree is telling you is that you leak one refcount overall on object 0x00253ab0. But, more specifically, it shows you that:

NS_NewFoo() "leaks" a refcount. This is probably "okay" because it's a factory method that creates an AddRef()-ed object.
DoSomethingWithFooAndReturnItToo() leaks two refcounts. Hmm...this probably isn't okay, especially because...
main() is back down to leaking one refcount.

So from this, we can deduce that main() is correctly releasing the refcount that it got on the object returned from DoSomethingWithFooAndReturnItToo(), so the leak must be somewhere in that function.

So now say we go fix the leak in DoSomethingWithFooAndReturnItToo(), re-run our trace, grovel through the log "by hand" to find the object that corresponds to 0x00253ab0 in the new run, and run make-tree.pl. What we'd hope to see is a tree that looks like:

.root: bal=0
  main: bal=0
    DoSomethingWithFooAndReturnItToo: bal=1
      NS_NewFoo: bal=1

That is, NS_NewFoo() "leaks" a single reference count; this leak is "inherited" by DoSomethingWithFooAndReturnItToo(); but is finally balanced by a Release() in main().
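To make the balance factors concrete, here is a small sketch of how such a tree can be built from stack traces. It is not make-tree.pl: StackEvent, CallNode, AddEvent, PrintTree, and BalanceTree are made-up names, each stack is assumed to be listed outermost caller first, and children are printed in alphabetical order simply because that's what std::map does.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// One logged operation: which way the refcount moved, and from where.
struct StackEvent {
    bool isAddRef;                   // true for AddRef(), false for Release()
    std::vector<std::string> stack;  // call sites, outermost caller first (assumed)
};

// One call site in the balance tree.
struct CallNode {
    int balance = 0;  // AddRef()s minus Release()s that passed through this site
    std::map<std::string, std::unique_ptr<CallNode>> children;
};

// Walks down the tree along the event's stack, adding +1 for an AddRef()
// or -1 for a Release() to the root and to every call site on the way.
void AddEvent(CallNode& root, const StackEvent& ev) {
    const int delta = ev.isAddRef ? +1 : -1;
    CallNode* node = &root;
    node->balance += delta;
    for (const std::string& site : ev.stack) {
        assert(!site.empty());
        std::unique_ptr<CallNode>& child = node->children[site];
        if (!child) child.reset(new CallNode);
        node = child.get();
        node->balance += delta;
    }
}

// Prints the tree in the indented "name: bal=N" layout used above.
void PrintTree(const CallNode& node, const std::string& name, int depth,
               std::ostream& out) {
    out << std::string(2 * depth, ' ') << name << ": bal=" << node.balance << "\n";
    for (const auto& entry : node.children)
        PrintTree(*entry.second, entry.first, depth + 1, out);
}

// Builds the tree for a list of events and returns it as text.
std::string BalanceTree(const std::vector<StackEvent>& events) {
    CallNode root;
    for (const StackEvent& ev : events) AddEvent(root, ev);
    std::ostringstream out;
    PrintTree(root, ".root", 0, out);
    return out.str();
}
```

One set of events that would produce the leaky toy tree (my reconstruction, not something taken from a real log) is: an AddRef() reached through main, DoSomethingWithFooAndReturnItToo, and NS_NewFoo; a second AddRef() reached through main and DoSomethingWithFooAndReturnItToo; and a Release() reached through main. Dropping the second AddRef() gives the fixed tree.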

hints

Clearly, this is an iterative and analytical process. Maybe somebody smarter than me can figure out ways to automate parts of it. To date, I've figured out some tricks.

Ignoring balanced trees. The make-tree.pl script accepts an option --ignore-balanced, which tells it not to bother printing out the children of a node whose balance factor is zero. This can help remove some of the clutter from an otherwise noisy tree.
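The pruning rule is simple enough to show in a few lines. This is only a sketch of the idea, with a hypothetical Node type and PrintTree function rather than anything lifted from make-tree.pl: a balanced node is still printed, but its children are skipped.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// A finished balance-tree node (hypothetical layout).
struct Node {
    std::string name;
    int balance;
    std::vector<Node> children;
};

// Prints the tree; with ignoreBalanced set, the subtree below any node
// whose balance factor is zero is left out.
void PrintTree(const Node& node, bool ignoreBalanced, int depth, std::ostream& out) {
    assert(depth >= 0);
    out << std::string(2 * depth, ' ') << node.name << ": bal=" << node.balance << "\n";
    if (ignoreBalanced && node.balance == 0) return;
    for (const Node& child : node.children)
        PrintTree(child, ignoreBalanced, depth + 1, out);
}
```

Note what gets hidden: a subtree whose AddRef() and Release() happen in different children disappears as long as they cancel out under a common parent, which is exactly the clutter you want gone.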

Playing Mah Jongg. An unbalanced tree is not necessarily an evil thing. More likely, it indicates that one AddRef() is cancelled by another Release() somewhere else in the code. So the game is to try to match them with one another.

Excluding Functions. To aid in this process, you can create an "excludes file" that lists the names of functions that you want to exclude from the tree building process (presumably because you've matched them). make-tree.pl accepts the option --exclude [file], where [file] is a newline-separated list of function names that will be excluded from consideration while building the tree. Specifically, any call stack that contains one of those call sites will not contribute to the computation of balance factors in the tree [3].
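In other words, exclusion works on whole stacks, not on single nodes. A minimal sketch of that filtering step, assuming an excludes file with one function name per line; ParseExcludes and DropExcluded are made-up names, and skipping blank lines is my assumption rather than documented behavior:

```cpp
#include <cassert>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Stack;  // call sites of one logged operation

// Reads an excludes file: one function name per line, blank lines skipped.
std::set<std::string> ParseExcludes(std::istream& in) {
    std::set<std::string> names;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (!line.empty()) names.insert(line);
    }
    return names;
}

// Keeps only the stacks that mention none of the excluded functions.
// A dropped stack contributes nothing to any balance factor.
std::vector<Stack> DropExcluded(const std::vector<Stack>& stacks,
                                const std::set<std::string>& excludes) {
    std::vector<Stack> kept;
    for (const Stack& stack : stacks) {
        assert(!stack.empty());  // every logged operation has at least one frame
        bool excluded = false;
        for (const std::string& site : stack) {
            if (excludes.count(site)) { excluded = true; break; }
        }
        if (!excluded) kept.push_back(stack);
    }
    return kept;
}
```

Because an excluded name knocks out every stack it appears in, excluding a function that sits on both a matched pair and an unmatched AddRef() will hide the unmatched one too.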

pricing & availability

As of this writing, the stack tracing code is implemented for Win32 and i386 Linux (compiled with egcs and glibc 2.0 and 2.1). Donations gladly accepted; Bourbon preferred over other currencies.

The Perl scripts, of course, require only Larry Wall's finest (5.00504 seems to work for me).

find-leakers.pl
make-tree.pl

credits

I stole the stack walking code from Kipp Hickman and Matt Pietrek (see this article). For Linux, Mike Shaver, Bruce Mitchener, and Ramiro Estrugo all helped me get things right. Mucho gusto.

notes

1. Since I've done all this while Kipp and Troy are away, I've left their stuff in the tree. I'm sure when they return from sabbatical, they will see the error of their ways and promptly embrace the New Way.

2. No, grep won't work. It has a maximum line length of 2Kb or something like that, and the stack crawls regularly get up to about 4-5Kb per trace.

3. Using the --exclude option is a black art that I don't fully understand. I've often gotten myself way confused with it. So be careful.

$Id: refcnt-balancer.html,v 1.2 1999/06/16 09:05:38 waterson%netscape.com Exp $