Tuesday, September 13, 2005

how well do you know python, part 9

(Today's questions are very CPython-specific, but that hasn't stopped me before. :)

I spent some time today looking for the source of a bug that caused my program to leak memory. A C module ultimately proved to be at fault; before figuring that out, though, I suspected that something was hanging on to data read over a socket longer than it should have. I decided to check this by summing the lengths of all string objects:

>>> import gc
>>> sum([len(o) for o in gc.get_objects() if isinstance(o, str)])
0

No string objects? Can't be. Let's try this:

>>> a = 'asdfjkl;'
>>> len([o for o in gc.get_objects() if isinstance(o, str)])
0

So:

  1. (Easy) Why don't string objects show up for get_objects()?
  2. (Harder) How can you get a list of live string objects in the interpreter?

12 comments:

Martijn Pieters said...

1) The get_objects() call only returns objects tracked by the GC; I guess strings don't need tracking as they cannot refer to other objects, so cannot create cyclic references.

2) I think gc.get_referrers('') should do it.

Calvin Spealman said...

gc.get_referrers will not work for the same reason. It only gets objects that refer to specific other objects, and strings (and ints, longs, floats, and bools) do not refer to anything.
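The point about untracked types can be checked directly. This is only a sketch, written for Python 3: gc.is_tracked() was added in Python 2.7 and 3.1, so it was not available when this thread was written, and the sample values are arbitrary.

```python
import gc

# Atomic values hold no references to other objects, so CPython's cycle
# collector does not track them. A list is a container and is tracked.
samples = ['asdfjkl;', 42, 3.14, True, ['asdfjkl;']]

for obj in samples:
    # gc.get_referents() lists what the object itself refers to;
    # for a str this is an empty list.
    print(type(obj).__name__, gc.is_tracked(obj), gc.get_referents(obj))
```

Only tracked objects are returned by gc.get_objects(), which is why the list comprehension in the post finds no strings.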

Tim Lesher said...

2) is interesting. My first thought would be to use the inspect module to walk frames, examining f_locals and f_globals to enumerate objects.

I don't know how that interacts with threads, though, and it smells like a great place for Heisenberg's Uncertainty Principle to jump up and bite you.
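The frame-walking idea might look roughly like the following in Python 3. It is a sketch under assumptions, not a complete answer: the function name strings_in_frames is made up for illustration, it uses sys._current_frames() (a CPython-specific call, added in Python 2.5) to reach every thread's stack, and it only sees strings bound directly to a local or global name, not strings nested inside containers or attributes.

```python
import sys

def strings_in_frames():
    """Return strings bound directly to names in any live frame (hypothetical helper)."""
    found = {}
    # sys._current_frames() maps each thread id to that thread's topmost frame.
    for frame in sys._current_frames().values():
        while frame is not None:
            for namespace in (frame.f_locals, frame.f_globals):
                # Copy the values first so the namespace cannot change mid-iteration.
                for value in list(namespace.values()):
                    if isinstance(value, str):
                        found[id(value)] = value  # dedupe by identity
            frame = frame.f_back
    return list(found.values())
```

Other threads keep running while their frames are inspected, so the result is a snapshot at best, which is the uncertainty worry raised above.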

Martijn Pieters said...

gc.get_referrers('') returns all tracked objects that refer to at least the empty string; it is relatively easy to retrieve the strings from that. As all strings must be referenced by something, I think that this is the way to get them all.

When trying it out in a clean python shell I got an awful lot of strings back... Don't try it in ipython!
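A rough Python 3 sketch of this approach follows. The function name is hypothetical, and the code assumes CPython's behaviour of sharing one empty-string object. Note that it can only return strings that sit in the same container as '', so it should not be read as a way to find every string.

```python
import gc

def strings_near_empty_string():
    """Collect strings held by any tracked object that also refers to ''."""
    found = {}
    for referrer in gc.get_referrers(''):
        # Look at everything each referrer points to and keep the strings.
        for obj in gc.get_referents(referrer):
            if isinstance(obj, str):
                found[id(obj)] = obj  # dedupe by identity
    return list(found.values())

print(len(strings_near_empty_string()))
```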

Anonymous said...

gc.get_referrers("___some_dummy_string") works slightly better (you may have to assign it to a temp variable, though) because you only get a few referrers, and one of them is a giant dict of strings in which each key and its value are the same.

If gc tracks (most) container objects, then if you wanted to track integers or other untracked objects maybe you could walk gc.get_objects() yourself. Although that always makes me a bit queasy. Maybe the inspect module could add an object-tree walking function to simplify things and warn about common pitfalls.

Jonathan Ellis said...

(The dict of strings the immediately previous poster refers to is the dict of interned strings. This does _not_ contain all strings in use.)
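A small Python 3 illustration of why the interned dict is incomplete. In Python 3 the function is sys.intern(); in the Python 2 of this post's era it was the builtin intern(). Strings built at runtime are not interned unless something asks for it.

```python
import sys

# Two equal strings built at runtime are separate objects, and neither
# one is placed in the interned-string table automatically.
a = ''.join(['asdf', 'jkl;'])
b = ''.join(['asdf', 'jkl;'])
print(a == b, a is b)

# Interning maps every equal string onto a single shared object.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)
```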

exarkun said...

Here's one solution. I'm not sure if it gets every possible string, but it finds about 20,000.

>>> import sys
>>> from twisted.python import reflect
>>> reflect.findInstances(sys.modules, str)
[lots of output]

findInstances is implemented in terms of objgrep, which takes the starting object you give it (sys.modules in this case) and recursively follows all of its references until it can find no new objects. It doesn't return the objects it finds, of course: it returns strings describing the reference path to each object it finds. Since the visitor is arbitrary and user-specified, changing this is a simple exercise left for the reader.

Joshua C Gilbert said...

What about doing it the way I'd do it in Perl? Walk locals() and globals() and isinstance each object you hit. Traverse down modules, classes, etc., checking each one. Also, keep a trail of breadcrumbs so you don't get caught in reference cycles.

Fun.

Chris Siebenmann said...

Fortunately the gc module can mostly handle this one, so one doesn't have to walk frames or locals()/globals() and so on.

The short answer is that you need to recursively expand the list from gc.get_objects() with gc.get_referents(). Make sure to expand each object only once to avoid duplicates. (You have to do it recursively because gc.get_objects() doesn't return all container objects.)

Actual code for this does not fit in the margin of this comment, so you can find it here (at the bottom).
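The expansion described above can be sketched in a few lines of Python 3. This is not the code linked from the comment; the function name find_live_instances is made up for illustration, and the loop uses an explicit stack instead of recursion to avoid hitting the recursion limit.

```python
import gc

def find_live_instances(kind):
    """Return instances of `kind` reachable from the GC-tracked objects."""
    seen = set()                      # ids of objects already expanded
    found = []
    stack = list(gc.get_objects())    # start from every tracked container
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue                  # expand each object only once
        seen.add(id(obj))
        if isinstance(obj, kind):
            found.append(obj)
        # Follow outgoing references; this is how untracked objects
        # such as strings are reached.
        stack.extend(gc.get_referents(obj))
    return found

strings = find_live_instances(str)
print(len(strings), sum(len(s) for s in strings))
```

The last line gives the total string length the post was originally after. The result also includes strings created by the search's own surroundings, so treat the numbers as approximate.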

Chris Siebenmann said...

I forgot: I don't know the answer to your first question, and I'd like to.

It's clearly not that get_objects() returns only the objects that contain other objects, because it doesn't return all such objects. (Otherwise one wouldn't need recursive expansion of its results, and one does.)

I took a quick look at the 2.4 CPython source code for the gc module, but couldn't see anything obvious to explain it. (get_objects() returns all the objects on the three gc generations lists, but I couldn't decode when objects did or didn't end up on them.)

Chris Siebenmann said...

It turns out I was wrong about some things, especially gc.get_objects() not returning all container objects (my test code had bugs that I missed). Also, expanding gc.get_objects() can miss objects that are being referred to only by C code (e.g., Unix signal handlers set through the signal module). Full details here.

If you really need every live dynamically allocated object, it turns out you need a debug build of CPython.

Viagra Online said...

If blog author can reply for someone's comment follow his comment, I think it is really cool.