
how well do you know python, part 9

(Today's questions are very CPython-specific, but that hasn't stopped me before. :)

I spent some time today looking for the source of a bug that caused my program to leak memory. A C module ultimately proved to be at fault; before figuring that out, though, I suspected that something was hanging on to data read over a socket longer than it should. I decided to check this by summing the length of all string objects:

>>> import gc
>>> sum([len(o) for o in gc.get_objects() if isinstance(o, str)])
0

No string objects? Can't be. Let's try this:

>>> a = 'asdfjkl;'
>>> len([o for o in gc.get_objects() if isinstance(o, str)])
0

So:

  1. (Easy) Why don't string objects show up for get_objects()?
  2. (Harder) How can you get a list of live string objects in the interpreter?

Comments

Anonymous said…
1) The get_objects() call only returns objects tracked by the GC; I guess strings don't need tracking as they cannot refer to other objects, so cannot create cyclic references.

2) I think gc.get_referrers('') should do it.
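
A quick way to check the "not tracked" explanation, assuming a CPython new enough to provide gc.is_tracked() (it is newer than the interpreter this post was written against). The helper name tracking_report is made up for illustration; this is a sketch, not anything from the original post.

```python
import gc

def tracking_report(objects):
    """Map each object's type name to whether the collector tracks that object."""
    return {type(obj).__name__: gc.is_tracked(obj) for obj in objects}

# Strings and numbers cannot hold references, so CPython leaves them untracked;
# a list can hold references, so it is tracked.
report = tracking_report(["asdfjkl;", 42, 3.14, []])
print(report)
# {'str': False, 'int': False, 'float': False, 'list': True}
```

Anything reported as False here will never appear in gc.get_objects(), which is why the sum in the post comes out as 0.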
Calvin Spealman said…
gc.get_referrers will not work for the same reason. It only gets objects that refer to specific other objects, and strings (and ints, longs, floats, and bools) do not refer to anything.
Tim Lesher said…
2) is interesting. My first thought would be to use the inspect module to walk frames, examining f_locals and f_globals to enumerate objects.

I don't know how that interacts with threads, though, and it smells like a great place for Heisenberg's Uncertainty Principle to jump up and bite you.
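
A rough sketch of the frame-walking idea, assuming CPython's sys._current_frames() to reach the stack of every thread. It only sees strings bound directly to a local or global name, so treat the result as a lower bound rather than a full census. The function name is hypothetical.

```python
import sys

def strings_bound_in_frames():
    """Collect strings bound to names in any frame of any running thread."""
    found = {}  # id -> string, so each object is reported once
    for frame in sys._current_frames().values():
        # Walk from the innermost frame of the thread out to the outermost.
        while frame is not None:
            for namespace in (frame.f_locals, frame.f_globals):
                # Copy the values so a namespace changing underneath us
                # does not break the iteration.
                for value in list(namespace.values()):
                    # type() avoids triggering __class__ on proxy objects.
                    if issubclass(type(value), str):
                        found[id(value)] = value
            frame = frame.f_back
    return list(found.values())
```

Other threads keep running while this walks their frames, so the snapshot can be stale by the time it returns, which is exactly the Heisenberg worry above.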
Anonymous said…
gc.get_referrers('') returns all tracked objects that refer to at least the empty string; it is relatively easy to retrieve the strings from that. As all strings must be referenced by something, I think this is the way to get them all.

When trying it out in a clean python shell I got an awful lot of strings back... Don't try it in ipython!
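
One way to "retrieve the strings from that", as a minimal sketch: ask each referrer for its referents and keep the strings. The function name is made up, and this only finds strings that share a container with the target, so it does not by itself prove that every string is reachable this way.

```python
import gc

def strings_sharing_a_referrer(target):
    """Return the strings held by any tracked object that refers to target."""
    found = {}  # id -> string, to drop duplicates
    for referrer in gc.get_referrers(target):
        for obj in gc.get_referents(referrer):
            # type() avoids triggering __class__ on proxy objects.
            if issubclass(type(obj), str):
                found[id(obj)] = obj
    return list(found.values())
```

For example, with holder = [probe, "neighbour"], calling strings_sharing_a_referrer(probe) returns both probe and "neighbour", because the list is a tracked referrer of probe.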
Anonymous said…
gc.get_referrers("___some_dummy_string") works slightly better (may have to assign a temp variable though) because you only get a few referrers and one of them is a giant dict of strings with both key/value the same.

If gc tracks (most) container objects, then if you wanted to track integers or other untracked objects maybe you could walk gc.get_objects() yourself. Although that always makes me a bit queasy. Maybe the inspect module could add an object-tree walking function to simplify things and warn about common pitfalls.
Jonathan Ellis said…
(The dict of strings the immediately previous poster refers to is the dict of interned strings. This does _not_ contain all strings in use.)
Anonymous said…
Here's one solution. I'm not sure if it gets every possible string, but it finds about 20,000.

>>> import sys
>>> from twisted.python import reflect
>>> reflect.findInstances(sys.modules, str)
[lots of output]

findInstances is implemented in terms of objgrep, which takes the starting object you give it (sys.modules in this case) and recursively follows all of its references until it can find no new objects. It doesn't return the objects it finds, of course: it returns strings describing the reference path to each object it finds. Since the visitor is arbitrary and user-specified, changing this is a simple exercise left for the reader.
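
To make the idea concrete without Twisted, here is a much-simplified sketch of a path-reporting walker. It is not Twisted's objgrep: it only follows plain dicts, lists, tuples and module namespaces, and the function name and path format are invented for this example.

```python
import types

def find_instances(root, wanted_type, root_name="root"):
    """Return reference paths (as strings) to instances reachable from root."""
    paths = []
    expanded = set()  # ids of containers whose contents were already pushed
    stack = [(root, root_name)]
    while stack:
        obj, path = stack.pop()
        if issubclass(type(obj), wanted_type):
            paths.append(path)
        if id(obj) in expanded:
            continue  # already walked; this also breaks reference cycles
        if type(obj) is dict:
            expanded.add(id(obj))
            for key, value in list(obj.items()):
                stack.append((key, f"{path} key {key!r}"))
                stack.append((value, f"{path}[{key!r}]"))
        elif type(obj) in (list, tuple):
            expanded.add(id(obj))
            for index, item in enumerate(list(obj)):
                stack.append((item, f"{path}[{index}]"))
        elif issubclass(type(obj), types.ModuleType):
            expanded.add(id(obj))
            stack.append((vars(obj), f"{path}.__dict__"))
    return paths
```

Calling find_instances(sys.modules, str, "sys.modules") gives a rough equivalent of the call above, minus everything reachable only through instances, closures and other object kinds this sketch ignores.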
Anonymous said…
Fortunately the gc module can mostly handle this one, so one doesn't have to walk frames or locals()/globals() and so on.

The short answer is that you need to recursively expand the list from gc.get_objects() with gc.get_referents(). Make sure to expand each object only once to avoid duplicates. (You have to do it recursively because gc.get_objects() doesn't return all container objects.)

Actual code for this does not fit in the margin of this comment, so you can find it here (at the bottom).
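
A compact sketch of that expansion (not the linked code, and the function name is made up): start from gc.get_objects(), follow gc.get_referents() from every object, and expand each object only once.

```python
import gc

def live_objects_of_type(wanted_type):
    """Find instances of wanted_type reachable from GC-tracked objects."""
    seen = {}   # id -> object; holding the object keeps its id from being reused
    found = []
    pending = gc.get_objects()
    while pending:
        obj = pending.pop()
        if id(obj) in seen:
            continue  # expand each object only once
        seen[id(obj)] = obj
        # type() avoids triggering __class__ on proxy objects.
        if issubclass(type(obj), wanted_type):
            found.append(obj)
        # Untracked objects such as strings turn up here, as referents
        # of the tracked containers that hold them.
        pending.extend(gc.get_referents(obj))
    return found
```

With this, the check from the post becomes sum(len(s) for s in live_objects_of_type(str)). It holds a reference to every object it visits until it returns, so expect a noticeable memory spike on a large process.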
Anonymous said…
I forgot: I don't know the answer to your first question, and I'd like to.

It's clearly not that get_objects() returns only the objects that contain other objects, because it doesn't return all such objects. (Otherwise one wouldn't need recursive expansion of its results, and one does.)

I took a quick look at the 2.4 CPython source code for the gc module, but couldn't see anything obvious to explain it. (get_objects() returns all the objects on the three gc generation lists, but I couldn't work out when objects did or didn't end up on them.)
Anonymous said…
It turns out I was wrong about some things, especially gc.get_objects() not returning all container objects (my test code had bugs that I missed). Also, expanding gc.get_objects() can miss objects that are being referred to only by C code (e.g., Unix signal handlers set through the signal module). Full details here.

If you really need every live dynamically allocated object, it turns out you need a debug build of CPython.