Friday, September 21, 2007

That wasn't the pigeonhole I expected

I went to the BYU CS alumni dinner tonight. At one point they briefly put everyone's name and position on a projector, one at a time. (At five seconds apiece it wasn't as tedious as it sounds.)

When it was my turn, it announced "Jonathan Ellis, System Administrator."

What the hell?

It turns out that when I RSVP'd I said I was a "python kung-fu master & sysadmin of last resort." (In the sense that, if you really can't find a better sysadmin, I know enough to be dangerous.)

Don't bother trying to be clever around bureaucrats.

Saturday, September 08, 2007

Utah Open Source Conference 2007

The first Utah Open Source Conference finished today. I heard that they had close to 300 attendees -- not bad at all for a freshman effort.

I reprised presentations that I've given before, on SQLAlchemy and distributed source control. My slides are on the presentations page (although if you've seen my slides from either before, there's not much new there -- I got lucky: SA 0.4 isn't stable yet, so I stuck with 0.3.10).

I had to work Friday so I missed a lot of presentations, but of the ones I saw my favorite was on Ganglia, which I hadn't heard of before but which looks quite useful for anyone who runs a bunch of servers and takes uptime and QoS seriously. (This was actually Brad Nicholes's third presentation of the conference -- he must have been busy!)

Afterwards I went to the board games BoF and played Mag Blast. Fun little game.

Wednesday, September 05, 2007

What it means to "know Python"

Since Adam Barr replied to my post on his book, I'd like to elaborate a little on what I said.

Adam wrote,

[F]or me, "knowing" Python means you understand how slices work, the difference between a list and a tuple, the syntax for defining a dictionary, that indenting thing you do for blocks, and all that. It's not about knowing that there is a sort() function.

In Python, reinventing sort and split is like a C programmer starting a project by writing his own malloc. It just isn't something you see very often. Similarly, I just don't think you can credibly argue that a C programmer who doesn't know how to use malloc really knows C. At some level, libraries do matter.
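
To make the comparison concrete, here is a small sketch (written for Python 3, with a made-up input string) of what reinventing split looks like next to the built-ins. The manual_split function is hypothetical, written only for illustration.

```python
# A hand-rolled whitespace splitter: the kind of code you write
# if you don't know that str.split() exists.
def manual_split(text):
    words, current = [], []
    for ch in text:
        if ch.isspace():
            if current:
                words.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append(''.join(current))
    return words

line = "the quick  brown fox"

# Same result as the built-in, at roughly a dozen times the code.
assert manual_split(line) == line.split()

# The idiomatic version: one call to split, one call to sort.
words = sorted(line.split(), key=len)
print(words)
```

Both versions work; only one of them is something you would expect to see from someone who uses the language regularly.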

On the other hand, I wouldn't claim that you must know all eleventy jillion methods that the Java library exposes in one way or another to say you know Java.

What is the middle ground here?

I think the answer is something along the lines of, "you have to get enough practice actually using the language to be able to write idiomatic code." That's necessarily going to involve picking up some library knowledge along the way.

This made me think: what are the most commonly used Python modules? I decided to scan the Python Cookbook's code base to find out. This is a fairly large sample (over 2000 recipes), and it has the further attraction that most of the scripts there are reasonably standalone, so they aren't full of imports of non-standard modules. The downside is that some of the code dates back at least to the very ancient Python 1.5.

In 2000+ source files and almost 4000 imports of stdlib modules, here are the frequency counts of imported modules.

Is this a reasonable list? I obviously think I qualify as knowing Python well enough to blog about it. Of the modules above the 80% line, _winreg, win32con, and win32api are platform-specific; new is deprecated; string isn't officially deprecated but should be; and __future__ isn't really a module per se. I believe I've used all of the rest but xmlrpclib at some point, although my line of comfort-without-docs would be only about the 60% mark. I think anyone who programs professionally in Python will quickly get to know well at least the modules up to the 50% line.

module            imports
sys               473
os                302
----------------- 24% (cumulative)
time              210
re                145
----------------- 35%
string            140
random            103
threading         66
socket            57
os.path           52
types             50
Tkinter           47
----------------- 50%
math              43
win32com.client   42
__future__        41
traceback         40
itertools         38
doctest           37
urllib            35
cStringIO         33
struct            32
----------------- 60%
win32api          31
getopt            29
thread            29
ctypes            28
StringIO          28
inspect           26
win32con          25
copy              25
cPickle           25
operator          24
datetime          23
cgi               22
----------------- 70%
Queue             22
urllib2           20
md5               20
base64            20
xmlrpclib         19
sets              19
optparse          19
logging           18
weakref           18
shutil            17
unittest          17
pprint            16
urlparse          15
getpass           15
httplib           15
pickle            15
_winreg           14
UserDict          13
signal            13
----------------- 80%

For those interested, a tarball of the recipes I scanned is here, so you don't need to scrape the Cookbook site yourself. The import scanning code is simple enough:

import os, re, compiler
from collections import defaultdict

# define an AST visitor that only cares about "import" and "from [x import y]" nodes
count_by_module = defaultdict(lambda: 0)
class ImportVisitor:
    def visitImport(self, t):
        for m in t.names:
            if not isinstance(m, basestring):
                m = m[0] # strip off "as" part
            count_by_module[m] += 1
    def visitFrom(self, t):
        count_by_module[t.modname] += 1

# parse
for fname in os.listdir('recipes'):
    try:
        ast = compiler.parseFile('recipes/%s' % fname)
    except SyntaxError:
        continue
    compiler.walk(ast, ImportVisitor())
    print 'parsed ' + fname

# some raw stats, for posterity
counts = count_by_module.items()
total = sum(n for module, n in counts)
print '%d/%d total/unique imports' % (total, len(counts))

# strip out non-stdlib modules
for module in count_by_module.keys():
    try:
        __import__(module)
    except (ImportError, ValueError):
        del count_by_module[module]
        
# post-stripped stats
counts = count_by_module.items()
total = sum(n for module, n in counts)
print '%d/%d total/unique imports in stdlib' % (total, len(counts))
counts.sort(key=lambda (module, n): n)

# results
subtotal = 0
for module, n in reversed(counts):
    subtotal += n
    print '%s\t%d' % (module, n)
    print '%f' % (float(subtotal) / total)
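
The scanner above depends on the compiler module and on tuple-unpacking lambdas, neither of which exists in Python 3. For anyone wanting to repeat the experiment on a newer interpreter, here is a rough equivalent built on the standard ast module. It is a sketch rather than a port: count_imports and the sample source are names I made up for illustration, and Python 3 will reject Python 2-only recipes with a SyntaxError, just as the original skipped files it could not parse.

```python
import ast
from collections import Counter

def count_imports(source):
    """Count imported module names in one chunk of Python source."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # alias.name is the module name without any "as" part
                counts[alias.name] += 1
        elif isinstance(node, ast.ImportFrom) and node.module:
            # node.module is None for relative imports like "from . import x"
            counts[node.module] += 1
    return counts

# Made-up sample input; a real run would loop over the recipe files.
sample = (
    "import os, re\n"
    "import os.path as osp\n"
    "from collections import defaultdict\n"
)
counts = count_imports(sample)
for module, n in counts.most_common():
    print('%s\t%d' % (module, n))
```

Like the original visitor, this counts "from x import a, b" as a single import of x.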

Tuesday, September 04, 2007

Merging two subversion repositories

Update: an anonymous commenter pointed out that yes, there is a (much!) better way to do this with svnadmin load --parent-dir, which is covered in the docs under "repository migration." All I can say in my defense is that it wasn't something google thought pertinent. So, for google's benefit: how to merge subversion repositories. Thanks for the pointer, anonymous!
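
For the record, here is roughly what that better way looks like, wrapped in Python to match the rest of this post. Treat it as a sketch under assumptions: the repository paths and the teamA directory name are hypothetical, both repositories are assumed to be absolute filesystem paths on the machine running the script, and the parent directory is created first because, as far as I know, svnadmin load expects it to exist.

```python
from subprocess import Popen, PIPE, check_call

def merge_commands(source_repo, target_repo, parent_dir):
    """Return the (mkdir, dump, load) argument lists for the migration."""
    mkdir = ['svn', 'mkdir', 'file://%s/%s' % (target_repo, parent_dir),
             '-m', 'Create %s for imported history' % parent_dir]
    dump = ['svnadmin', 'dump', source_repo]
    load = ['svnadmin', 'load', '--parent-dir', parent_dir, target_repo]
    return mkdir, dump, load

def merge_repositories(source_repo, target_repo, parent_dir):
    """Run: svnadmin dump SOURCE | svnadmin load --parent-dir DIR TARGET"""
    mkdir, dump, load = merge_commands(source_repo, target_repo, parent_dir)
    check_call(mkdir)
    dumper = Popen(dump, stdout=PIPE)
    loader = Popen(load, stdin=dumper.stdout)
    dumper.stdout.close()  # let the dumper see a broken pipe if the loader dies
    load_status = loader.wait()
    dump_status = dumper.wait()
    if load_status != 0 or dump_status != 0:
        raise RuntimeError('svnadmin dump | svnadmin load failed')

# Hypothetical usage (paths are placeholders, not real repositories):
# merge_repositories('/srv/svn/team-a', '/srv/svn/team-b', 'teamA')
```

Try it on a copy of the target repository first; loading a dump is not something you can easily undo.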

I needed to merge team A's svn repository into team B's. I wanted to preserve the history of team A's commits as much as reasonably possible. I would have thought that someone had written a tool to do this, but I couldn't find one, so I wrote this. (Of course, now that I'm posting this, I fully expect someone to point me to a better, pre-existing implementation that I missed.)

The approach is to take a working copy of repository B, add a directory for A's code, and for each revision in A's repository, apply that to the working copy and commit it. This would be easy if svn merge would allow applying diffs from repository A into a working copy of repository B, but it does not. I can't think of a technical reason for this. (In fact, I seem to remember that early versions of the svn client did allow this, with dire warnings, but I could be mistaken and I don't have a 1.1 client around anymore.)

So I tried instead to use "svn diff |patch -p0", which worked great up until the first commit with a binary file. Oops. For the final version I ended up having to create a working copy for A, update to each revision there, then rsync to the right point in working copy B and call the "svnaddremove" script to mark files added or deleted. (This is suboptimal since we can get the exact changed paths from svn, and just copy those files over, but rsync is fast enough as long as your working copies stay in cache. The update and commit steps both consistently took longer than rsync in my timing.)
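
As for getting the exact changed paths from svn: the verbose XML log ("svn log --xml -v") lists them per revision, so a less lazy version could copy only those files. A minimal sketch of the parsing side, in Python 3, with a made-up log entry as sample data (changed_paths is my name for it, not part of the script below):

```python
from xml.etree import ElementTree as ET

def changed_paths(log_xml):
    """Map revision number -> [(action, path), ...] from `svn log --xml -v`."""
    result = {}
    for entry in ET.fromstring(log_xml).findall('logentry'):
        rev = int(entry.get('revision'))
        result[rev] = [(p.get('action'), p.text)
                       for p in entry.findall('paths/path')]
    return result

# Made-up sample in the shape svn produces; real output has more attributes.
sample = """<log>
<logentry revision="7">
<author>alice</author>
<date>2007-09-04T12:00:00.000000Z</date>
<paths>
<path action="M">/trunk/setup.py</path>
<path action="A">/trunk/logo.png</path>
</paths>
<msg>add logo</msg>
</logentry>
</log>"""
print(changed_paths(sample))
```

The paths are repository paths, so they would still need mapping onto the two working copies before anything could be copied.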

My script does not try to be intelligent about copies or moves that svn knows about. Team A did not use branches or tags much so I didn't put the effort in to deal with those the "right" way (which would be to also issue a cp/mv on B's working copy to preserve history). It also commits each revision as a Unix user with the same name as the original committer. Doing this obviously requires at least access to the repository server to add the right users. I used "svn log -q |grep ^r |awk '{print $3}' |sort |uniq |xargs -n 1 useradd".
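
The same list of committers can be pulled out of the XML log that the merge script already parses, which avoids depending on the column layout of "svn log -q". A small sketch in Python 3, with made-up sample data; unique_authors is a hypothetical helper, and creating the accounts is still left to useradd.

```python
from xml.etree import ElementTree as ET

def unique_authors(log_xml):
    """Return the sorted set of committer names found in `svn log --xml` output."""
    authors = set()
    for entry in ET.fromstring(log_xml).findall('logentry'):
        author = entry.findtext('author')
        if author:  # some revisions have no author recorded
            authors.add(author)
    return sorted(authors)

# Made-up sample: two commits by alice, one by bob, one with no author.
sample = """<log>
<logentry revision="1"><author>bob</author><msg>init</msg></logentry>
<logentry revision="2"><author>alice</author><msg>fix</msg></logentry>
<logentry revision="3"><msg>no author</msg></logentry>
<logentry revision="4"><author>alice</author><msg>more</msg></logentry>
</log>"""
print(unique_authors(sample))
```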

Final note: the perl script in svnaddremove is a long way of writing "awk '{print $2}'", except that it preserves filenames with spaces in them. There is probably a much more clever way of doing this, too.
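
One possibly less clever way is to do the status parsing in Python instead. The sketch below (Python 3, made-up sample output) assumes that for "?" and "!" entries everything after the first status character is padding followed by the path, which matches the svn st output I have seen but is not something I have checked against every client version. paths_with_status is a hypothetical helper name.

```python
def paths_with_status(status_output, flag):
    """Return paths from `svn st` output whose first status column is `flag`."""
    paths = []
    for line in status_output.splitlines():
        if line.startswith(flag):
            # Drop the status character, then the padding before the path.
            # Spaces inside the filename are left untouched.
            paths.append(line[1:].lstrip())
    return paths

# Made-up sample in the shape of `svn st` output.
sample = "?       new file.txt\n!       gone.py\nM       kept.py\n"
print(paths_with_status(sample, '?'))
print(paths_with_status(sample, '!'))
```

A filename that itself begins with a space would still be mangled by the lstrip, so this is no more bulletproof than the perl.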

Here, then, is the merge script:

#!/usr/bin/python

# usage: svnimport <source wc path> <target wc path> <revstart> <revend>
# e.g. svnimport liberte-source trunk/liberte 1 2000
# (imports revisions revstart+1 through revend)

from subprocess import Popen, PIPE
try:
    from xml.etree import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET
import sys, time

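def system(*args):
    p = Popen(args, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    # treat any stderr output or a nonzero exit status as failure
    if err or p.returncode != 0:
        raise RuntimeError('%s failed: %s' % (' '.join(args), err))
    return out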

# super-minimal log scraper
# for a better one see hgsvn's svnclient.py, http://cheeseshop.python.org/pypi/hgsvn
def parse_date(svn_date):
    date = svn_date.split('.', 2)[0]
    return time.strftime("%Y-%m-%d", time.strptime(date, "%Y-%m-%dT%H:%M:%S"))
def parse_svn_log_xml(xml):
    tree = ET.fromstring(xml)
    for entry in tree.findall('logentry'):
        d = {}
        d['revision'] = int(entry.get('revision'))
        author = entry.find('author')
        d['author'] = author is not None and author.text or None
        d['message'] = entry.find('msg').text or ""
        d['date'] = parse_date(entry.find('date').text)
        yield d
        
def edited_message(entry):
    msg = entry['message'].strip().replace('\r\n', '\n')
    addendum = '[original revision %s committed %s]' % (entry['revision'], entry['date'])
    if msg:
        return msg + '\n' + addendum
    return addendum

sourcepath, targetpath, revstart, revstop = sys.argv[1:]
# rsync foo bar and rsync foo/ bar/ are very different!
if not sourcepath.endswith('/'):
    sourcepath += '/'
if not targetpath.endswith('/'):
    targetpath += '/'

xml = system('svn', 'log', sourcepath, '--xml', '-r', '%s:%s' % (int(revstart) + 1, revstop))
for entry in parse_svn_log_xml(xml):
    revno = entry['revision']
    print 'merging revision %d by %s' % (revno, entry['author'])
    # merge in the revision
    system('svn', 'up', '-r', str(revno), sourcepath)
    print '\trsync'
    system('rsync', '-a', '--exclude=.svn', '--delete', sourcepath, targetpath)
    system('/tmp/svnaddremove', targetpath) # svn should add this.  hg already did.
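    # commit as the correct author, if available
    author = entry['author']
    quoted_message = edited_message(entry).replace('"', "'")
    if author:
        print '\tchown'
        system('chown', '-R', author, targetpath)
        print '\tci'
        system('su', author, '-c', 'svn ci %s -m "%s"' % (targetpath, quoted_message))
    else:
        # no author recorded for this revision; commit as the current user
        print '\tci'
        system('svn', 'ci', targetpath, '-m', quoted_message)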
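
One weak spot in the script is the commit command: swapping double quotes for single quotes keeps the shell from choking on the message, but a "$" or a backtick in a commit message would still be interpreted. A sketch of a sturdier way to build that command, assuming Python 3.3+ where shlex.quote exists (on Python 2 the equivalent is pipes.quote). commit_command is a hypothetical helper, and the author and message are made up.

```python
import shlex

def commit_command(author, targetpath, message):
    """Build the argument list for committing as `author` via su,
    quoting the path and message so the shell passes them through verbatim."""
    inner = 'svn ci %s -m %s' % (shlex.quote(targetpath), shlex.quote(message))
    return ['su', author, '-c', inner]

cmd = commit_command('alice', 'trunk/liberte/', 'fix "quotes" and $HOME')
print(cmd[-1])

# Splitting the string the way a POSIX shell would gives back the original message.
assert shlex.split(cmd[-1])[-1] == 'fix "quotes" and $HOME'
```

With this the original message can be committed unmodified, double quotes and all.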

And here is svnaddremove:

#!/bin/bash

# odd, xargs is invoking svn add/rm w/ no args when grep returns no lines.
# fix that with the ifs.
# (don't use grep -q or svn gets pissed about broken pipe.)

if svn st "$1" | grep ^\? > /dev/null; then
  svn st "$1" | perl -ne 'chomp; @Fld = split(q{ }, $_, -1); if (/^\?/) { shift @Fld; print join(q{ }, @Fld) . "\n"; }' | xargs -n 1 -i svn add "{}"
fi
if svn st "$1" | grep ^\! > /dev/null; then
  svn st "$1" | perl -ne 'chomp; @Fld = split(q{ }, $_, -1); if (/^\!/) { shift @Fld; print join(q{ }, @Fld) . "\n"; }' | xargs -n 1 -i svn rm "{}"
fi