## Sunday, December 30, 2007

### Scala: first impressions

I'm reading the prerelease of the Scala book, since I'm working for a heavily Java-invested organization now and programming in Java feels like running a race with cement shoes. Politically, Jython doesn't seem like an option; Scala might be an easier sell.

Here are some of my impressions from going through the book.

### Scala is a decent scripting language

[updated this section thanks to comments from anonymous, Jörn, and Eric.]

Here's how you can loop through each line in a file.

Python:

```python
import sys

for line in open(sys.argv[1]):  # argv[1]: argv[0] is the script name itself
    print line
```

Scala:

```scala
import scala.io.Source

Source.fromFile(args(0)).getLines.foreach(println)
```

The scala interpreter doubles as a REPL, which is nice for experimenting.

### Getters and setters

One thing I despise in Java is the huge amount of wasted lines of code dedicated to getFoo and setFoo methods. Sure, your IDE can autogenerate these for you, but it still takes up lines in your editor and effort to mentally block them out when examining an unfamiliar class to determine what it does.

C# has Python-style properties, so in theory it could be virtually free of this kind of boilerplate, since there is no syntactic difference between "foo.x = y" whether x is a raw field or a property. So the right thing to do, which you'll see in Python code, is to use a raw public field until you actually need extra logic, at which point you replace it with a property and nobody's code breaks.
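In Python that progression looks like this -- a hypothetical Point class that starts with a raw attribute and later grows validation, without callers having to change a thing:

```python
# Version 1: a plain attribute, zero boilerplate.
class PointV1(object):
    def __init__(self):
        self.x = 0

# Version 2: the attribute becomes a property with validation;
# callers still write "p.x = 5" exactly as before.
class PointV2(object):
    def __init__(self):
        self._x = 0

    @property
    def x(self):
        return self._x

    @x.setter
    def x(self, value):
        if value < 0:
            raise ValueError("x must be non-negative")
        self._x = value

p = PointV2()
p.x = 5          # same syntax as the raw-field version
print(p.x)       # -> 5
```

The class names and the validation rule are invented for illustration; the point is that the access syntax never changes.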

But C# wasn't far enough removed from Java culturally, so everyone writes boilerplate properties instead of boilerplate getters and setters. (I'm aware that in a compiled language like C# it makes sense to start with properties for classes in certain kinds of libraries, but these make up a vanishingly small number of actual classes in the wild.)

This is a long way of saying that Scala's properties make a lot of sense given the audience they are targeting (Java programmers). Scala vars are automatically turned into properties (or getters and setters, if you prefer to think in terms of those). So even the most obstinate fan of boilerplate code has absolutely no reason to keep writing unnecessary getters and setters.

If you need to manually tweak your properties later, you can magically define a setter for "field" by naming a method "field_=". This seems a bit like a one-off hack rather than part of a well-designed system, but I can live with it. (Since parentheses are optional for a scala method taking no arguments, any no-argument method call is already syntactically indistinguishable from a raw field read.)

### Scala doesn't know what it wants to be when it grows up, yet

Scala contains a lot of features from a lot of different influences, which means there are often competing styles to choose from; the young scala community hasn't yet decided which to emphasize. For instance, iteration vs traversal -- "for (item <- items)" vs "items.foreach" -- or even whether to leave out optional (inferred) type declarations.

Perhaps this is merely a failure of one book trying too hard to make scala be all things to all readers.

### Random thoughts

• Scala needs something like Python's enumerate function; if you want to loop over each object in a collection but you need its index too, you have to write a manual for loop.
• Using parentheses for collection access (array(0), map("foo")) makes it unnecessarily difficult to tell whether you're looking at a method call or a collection access.
• The scala stdlib is not very google-able yet; if you don't know exactly which class you are looking for ahead of time, you probably won't find it. For example, when looking for a scala object that could iterate through lines in a file, I correctly guessed the scala.io package, but would never have bothered looking at Source until more googling turned it up in a blog entry.
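For reference, the enumerate gap in the first bullet is this Python idiom -- index and item in one loop, with the manual-indexing alternative for contrast:

```python
items = ['a', 'b', 'c']

# with enumerate: index and item arrive together
for i, item in enumerate(items):
    print(i, item)

# without it: the manual index loop the bullet complains about
for i in range(len(items)):
    print(i, items[i])
```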

### Conclusion and disclaimer

This is long enough; I'll probably write more as I go through more of the book. Patterns and implicit conversions are particularly interesting.

I was attracted to scala while looking for a sane alternative to Java on the JVM and it looks like scala might "drag Java programmers halfway to Python," to paraphrase Guy Steele. And I'm happy to be corrected on anything I've written here.

## Friday, December 28, 2007

### Troubleshooting the ps3 wireless network connection, including error 80130128

My father got a ps3 for Christmas, but ran into some problems getting it on his wireless network. The first one was "connection error 80130128" after configuring it to use DHCP. I couldn't google anything useful about this; just a few other hapless victims asking if anyone had any ideas. Fortunately Dad had his laptop there too and noticed Windows complaining that two machines on the network were both using the same IP. So, over the phone, I walked him through setting up the ps3 with a static address:
1. on his laptop, run -> cmd
2. ipconfig
3. Read the "gateway" ip. Put that into his browser to go to his router's admin page
4. Find the DHCP settings for his router to see what range of IPs it hands out; pick one outside that range
5. Set up the ps3 with that IP, the router IP as primary dns, and an opendns server as secondary
This made the connection test happy. But when he tried to go to the playstation store, it gave a DNS error. If he repeated the connection test again, it failed too. "Well," I told him, "It's supposed to try both DNS servers. But we can try setting the primary DNS server to opendns as well." Once he did that, everything worked.

## Friday, November 16, 2007

### Reed-Solomon libraries

If you want to run a multi-petabyte storage system, you don't want to do it with Raid 5 or Raid 6: with modern disks' ~3% per year failure rate, that's 300 failures a year when you have 10,000 disks, and the odds start to get pretty good (relatively speaking) that at some point you'll lose a third disk from an array while two others are rebuilding, and face permanent data loss. And of course monitoring and replacing disks in lots of small arrays is manpower-intensive, which to investors translates as "expensive."

You probably don't want to go with triplication, either; disks are cheap, but not so cheap that you want to triple your hardware costs unnecessarily. While storing multiple copies of frequently used data is good, all your data probably isn't "frequently used."

What is the solution? As it turns out, Raid is actually a special case of Reed-Solomon encoding, which lets you specify any degree of redundancy you want. You can be safer than triplication with a fraction of the space needed.
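To put rough numbers on that claim, here's a back-of-the-envelope sketch. It assumes independent annual disk failures and ignores rebuilds entirely (real systems repair continuously, so these are not real-world loss rates -- just a way to compare schemes), and uses a hypothetical 10-data/6-parity group:

```python
from math import comb

def overhead(k, m):
    # Reed-Solomon with k data + m parity fragments:
    # any k of the k+m fragments reconstruct the data
    return (k + m) / k

def p_loss(n, m, p):
    # probability that more than m of n disks fail in a year
    # (i.i.d. failures, no repair -- a crude model for comparison only)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(m + 1, n + 1))

print(overhead(10, 6))       # 1.6x raw space, survives any 6 of 16 failures
print(p_loss(3, 2, 0.03))    # triplication: 3x space, dies when all 3 copies fail
print(p_loss(16, 6, 0.03))   # 10+6 group: cheaper than 3x AND lower loss probability
```

Under this (crude) model the 10+6 group is both roughly half the cost of triplication and orders of magnitude less likely to lose data.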

I was prompted to write this because Mozy open-sourced the Reed-Solomon library I used while I was there, librs, complete with Python bindings. The original librs we used at Mozy was written by Byron Clark, a formidable task. Later we switched to the version you see on sourceforge, based on Plank's original encoder. I wasn't involved with librs at all except to fix a couple reference leaks in the Python wrapper.

But if you're actually looking for an rs library to use, Alen Peacock, who is much more knowledgeable than I about the gory details involved here, tells me that if you are starting from scratch the two libraries you should evaluate are zfec, which also comes with Python bindings, and Jerasure which is an updated -- i.e., probably faster than his first -- encoder by Plank. (Jerasure has nothing to do with Java.)

## Thursday, October 18, 2007

### Utah Data Recovery

About three years ago (so pre-Mozy and definitely pre-Mac Mozy) my brother had his powerbook hard disk die. As in, not just mostly dead -- it would not power up. It had a lot of stuff on it that he didn't want to lose, but he felt like the usual suspects who charge $1k to $2k for data recovery were a rip off. So he hung onto the disk in case a cheaper option came along.

Then just recently, when I saw some people on a local linux group mailing list recommend utah data rescue, I suggested to my brother that he give it a try. UTDR starts at "only" $300. UTDR did indeed recover the data, although they charged $100 extra for this one. Mac fee? Tricky hw problem? I don't know. But it was still a lot cheaper than the other companies I googled for fixing a physically dead drive. (As opposed to a corrupt partition table or something where the hardware itself was okay.) At least, the ones that actually give you a price up front rather than hiding behind "request a quote!"

## Tuesday, October 09, 2007

### Semi-automatic software installation on HP-UX, with dependencies

I had to install subversion on a couple HP-UX boxes. Fortunately, there's an HP-UX software archive out there with precompiled versions of lots of software. Unfortunately, dependency resolution is like the bad old days of 1997: entirely manual. And there are fifteen or so dependencies for subversion.

So, I wrote a script to parse the dependencies and download the packages automatically. It requires Python -- which you can install from the archive with just the Python package and the Db package -- and BeautifulSoup, which you can google for. Usage is

```
hpuxinstaller <archive package url> <package name>
[e.g., hpuxinstaller http://hpux.cs.utah.edu/hppd/hpux/Development/Tools/subversion-1.4.4/ subversion]
gunzip *.gz
[paste in conveniently given swinstall commands]
```


Here is the script:

```python
#!/usr/local/bin/python

import urlparse, urllib2, sys, os
from subprocess import Popen, PIPE
from BeautifulSoup import BeautifulSoup

required = {}
if not os.path.exists('cache'):
    os.mkdir('cache')

def getcachedpage(url):
    fname = 'cache/' + url.replace('/', '-')
    try:
        page = file(fname).read()
    except IOError:
        print 'fetching ' + url
        page = urllib2.urlopen(url).read()
        file(fname, 'wb').write(page)
    return page

def dependencies(url):
    scheme, netloc, _, _, _, _ = urlparse.urlparse(url)
    soup = BeautifulSoup(getcachedpage(url))
    text = soup.find('td', text='Run-time dependencies:')
    if not text:
        return
    tr = text.parent.parent
    td = tr.findAll('td')[1]
    for a in td.findAll('a'):
        yield (a.contents[0], '%s://%s%s' % (scheme, netloc, a['href']))

def require(name, url):
    required[name] = url
    for depname, depurl in dependencies(url):
        if depname in required:
            continue
        print "%s requires %s" % (name, depname)
        require(depname, depurl)

def download(full_url):
    _, _, path, _, _, _ = urlparse.urlparse(full_url)
    fname = os.path.basename(path)
    f = file(fname, 'wb')
    def chunkify_to_eof(stream, chunksize=64*1024):
        while True:
            data = stream.read(chunksize)
            if not data:
                break
            yield data
    for chunk in chunkify_to_eof(urllib2.urlopen(full_url)):
        f.write(chunk)
    f.close()

# Compute dependencies before checking for installed files, since swinstall
# can let a package be installed w/o its dependencies.  If there are such
# packages installed we don't want to skip their [missing] dependencies.
url, name = sys.argv[1:3]
require(name, url)

try:
    p = Popen(['swlist'], stdout=PIPE)
except OSError:
    print 'Warning: unable to list installed packages'
    installed = set()
else:
    installed = set(line.strip().split()[0] for line in p.stdout if line.strip())

to_install = []
for name, url in required.iteritems():
    if name in installed:
        print name + ' is already installed'
        continue
    full_url = '%s%s-ia64-11.23.depot.gz' % (url.replace('/hppd/', '/ftp/'), url.split('/')[-2])
    print 'downloading ' + full_url
    download(full_url)
    to_install.append(os.path.basename(full_url))

if to_install:
    print "\nAfter gunzip, run:"
    for fname in to_install:
        print "swinstall -s %s/%s %s" % (os.getcwd(), fname[:-3], fname.split('-')[0])
else:
    print 'Nothing to install.'
```

## Friday, October 05, 2007

### Congratulations, Mozy

I left backup service provider Mozy about three months ago, and yesterday they were acquired by EMC as rumored by techcrunch earlier.

The cool thing about startups is they pretty much have to hire people who are totally not qualified to do awesome things and let them try. There's no way Amazon would have hired me to write S3, but that's what I did for Mozy.

Mozy was the third startup I've been a part of, and the first to amount to anything. I was employee #3 and saw it grow from sharing a single rented office to 50 employees in two years. With people who didn't think it was strange to wear a tie to work. Trippy.

Unfortunately I'm not there to witness the final stage of being assimilated by the Borg firsthand, but I hear that's not really any more fun than it sounds so perhaps it's just as well.

Nice work, guys.

## Tuesday, October 02, 2007

### Wing IDE 3, Wing IDE 101 released

Wing IDE version 3 has been released.

The list of new features is a little underwhelming. Multi-threaded debugging and the unit testing tool (only supporting unittest -- does anyone still use that old module?) are nice, but I don't see myself paying to upgrade from 2.1 yet. Now if they could get the GUI to keep up with my typing on Windows, I'd pay for that... I guess this is a sign that Python IDEs are nearing maturity; Komodo 4 didn't have any earth-shaking new features either, at least as far as Python was concerned.

(Personally I think someone should start supporting django/genshi/mako templates already. Maybe in 3.1, guys?)

Following ActiveState's lead, Wingware has also released a completely free version, Wing IDE 101. The main difference: where the most essential feature Komodo Edit leaves out as an incentive to upgrade is the debugger, Wing IDE 101 includes the debugger but omits code completion. Wingware also continues to offer the low-cost Personal edition.

But the really big difference between Wing IDE 101 and Komodo Edit is that you can freely use Komodo Edit for paying work. Wing IDE 101, like Wing IDE Personal, has a no-commercial-use clause. (Komodo versions compared; Wing versions compared.) I'm still of the opinion that at $180, Wing Professional will pay for itself in short order, but for the hobbyist, Komodo Edit is very compelling. I've been using it myself for TCL and XML editing for several months now and it's a nice little IDE.

Too bad Komodo's emacs bindings continue to suck balls -- I mean, it's one thing to not implement fancy things like a minibuffer or kill ring, but if you can't even get C-W (cut) right, there's not much hope. Users contributed much-improved Emacs bindings to the ActiveState bug tracker way back in the version 3 timeframe. I guess ActiveState just doesn't care.

## Friday, September 21, 2007

### That wasn't the pigeonhole I expected

I went to the BYU CS alumni dinner tonight. At one point they briefly put everyone's name and position on a projector, one at a time. (At five seconds apiece it wasn't as tedious as it sounds.) When it was my turn, it announced "Jonathan Ellis, System Administrator."

What the hell?

It turns out that when I RSVP'd I said I was a "python kung-fu master & sysadmin of last resort." (In the sense that, if you really can't find a better sysadmin, I know enough to be dangerous.)

Don't bother trying to be clever around bureaucrats.

## Saturday, September 08, 2007

### Utah Open Source Conference 2007

The first Utah Open Source Conference finished today. I heard that they had close to 300 attendees -- not bad at all for a freshman effort. I reprised presentations that I've given before, on SQLAlchemy and distributed source control. My slides are on the presentations page (although if you've seen my slides from either before, there's not much new there -- I got lucky: SA 0.4 isn't stable yet, so I stuck with 0.3.10).
I had to work Friday so I missed a lot of presentations, but of the ones I saw my favorite was on Ganglia, which I hadn't heard of before but which looks quite useful for anyone running a bunch of servers who takes uptime and QOS seriously. (This was actually Brad Nicholes's third presentation of the conference -- he must have been busy!) Afterwards I went to the board games BoF and played Mag Blast. Fun little game.

## Wednesday, September 05, 2007

### What it means to "know Python"

Since Adam Barr replied to my post on his book, I'd like to elaborate a little on what I said. Adam wrote,

> [F]or me, "knowing" Python means you understand how slices work, the difference between a list and a tuple, the syntax for defining a dictionary, that indenting thing you do for blocks, and all that. It's not about knowing that there is a sort() function.

In Python, reinventing sort and split is like a C programmer starting a project by writing his own malloc. It just isn't something you see very often. Similarly, I just don't think you can credibly argue that a C programmer who doesn't know how to use malloc really knows C. At some level, libraries do matter.

On the other hand, I wouldn't claim that you must know all eleventy jillion methods that the Java library exposes in one way or another to say you know Java. What is the middle ground here? I think the answer is something along the lines of, "you have to get enough practice actually using the language to be able to write idiomatic code." That's necessarily going to involve picking up some library knowledge along the way.

This made me think: what are the most commonly used Python modules? I decided to scan the Python Cookbook's code base and find out. This is a fairly large sample (over 2000 recipes), and further attractive in that most of the scripts there are reasonably standalone, so they're not filled with imports of lots of non-standard modules.
The downside is that there is code dating back at least to the very ancient Python 1.5.

In 2000+ source files and almost 4000 imports of stdlib modules, here are the frequency counts of imported modules.

Is this a reasonable list? I obviously think I qualify as knowing Python well enough to blog about it. Of the modules above the 80% line, _winreg, win32con, and win32api are platform-specific; new is deprecated, string isn't officially deprecated but should be, and __future__ isn't really a module per se. I believe I've used all of the rest but xmlrpclib at some point, although my line of comfort-without-docs would be only about the 60% mark. I think anyone who programs professionally will quickly get to knowing well at least the modules up to the 50% line.

```
sys              473
os               302   24%
time             210
re               145   35%
string           140
random           103
threading         66
socket            57
os.path           52
types             50
Tkinter           47   50%
math              43
win32com.client   42
__future__        41
traceback         40
itertools         38
doctest           37
urllib            35
cStringIO         33
struct            32   60%
win32api          31
getopt            29
thread            29
ctypes            28
StringIO          28
inspect           26
win32con          25
copy              25
cPickle           25
operator          24
datetime          23
cgi               22   70%
Queue             22
urllib2           20
md5               20
base64            20
xmlrpclib         19
sets              19
optparse          19
logging           18
weakref           18
shutil            17
unittest          17
pprint            16
urlparse          15
getpass           15
httplib           15
pickle            15
_winreg           14
UserDict          13
signal            13   80%
```

For those interested, a tarball of the recipes I scanned is here, so you don't need to scrape the Cookbook site yourself.
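As an aside for anyone repeating this today: the compiler module the scan below relies on was removed in Python 3. A minimal modern equivalent using the stdlib ast module might look like this (a sketch, not the script that produced the numbers above):

```python
import ast
from collections import Counter

def count_imports(sources):
    """Count top-level module names imported across a list of source strings."""
    counts = Counter()
    for src in sources:
        try:
            tree = ast.parse(src)
        except SyntaxError:
            continue  # skip ancient/broken recipes, as the original scan did
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    counts[alias.name] += 1
            elif isinstance(node, ast.ImportFrom) and node.module:
                counts[node.module] += 1
    return counts

print(count_imports(["import os, sys", "from os.path import join"]))
```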
The import scanning code is simple enough:

```python
import os, re, compiler
from collections import defaultdict

# define an AST visitor that only cares about "import" and "from [x import y]" nodes
count_by_module = defaultdict(lambda: 0)

class ImportVisitor:
    def visitImport(self, t):
        for m in t.names:
            if not isinstance(m, basestring):
                m = m[0]  # strip off "as" part
            count_by_module[m] += 1

    def visitFrom(self, t):
        count_by_module[t.modname] += 1

# parse
for fname in os.listdir('recipes'):
    try:
        ast = compiler.parseFile('recipes/%s' % fname)
    except SyntaxError:
        continue
    compiler.walk(ast, ImportVisitor())
    print 'parsed ' + fname

# some raw stats, for posterity
counts = count_by_module.items()
total = sum(n for module, n in counts)
print '%d/%d total/unique imports' % (total, len(counts))

# strip out non-stdlib modules
for module in count_by_module.keys():
    try:
        __import__(module)
    except (ImportError, ValueError):
        del count_by_module[module]

# post-stripped stats
counts = count_by_module.items()
total = sum(n for module, n in counts)
print '%d/%d total/unique imports in stdlib' % (total, len(counts))
counts.sort(key=lambda (module, n): n)

# results
subtotal = 0
for module, n in reversed(counts):
    subtotal += n
    print '%s\t%d' % (module, n)
    print '%f' % (float(subtotal) / total)
```

## Tuesday, September 04, 2007

### Merging two subversion repositories

Update: an anonymous commenter pointed out that yes, there is a (much!) better way to do this with svnadmin load --parent-dir, which is covered in the docs under "repository migration." All I can say in my defense is that it wasn't something google thought pertinent. So, for google's benefit: how to merge subversion repositories. Thanks for the pointer, anonymous!

I needed to merge team A's svn repository into team B's. I wanted to preserve the history of team A's commits as much as reasonably possible. I would have thought that someone had written a tool to do this, but I couldn't find one, so I wrote this.
(Of course, now that I'm posting this, I fully expect someone to point me to a better, pre-existing implementation that I missed.)

The approach is to take a working copy of repository B, add a directory for A's code, and for each revision in A's repository, apply that to the working copy and commit it. This would be easy if svn merge would allow applying diffs from repository A into a working copy of repository B, but it does not. I can't think of a technical reason for this. (In fact, I seem to remember that early versions of the svn client did allow this, with dire warnings, but I could be mistaken and I don't have a 1.1 client around anymore.)

So I tried instead to use "svn diff | patch -p0", which worked great up until the first commit with a binary file. Oops.

For the final version I ended up having to create a working copy for A, update to each revision there, then rsync to the right point in working copy B and call the "svnaddremove" script to mark files added or deleted. (This is suboptimal, since we could get the exact changed paths from svn and copy just those files over, but rsync is fast enough as long as your working copies stay in cache. The update and commit steps both consistently took longer than rsync in my timing.)

My script does not try to be intelligent about copies or moves that svn knows about. Team A did not use branches or tags much, so I didn't put the effort in to deal with those the "right" way (which would be to also issue a cp/mv on B's working copy to preserve history). It also uses unix users with the same names as the original committers to commit revisions, which obviously requires at least enough access to the repository server to add the right users. I got the list of names to useradd with "svn log -q | grep ^r | awk '{print $3}' | sort | uniq".

Final note: the perl script in svnaddremove is a long way of writing "awk '{print $2}'", except that it preserves filenames with spaces in them. There is probably a much more clever way of doing this, too.

Here, then, is the merge script:

```python
#!/usr/bin/python
# usage: svnimport <source wc path> <target wc path> <revstart> <revend>
# e.g. svnimport liberte-source trunk/liberte 1 2000

from subprocess import Popen, PIPE
try:
    from xml.etree import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET
import sys, time

def system(*args):
    p = Popen(args, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    if err:
        raise Exception(err)
    return out

# super-minimal log scraper
# for a better one see hgsvn's svnclient.py, http://cheeseshop.python.org/pypi/hgsvn
def parse_date(svn_date):
    date = svn_date.split('.', 2)[0]
    return time.strftime("%Y-%m-%d", time.strptime(date, "%Y-%m-%dT%H:%M:%S"))

def parse_svn_log_xml(xml):
    tree = ET.fromstring(xml)
    for entry in tree.findall('logentry'):
        d = {}
        d['revision'] = int(entry.get('revision'))
        author = entry.find('author')
        d['author'] = author is not None and author.text or None
        d['message'] = entry.find('msg').text or ""
        d['date'] = parse_date(entry.find('date').text)
        yield d

def edited_message(entry):
    msg = entry['message'].strip().replace('\r\n', '\n')
    addendum = '[original revision %s committed %s]' % (entry['revision'], entry['date'])
    if msg:
        return msg + '\n' + addendum
    return addendum

sourcepath, targetpath, revstart, revstop = sys.argv[1:]

# rsync foo bar and rsync foo/ bar/ are very different!
if not sourcepath.endswith('/'):
    sourcepath += '/'
if not targetpath.endswith('/'):
    targetpath += '/'

xml = system('svn', 'log', sourcepath, '--xml', '-r', '%s:%s' % (int(revstart) + 1, revstop))
for entry in parse_svn_log_xml(xml):
    revno = entry['revision']
    print 'merging revision %d by %s' % (revno, entry['author'])

    # merge in the revision
    system('svn', 'up', '-r', str(revno), sourcepath)
    print '\trsync'
    system('rsync', '-a', '--exclude=.svn', '--delete', sourcepath, targetpath)
    system('/tmp/svnaddremove', targetpath)  # svn should add this. hg already did.

    # commit as the correct author, if available
    author = entry['author']
    print '\tchown'
    system('chown', '-R', author, targetpath)
    quoted_message = edited_message(entry).replace('"', "'")
    print '\tci'
    system('su', author, '-c', 'svn ci %s -m "%s"' % (targetpath, quoted_message))
```

And here is svnaddremove:

```bash
#!/bin/bash
# odd, xargs is invoking svn add/rm w/ no args when grep returns no lines.
# fix that with the ifs.
# (don't use grep -q or svn gets pissed about broken pipe.)
if svn st $1 | grep ^\? > /dev/null; then
    svn st $1 | perl -ne 'chomp; @Fld = split(q{ }, $_, -1); if (/^\?/) { shift @Fld; print join(q{ }, @Fld) . "\n"; }' | xargs -n 1 -i svn add "{}"
fi
```

## Monday, April 02, 2007

### New mailing list for utah python user group

Since I neglected to archive the old list when moving utahpython.org to a new server, the utah python user group has a new mailing list courtesy of Google Groups. (At least this way we're not dependent anymore on my incompetent sysadminning.)

## Thursday, March 29, 2007

### One thing I don't hate about Python

Sure, some things about Python bug me. But that's not what this is about. I wanted to react to Jacob Kaplan-Moss's gripes instead of promulgating my own. Specifically, his problem with Python's interfaces, or lack thereof.

I think I can keep this brief: interfaces are a hack that Java uses because Gosling et al thought multiple inheritance was too confusing and/or dangerous. (I believe I've read something recently where Gosling said that this was one decision he might do differently if he were re-designing Java now with the benefit of hindsight, but I can't find the source. Anyone remember seeing that?)

Python has MI. It doesn't need interfaces. I'm a little baffled that someone on the django core team would cite this as a problem with Python.

Jacob's precise objection is,

> I shouldn’t need to care about the difference between something that pretends to be a list and something that really is a list.

That's just it! You don't! But of course what Jacob really means is, "It should be easy to discover what methods a library expects to find on MY object that pretends to be a list." Which seems reasonable. And sure, good documentation is always welcome.

But when you cross the line to an Interface, at least the kind of Interface where Python itself would raise an error if I ignored the recommendation and left a method out (because I knew it wasn't necessary), that's bondage & discipline. That's not Python.

## Friday, March 02, 2007

### Introduction to Python at UGIC conference

I'll be giving a (very!) introductory Python workshop at the Utah Geographic Information Council conference in April. After my 90 minutes, Kevin Bell -- also of the utah python user group -- will present on specific GIS applications.

(Apparently Python is particularly big in GIS these days because one of the big vendors, ESRI, takes Python pretty seriously.)

## Saturday, February 24, 2007

### PyCon web frameworks panel notes

I represented Spyce for the web frameworks panel. It was pretty cool looking out at the standing-room-only crowd, even though let's face it, most people were not there because of Spyce. :)

James Bennett and Matt Harrison have notes posted online.

## Friday, February 23, 2007

### PyCon SqlSoup slides

I uploaded my slides for my Sunday SqlSoup talk, so they're linked in the schedule now. I also uploaded them here.

Update: the profile.py module I showed is here.

### PyCon open-space talk on Spyce after the web frameworks panel

I'll demonstrate writing a simple app with Spyce, including the most painless Ajax you have ever seen. Come check it out at 3:40 in the Bent Tree II room.

## Thursday, February 22, 2007

### PyCon SQLAlchemy tutorial slides

My SQLAlchemy tutorial went pretty well for the most part. It was a fast pace but most people kept up pretty well. If I did it again I would add more of an intro to ORM in general for people who had never used one, but over half the attendees had used SO or django's or tried SA already. I would also paste more code from my slides into the samples download to save people typing during the exercises (I had some, but I would do more next time).

I think most people liked it; the main exception was one fellow who was in way way over his head and visibly pissed about it. (I used a list comprehension at one point and he had no idea what it was.)

The slides are here. (The .py files referred to in the slides have also been moved to the jellis/ subdirectory.)

## Tuesday, February 20, 2007

### Spyce at PyCon

I'll be representing Spyce as a late addition to the Web Frameworks panel. I'm also planning a lightning talk on Ajax in Spyce 2.2 (which will be released as soon as I finish getting the docs in shape) and an open-space Introduction to Spyce.

See you there!

## Thursday, February 15, 2007

My wife got me a PSP for Valentine's Day, so I'm looking for videos to put on it. Since Google makes it so easy to get a PSP-compatible version, I thought I'd start with theirs... Recommendations?

## Wednesday, February 14, 2007

### SQLAlchemy slides

I presented on SQLAlchemy at the Utah python user group last Thursday; slides are linked here.

In retrospect, for a shorter presentation like this I should probably spend more time talking about the ORM features, and less about the SQL layer. Although the SQL layer is useful on its own, and essential for doing advanced mapping, I don't think it has the sex appeal that the ORM has.

(Although I do think the first part, about why ORMs should allow you to take advantage of your database's strengths rather than being limited to a MySQL 3 feature set, was useful.)

## Thursday, January 25, 2007

### Komodo 4 released; new free version

ActiveState has released Komodo IDE 4. Perhaps more interesting, if you're not already a Komodo user, is the release of Komodo Edit, which is very similar to the old Komodo IDE Personal edition, only instead of costing around $30, Komodo Edit is free. The mental difference between "free" and "$30" is much more than the relatively small amount of money; it will be interesting to see what happens in the IDE space now.

After a brief evaluation I would say Edit is perhaps the strongest contender for "best free python IDE." The only serious alternative is PyDev, which on its Eclipse foundation provides features like svn integration that Edit doesn't. PyDev also includes a debugger, another feature ActiveState would like to see you upgrade to the full IDE for. But Komodo is stronger in other areas such as call tips and, well, not being based on Eclipse. I also think its code completion is better, although this impression is preliminary.

It's also worth noting that so far, Edit doesn't sport the "Non-commercial and educational use only" restrictions that Komodo Personal had.


## Wednesday, January 17, 2007

### Caution: upgrading to new version of blogger may increase spam

I was pretty happy with the old version of blogger, but I upgraded today so I can use the new API against my own blog. So far I have 4 spam comments (captcha is still on) versus about that number for the entire life of my blog under the old blogger. Bleh.

Could just be a coincidence. I hope so.

(Update Feb 26: A month later, I've had just one more spam comment. So it probably really was just coincidence.)

## Monday, January 15, 2007

### Abstract of "Advanced PostgreSQL, part 1"

In December, Fujitsu made available a video of Gavin Sherry speaking on Advanced PostgreSQL. (Where's part 2, guys?) Here's some of the topics Gavin addresses, and the approximate point at which they can be found in the video.

[start] wal_buffers: "at least 64"; when it's ok to turn fsync off [not very often]; how hard disk rpm limits write-based transaction rate, even with WAL
00:12: wal_sync_method = fdatasync is worth checking out on Linux
00:13: FSM [free space map], MVCC, and vacuum; how to determine appropriate FSM size; why this is important to avoid VACUUM FULL
00:22: vacuum_cost_delay
00:26: background writer
00:30: history of buffer replacement strategies
00:37: scenarios where bgwriter is not useful
00:41: how random_page_cost affects planner's use of indexes
00:47: effective_cache_size
00:49: logging; how to configure syslog to not hose your performance
00:52: linux file system configuration
00:58: solaris fs config
1:02: raid; reliability; sata/scsi; battery-backed cache ("for $100, you can triple the write throughput of your system")
1:08: tablespaces
1:12: increasing pgsql_tmp performance for queries that exceed work_mem and how to tell if this is worth worrying about
1:15:40: cpu considerations
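For reference, several of the settings Gavin covers live in postgresql.conf. A sketch of what that looks like (values here are illustrative only, not recommendations; tune them for your own hardware and workload):

```
# postgresql.conf -- illustrative values, not recommendations
wal_buffers = 64             # "at least 64" per the talk
wal_sync_method = fdatasync  # worth benchmarking on Linux
vacuum_cost_delay = 10       # throttle vacuum I/O (milliseconds)
random_page_cost = 3.0       # lower it if your storage/cache makes random reads cheap
effective_cache_size = 1GB   # rough size of the OS cache available to postgres
```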

## Friday, January 12, 2007

### Why SQLAlchemy impresses me

One of the reasons ORM tools have a spotted reputation is that it's really, really easy to write a dumb ORM that works fine for simple queries but performs like molasses once you start throwing real data at it.

Let me give an example of a situation where, to my knowledge, only SQLAlchemy of the Python (or Ruby) ORMs is really able to handle things elegantly, without gross hacks like "piggybacking."

Often you'll see a one-to-many relationship where you're not always interested in all of the -many side. For instance, you might have a users table, each associated with many orders. In SA you'd first define the Table objects, then create a mapper that's responsible for doing The Right Thing when you write "user.orders."

(I'm skipping connecting to the database for the sake of brevity, but that's pretty simple. I'm also avoiding specifying columns for the Tables by assuming they're in the database already and telling SA to autoload them. Besides keeping this code shorter, that's the way I prefer to work in real projects.)

users = Table('users', metadata, autoload=True)
orders = Table('orders', metadata, autoload=True)

class User(object): pass
class Order(object): pass

mapper(User, users,
       properties={
           'orders': relation(mapper(Order, orders), order_by=orders.c.id),
       })


That "properties" dict says that you want your User class to provide an "orders" attribute, mapped to the orders table. If you are using a sane database, SQLAlchemy will automatically use the foreign keys it finds in the relation; you don't need to explicitly specify that it needs to join on "orders.user_id = user.id."

We can thus write

for user in session.query(User).select():
    print user.orders


So far this is nothing special: most ORMs can do this much. Most can also specify whether to do eager loading for the orders -- where all the data is pulled out via joins in the first select() -- or lazy loading, where orders are loaded via a separate query each time the attribute is accessed. Either of these can be "the right way" for performance, depending on the use case.
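To make the performance difference concrete, here's a toy sketch using plain sqlite3 rather than SQLAlchemy (table and column names are invented for the example): lazy loading issues one query per user on top of the initial select -- the classic N+1 pattern -- while eager loading pulls the same data in a single join.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, description TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 'book'), (2, 1, 'lamp'), (3, 2, 'mug');
""")

# Lazy: one query for the users, then one query per user (N+1 total)
lazy_queries = 1
for user_id, name in db.execute("SELECT user_id, name FROM users").fetchall():
    orders = db.execute("SELECT description FROM orders WHERE user_id = ?",
                        (user_id,)).fetchall()
    lazy_queries += 1

# Eager: everything in a single outer join
rows = db.execute("""
    SELECT u.name, o.description
    FROM users u LEFT OUTER JOIN orders o ON u.user_id = o.user_id
""").fetchall()

print(lazy_queries)  # 3 queries for 2 users
print(len(rows))     # 3 joined rows
```

Which strategy wins depends on how many of the related rows you actually touch, which is exactly why ORMs let you choose per-relation.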

The tricky part is, what if I want to generate a list of all users and the most recent order for each? The naive way is to write

class User(object):
    @property
    def max_order(self):
        return self.orders[-1]

for user in session.query(User).select():
    print user, user.max_order


This works, but it requires loading all the orders when we are really only interested in one. If we have a lot of orders, this can be painful.
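The SQL that does this efficiently is the classic greatest-row-per-group pattern, independent of any ORM. As a standalone sketch in plain sqlite3 (using a correlated-subquery variant of the same idea; table names are invented to match the post):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, description TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 'first'), (2, 2, 'only'), (3, 1, 'latest');
""")

# Join each user to the single order whose order_id is the max for that
# user_id -- one query, no per-user round trips.
rows = db.execute("""
    SELECT u.user_name, o.description
    FROM users u
    LEFT OUTER JOIN orders o ON o.order_id = (
        SELECT max(order_id) FROM orders WHERE user_id = u.user_id)
    ORDER BY u.user_id
""").fetchall()

print(rows)  # [('alice', 'latest'), ('bob', 'only')]
```

The rest of this post shows how to teach SQLAlchemy to generate a query of this shape through an ordinary relation.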

One solution in SA is to create a new relation that knows how to load just the most recent order. Our new mapper will look like this:

mapper(User, users,
       properties={
           'orders': relation(mapper(Order, orders), order_by=orders.c.id),
           'max_order': relation(mapper(Order, max_orders, non_primary=True),
                                 uselist=False, viewonly=True),
       })


("non_primary" means the second mapper does not define persistence for Orders; you can only have one primary mapper at a time. "viewonly" means you can't assign to this relation directly.)

Now we have to define "max_orders." To do this, we'll leverage SQLAlchemy's ability to map not just Tables, but any Selectable:

max_orders_by_user = select([func.max(orders.c.order_id).label('order_id')],
                            group_by=[orders.c.user_id]).alias('max_orders_by_user')
max_orders = orders.select(
    orders.c.order_id == max_orders_by_user.c.order_id).alias('max_orders')


"max_orders_by_user" is a subselect whose rows are the max order_id for each user_id. Then we use that to define max_orders as the entire order row joined to that subselect on user_id.

We could define this as eager-by-default in the mapper, but in this scenario we only want it eager on a per-query basis. That looks like this:

q = session.query(User).options(eagerload('max_order'))
for user in q.select():
    print user, user.max_order

For fun, here's the sql generated:
SELECT users.user_name AS users_user_name, users.user_id AS users_user_id,
anon_760c.order_id AS anon_760c_order_id, anon_760c.user_id AS anon_760c_user_id,
anon_760c.description AS anon_760c_description,
anon_760c.isopen AS anon_760c_isopen
FROM users LEFT OUTER JOIN (
SELECT orders.order_id AS order_id, orders.user_id AS user_id,
orders.description AS description, orders.isopen AS isopen
FROM orders, (
SELECT max(orders.order_id) AS order_id
FROM orders GROUP BY orders.user_id) AS max_orders_by_user
WHERE orders.order_id = max_orders_by_user.order_id) AS anon_760c
ON users.user_id = anon_760c.user_id
ORDER BY users.oid, anon_760c.oid


In SQLAlchemy, easy things are easy; hard things take some effort up-front, but once you have your relations defined, it's almost magical how it pulls complex queries together for you.

.................

I'm giving a tutorial on Advanced Databases with SQLAlchemy at PyCon in February. Feel free to let me know if there is anything you'd like me to cover specifically.

### MySQL backend performance

Vadim Tkachenko posted an interesting benchmark of the MyISAM vs InnoDB vs Falcon storage engines. (Falcon is the new backend that MySQL started developing after Oracle bought InnoDB.) For me the interesting part is not the part with the alpha code -- Falcon is competitive for some queries but gets absolutely crushed on others -- but how InnoDB is around 30% faster than MyISAM. And these are pure selects, supposedly where MyISAM is best.

Of course this is a small benchmark and YMMV, but this is encouraging to me because it suggests that if I ever have to use MySQL, I can use a backend with transactions, real foreign key support, etc., without sucking too badly performance-wise.

(It also suggests that people who responded to the post on postgresql crushing mysql in a different benchmark by saying, "well, if they wanted speed they should have used MyISAM," might want to reconsider their advice.)

## Wednesday, January 10, 2007

### Fun with three-valued logic

I thought I was pretty used to SQL's three-valued logic by now, but this still caused me a minute of scratching my head:

# select count(*) from _t;
count
-------
1306
(1 row)

# select count(*) from _t2;
count
-------
19497
(1 row)

Both _t and _t2 are temporary tables of a single column I created with SELECT DISTINCT.
# select count(*) from _t where userhash in (select userhash from _t2);
count
-------
982
(1 row)

# select count(*) from _t where userhash not in (select userhash from _t2);
count
-------
0
(1 row)


Hmm, 982 + 0 != 1306...

Turns out there was a null in _t2; X in {set containing null} evaluates to null, not false, and negating null still gives null. (The rule of thumb is, any operation on null is still null.)
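The same surprise is easy to reproduce in any SQL engine. A minimal sketch with sqlite3, mirroring the tables above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE _t (userhash TEXT);
    CREATE TABLE _t2 (userhash TEXT);
    INSERT INTO _t VALUES ('a'), ('b');
    INSERT INTO _t2 VALUES ('a'), (NULL);
""")

# 'b' NOT IN ('a', NULL) evaluates to NULL, not true, so 'b' is
# filtered out of BOTH queries -- the counts don't add up to 2.
in_count, = db.execute(
    "SELECT count(*) FROM _t WHERE userhash IN (SELECT userhash FROM _t2)").fetchone()
not_in_count, = db.execute(
    "SELECT count(*) FROM _t WHERE userhash NOT IN (SELECT userhash FROM _t2)").fetchone()

print(in_count, not_in_count)  # 1 0
```

The usual fix is to exclude nulls in the subquery (WHERE userhash IS NOT NULL) or to use NOT EXISTS instead.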

.................

I'm giving a tutorial on Advanced Databases with SQLAlchemy at PyCon in February. Feel free to let me know if there is anything you'd like me to cover specifically.

## Tuesday, January 02, 2007

### Good advice for Tortoise SVN users

My thinkpad R52's screen died a couple days ago. I decided that this time I was going to be a man and install Linux on my new machine: all our servers run Debian, and "apt-get install" is just so convenient vs manual package installation on Windows. And it looks like qemu is a good enough "poor man's vmware" that I could still test stuff in IE when necessary.

Alas, it was not to be. My new laptop is an HP dv9005, and although ubuntu's livecd mode ran fine, when it actually installed itself to the HDD and loaded X it did strange and colorful things to the LCD. Things that didn't resemble an actual desktop. When I told it to start in recovery mode instead it didn't even finish booting.

That was all the time I had to screw around, so I reinstalled Windows to start getting work done again. Which brings me (finally!) to this advice on tortoisesvn: it really puts teh snappy back in the tortoise. Thanks anonymous progblogger!