Sunday, December 30, 2007

Scala: first impressions

I'm reading the prerelease of the Scala book, since I'm working for a heavily Java-invested organization now and programming in Java feels like running a race with cement shoes. Politically, Jython doesn't seem like an option; Scala might be an easier sell.

Here are some of my impressions from going through the book.

Scala is a decent scripting language

[updated this section thanks to comments from anonymous, Jörn, and Eric.]

Here's how you loop through each line in a file. Python:

import sys
for line in open(sys.argv[1]):
    print line

Scala:

import scala.io.Source
Source.fromFile(args(0)).getLines.foreach(println)

Scala also has an interactive interpreter (a REPL), which is nice for experimenting.

This article has more examples of basic scripting in scala.

Getters and setters

One thing I despise in Java is the huge number of lines wasted on getFoo and setFoo methods. Sure, your IDE can autogenerate these for you, but they still take up lines in your editor, and it takes effort to mentally block them out when examining an unfamiliar class to determine what it does.

C# has Python-style properties, so in theory it could be virtually free of this kind of boilerplate, since there is no syntactic difference in "foo.x = y" whether x is a raw field or a property. So the right thing to do, which you'll see in Python code, is to use a raw public field until you actually need extra logic, at which point you replace it with a property and nobody's code breaks.
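
In Python terms, the progression looks like this (a minimal sketch; Account and the negative-balance rule are invented for illustration):

# Step 1: a plain public attribute.  Callers write account.balance = 100.
class Account(object):
    def __init__(self):
        self.balance = 0

# Step 2: logic is needed later, so swap in a property.
# Callers still write account.balance = 100; nothing breaks.
class Account(object):
    def __init__(self):
        self._balance = 0
    def _get_balance(self):
        return self._balance
    def _set_balance(self, value):
        if value < 0:
            raise ValueError('balance cannot go negative')
        self._balance = value
    balance = property(_get_balance, _set_balance)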

But C# wasn't far enough removed from Java culturally, so everyone writes boilerplate properties instead of boilerplate getters and setters. (I'm aware that in a compiled language like C# it makes sense to start with properties for classes in certain kinds of libraries, but those make up a vanishingly small fraction of actual classes in the wild.)

This is a long way of saying that Scala's properties make a lot of sense given the audience they are targeting (Java programmers). Scala vars are automatically turned into properties (or getters and setters, if you prefer to think in those terms). So even the most obstinate fan of boilerplate code has absolutely no reason to keep writing unnecessary getters and setters.

If you need to manually tweak your properties later, you can magically define a setter for "field" by naming a method "field_=". This seems a bit like a one-off hack rather than part of a well-designed system, but I can live with it. (Since parentheses are optional for a scala method taking no arguments, any no-argument method call is already syntactically indistinguishable from a raw field read.)

Scala doesn't know what it wants to be when it grows up, yet

Scala contains a lot of features from a lot of different influences, which means there are often competing styles to choose between; the young scala community hasn't yet decided which to emphasize. For instance, iteration vs traversal -- "for (item <- items)" vs "items.foreach" -- or even whether to leave out optional (inferred) type declarations.

Perhaps this is merely a failure of one book trying too hard to make scala be all things to all readers.

Random thoughts

  • Scala needs something like Python's enumerate function; if you want to loop over each object in a collection but you need its index too, you have to write a manual for loop. (Python's idiom is sketched after this list.)
  • Using parentheses for collection access (array(0), map("foo")) makes it unnecessarily difficult to tell whether you're looking at a method call or a collection access.
  • The scala stdlib is not very google-able yet; if you don't know exactly which class you're looking for ahead of time, you probably won't find it. For example, when looking for a scala object that could iterate through lines in a file, I correctly guessed the scala.io package, but would never have bothered looking at Source until more googling turned it up in a blog entry.
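
For reference, here's the Python idiom I'm missing (items being any collection):

for i, item in enumerate(items):
    print i, item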

Conclusion and disclaimer

This is long enough; I'll probably write more as I go through more of the book. Patterns and implicit conversions are particularly interesting.

I was attracted to scala while looking for a sane alternative to Java on the JVM and it looks like scala might "drag Java programmers halfway to Python," to paraphrase Guy Steele. And I'm happy to be corrected on anything I've written here.

Friday, December 28, 2007

Troubleshooting the ps3 wireless network connection, including error 80130128

My father got a ps3 for Christmas, but ran into some problems getting it on his wireless network. The first one was "connection error 80130128" after configuring it to use DHCP. I couldn't google anything useful about this; just a few other hapless victims asking if anyone had any ideas. Fortunately Dad had his laptop there too and noticed Windows complaining that two machines on the network were both using the same IP. So, over the phone, I walked him through setting up the ps3 with a static address:
  1. on his laptop, run -> cmd
  2. ipconfig
  3. Read the "gateway" ip. Put that into his browser to go to his router's admin page
  4. Find the DHCP settings for his router to see what range of IPs it hands out; pick one outside that range
  5. Set up the ps3 with that IP, the router IP as primary dns, and an opendns server as secondary
This made the connection test happy. But when he tried to go to the playstation store, it gave a DNS error, and repeating the connection test now failed too. "Well," I told him, "it's supposed to try both DNS servers. But we can try setting the primary DNS server to opendns as well." Once he did that, everything worked.

Friday, November 16, 2007

Reed-Solomon libraries

If you want to run a multi-petabyte storage system, you don't want to do it with Raid 5 or Raid 6. With modern disks' ~3% per year failure rate, 10,000 disks means about 300 failures a year, and the odds start to get pretty good (relatively speaking) that at some point you'll lose a third disk from an array while two are rebuilding, leaving you with permanent data loss. And of course monitoring and replacing disks in lots of small arrays is manpower-intensive, which to investors translates as "expensive."

You probably don't want to go with triplication, either; disks are cheap, but not so cheap that you want to triple your hardware costs unnecessarily. While storing multiple copies of frequently used data is good, all your data probably isn't "frequently used."

What is the solution? As it turns out, Raid is actually a special case of Reed-Solomon encoding, which lets you specify any degree of redundancy you want. You can be safer than triplication with a fraction of the space needed.

I was prompted to write this because Mozy open-sourced the Reed-Solomon library I used while I was there, librs, complete with Python bindings. The original librs we used at Mozy was written by Byron Clark, a formidable task. Later we switched to the version you see on sourceforge, based on Plank's original encoder. I wasn't involved with librs at all except to fix a couple reference leaks in the Python wrapper.

But if you're actually looking for an rs library to use, Alen Peacock, who is much more knowledgeable than I about the gory details involved here, tells me that if you are starting from scratch the two libraries you should evaluate are zfec, which also comes with Python bindings, and Jerasure which is an updated -- i.e., probably faster than his first -- encoder by Plank. (Jerasure has nothing to do with Java.)
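
To give a flavor, here's roughly what using zfec's Python bindings looks like. This is from memory, so treat it as a sketch and verify the signatures against the zfec docs:

import zfec

k, m = 3, 10  # any 3 of the 10 blocks reconstruct the original data
data_blocks = ['block one...', 'block two...', 'block three.']  # equal lengths
encoder = zfec.Encoder(k, m)
blocks = encoder.encode(data_blocks)  # k primary + (m - k) check blocks

# pretend we lost everything except blocks 0, 5, and 9
decoder = zfec.Decoder(k, m)
recovered = decoder.decode([blocks[0], blocks[5], blocks[9]], [0, 5, 9])
assert recovered == data_blocks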

Thursday, October 18, 2007

Utah Data Recovery

About three years ago (so pre-Mozy and definitely pre-Mac Mozy) my brother had his powerbook hard disk die. As in, not just mostly dead -- it would not power up. It had a lot of stuff on it that he didn't want to lose, but he felt like the usual suspects, who charge $1k to $2k for data recovery, were a rip-off. So he hung onto the disk in case a cheaper option came along.

Then just recently when I saw some people on a local linux group mailing list recommend utah data rescue I suggested to my brother that he give it a try. UTDR starts at "only" $300.

UTDR did indeed recover the data, although they charged $100 extra for this one. Mac fee? Tricky hw problem? I don't know. But it was still a lot cheaper than the other companies I googled for fixing a physically dead drive. (As opposed to a corrupt partition table or something where the hardware itself was okay.) At least, the ones that actually give you a price up front rather than hiding behind "request a quote!"

Tuesday, October 09, 2007

Semi-automatic software installation on HP-UX, with dependencies

I had to install subversion on a couple HP-UX boxes. Fortunately, there's an HP-UX software archive out there with precompiled versions of lots of software. Unfortunately, dependency resolution is like the bad old days of 1997: entirely manual. And there are fifteen or so dependencies for subversion.

So, I wrote a script to parse the dependencies and download the packages automatically. It requires Python -- which you can install from the archive with just the Python package and the Db package -- and BeautifulSoup, which you can google for. Usage is

hpuxinstaller <archive package url> <package name>
[e.g., hpuxinstaller http://hpux.cs.utah.edu/hppd/hpux/Development/Tools/subversion-1.4.4/ subversion]
[wait for packages to download]
gunzip *.gz
[paste in conveniently given swinstall commands]

Here is the script:

#!/usr/local/bin/python

import urlparse, urllib2, sys, os
from subprocess import Popen, PIPE
from BeautifulSoup import BeautifulSoup

required = {}
if not os.path.exists('cache'):
    os.mkdir('cache')

def getcachedpage(url):
    fname = 'cache/' + url.replace('/', '-')
    try:
        page = file(fname).read()
    except IOError:
        print 'fetching ' + url
        page = urllib2.urlopen(url).read()
        file(fname, 'wb').write(page)
    return page

def dependencies(url):
    scheme, netloc, _, _, _, _ = urlparse.urlparse(url)
    soup = BeautifulSoup(getcachedpage(url))
    text = soup.find('td', text='Run-time dependencies:')
    if not text:
        return
    tr = text.parent.parent
    td = tr.findAll('td')[1]
    for a in td.findAll('a'):
        yield (a.contents[0], '%s://%s%s' % (scheme, netloc, a['href']))

def add(name, url):
    required[name] = url
    for depname, depurl in dependencies(url):
        if depname in required:
            continue
        print "%s requires %s" % (name, depname)
        required[depname] = depurl
        add(depname, depurl)
        
def download(full_url):
    print 'downloading ' + full_url
    _, _, path, _, _, _ = urlparse.urlparse(full_url)
    fname = os.path.basename(path)
    f = file(fname, 'wb')
    def chunkify_to_eof(stream, chunksize=64*1024):
        while True:
            data = stream.read(chunksize)
            if not data:
                break
            yield data
    for chunk in chunkify_to_eof(urllib2.urlopen(full_url)):
        f.write(chunk)


# Compute dependencies before checking for installed files, since swinstall
# can let a package be installed w/o its dependencies. If there are such
# packages installed we don't want to skip their [missing] dependencies.
add(sys.argv[2], sys.argv[1])

try:
    p = Popen(['swlist'], stdout=PIPE)
except OSError:
    print 'Warning: unable to list installed packages'
    installed = set()
else:
    installed = set(line.strip().split()[0] for line in p.stdout if line.strip())

to_install = []
for name, url in required.iteritems():
    if name in installed:
        print name + ' is already installed'
        continue
    full_url = '%s%s-ia64-11.23.depot.gz' % (url.replace('/hppd/', '/ftp/'), url.split('/')[-2])
    to_install.append(os.path.basename(full_url))
    download(full_url)

if to_install:
    print "\nAfter gunzip, run:"
    for fname in to_install:
        print "swinstall -s %s/%s %s" % (os.getcwd(), fname[:-3], fname.split('-')[0])
else:
    print 'Everything is already installed'

Friday, October 05, 2007

Congratulations, Mozy

I left backup service provider Mozy about three months ago, and yesterday they were acquired by EMC, as techcrunch rumored earlier.

The cool thing about startups is they pretty much have to hire people who are totally not qualified, give them awesome things to do, and let them try. There's no way Amazon would have hired me to write S3, but that's what I did for Mozy.

Mozy was the third startup I've been a part of, and the first to amount to anything. I was employee #3 and saw it grow from sharing a single rented office to 50 employees in two years. With people who didn't think it was strange to wear a tie to work. Trippy.

Unfortunately I'm not there to witness the final stage of being assimilated by the Borg firsthand, but I hear that's not really any more fun than it sounds, so perhaps it's just as well.

Nice work, guys.

Tuesday, October 02, 2007

Wing IDE 3, Wing IDE 101 released

Wing IDE version 3 has been released.

The list of new features is a little underwhelming. Multi-threaded debugging and the unit testing tool (only supporting unittest -- does anyone still use that old module?) are nice, but I don't see myself paying to upgrade from 2.1 yet. Now if they could get the GUI to keep up with my typing on Windows, I'd pay for that... I guess this is a sign that Python IDEs are nearing maturity; Komodo 4 didn't have any earth-shaking new features either, at least as far as Python was concerned.

(Personally I think someone should start supporting django/genshi/mako templates already. Maybe in 3.1, guys?)

Following ActiveState's lead, Wingware has also released a completely free version, Wing IDE 101. The main difference is in what each leaves out as an incentive to upgrade: Komodo Edit omits the debugger, while Wing IDE 101 includes the debugger but omits code completion. Wingware also continues to offer the low-cost Personal edition.

But the really big difference between Wing IDE 101 and Komodo Edit is that you can freely use Komodo Edit for paying work. Wing IDE 101, like Wing IDE Personal, has a no-commercial-use clause. (Komodo versions compared; Wing versions compared.) I'm still of the opinion that at $180, Wing Professional will pay for itself in short order, but for the hobbyist, Komodo Edit is very compelling. I've been using it myself for TCL and XML editing for several months now and it's a nice little IDE.

Too bad Komodo's emacs bindings continue to suck balls -- I mean, it's one thing to not implement fancy things like a minibuffer or kill ring, but if you can't even get C-W (cut) right, there's not much hope. Users contributed much-improved Emacs bindings to the ActiveState bug tracker way back in the version 3 timeframe. I guess ActiveState just doesn't care.

Friday, September 21, 2007

That wasn't the pigeonhole I expected

I went to the BYU CS alumni dinner tonight. At one point they briefly put everyone's name and position on a projector, one at a time. (At five seconds apiece it wasn't as tedious as it sounds.)

When it was my turn, it announced "Jonathan Ellis, System Administrator."

What the hell?

It turns out that when I RSVP'd I said I was a "python kung-fu master & sysadmin of last resort." (In the sense that, if you really can't find a better sysadmin, I know enough to be dangerous.)

Don't bother trying to be clever around bureaucrats.

Saturday, September 08, 2007

Utah Open Source Conference 2007

The first Utah Open Source Conference finished today. I heard that they had close to 300 attendees -- not bad at all for a freshman effort.

I reprised presentations that I've given before, on SQLAlchemy and distributed source control. My slides are on the presentations page (although if you've seen my slides from either before, there's not much new there -- I got lucky, SA 0.4 isn't stable yet so I stuck with 0.3.10).

I had to work Friday so I missed a lot of presentations, but of the ones I saw my favorite was on Ganglia, which I hadn't heard of before but which looks quite useful for anyone running a bunch of servers who takes uptime and qos seriously. (This was actually Brad Nicholes's third presentation of the conference -- he must have been busy!)

Afterwards I went to the board games BoF and played Mag Blast. Fun little game.

Wednesday, September 05, 2007

What it means to "know Python"

Since Adam Barr replied to my post on his book, I'd like to elaborate a little on what I said.

Adam wrote,

[F]or me, "knowing" Python means you understand how slices work, the difference between a list and a tuple, the syntax for defining a dictionary, that indenting thing you do for blocks, and all that. It's not about knowing that there is a sort() function.

In Python, reinventing sort and split is like a C programmer starting a project by writing his own malloc. It just isn't something you see very often. Similarly, I just don't think you can credibly argue that a C programmer who doesn't know how to use malloc really knows C. At some level, libraries do matter.
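
To make that concrete, here's the reinvented version next to the idiomatic one (a contrived sketch; line is any string):

# Reinventing split, C-style:
words = []
word = ''
for ch in line:
    if ch == ' ':
        if word:
            words.append(word)
        word = ''
    else:
        word += ch
if word:
    words.append(word)

# Idiomatic Python:
words = line.split()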

On the other hand, I wouldn't claim that you must know all eleventy jillion methods that the Java library exposes in one way or another to say you know Java.

What is the middle ground here?

I think the answer is something along the lines of, "you have to get enough practice actually using the language to be able to write idiomatic code." That's necessarily going to involve picking up some library knowledge along the way.

This made me think: what are the most commonly used Python modules? I decided to scan the Python Cookbook's code base and find out. This is a fairly large sample (over 2000 recipes), and further attractive in that most of the scripts there are reasonably standalone, so they're not full of imports of non-standard modules. The downside is that some of the code dates back at least to the ancient Python 1.5.

In 2000+ source files and almost 4000 imports of stdlib modules, here are the frequency counts of imported modules.

Is this a reasonable list? I obviously think I qualify as knowing Python well enough to blog about it. Of the modules above the 80% line, _winreg, win32con, and win32api are platform-specific; new is deprecated, string isn't officially deprecated but should be, and __future__ isn't really a module per se. I believe I've used all of the rest but xmlrpclib at some point, although my line of comfort-without-docs would be only about the 60% mark. I think anyone who programs professionally will quickly get to knowing well at least the modules up to the 50% line.

module            count
------            -----
sys                473
os                 302
                          [24% of all stdlib imports, cumulative]
time               210
re                 145
                          [35%]
string             140
random             103
threading           66
socket              57
os.path             52
types               50
Tkinter             47
                          [50%]
math                43
win32com.client     42
__future__          41
traceback           40
itertools           38
doctest             37
urllib              35
cStringIO           33
struct              32
                          [60%]
win32api            31
getopt              29
thread              29
ctypes              28
StringIO            28
inspect             26
win32con            25
copy                25
cPickle             25
operator            24
datetime            23
cgi                 22
                          [70%]
Queue               22
urllib2             20
md5                 20
base64              20
xmlrpclib           19
sets                19
optparse            19
logging             18
weakref             18
shutil              17
unittest            17
pprint              16
urlparse            15
getpass             15
httplib             15
pickle              15
_winreg             14
UserDict            13
signal              13
                          [80%]

For those interested, a tarball of the recipes I scanned is here, so you don't need to scrape the Cookbook site yourself. The import scanning code is simple enough:

import os, re, compiler
from collections import defaultdict

# define an AST visitor that only cares about "import" and "from [x import y]" nodes
count_by_module = defaultdict(lambda: 0)
class ImportVisitor:
    def visitImport(self, t):
        for m in t.names:
            if not isinstance(m, basestring):
                m = m[0] # strip off "as" part
            count_by_module[m] += 1
    def visitFrom(self, t):
        count_by_module[t.modname] += 1

# parse
for fname in os.listdir('recipes'):
    try:
        ast = compiler.parseFile('recipes/%s' % fname)
    except SyntaxError:
        continue
    compiler.walk(ast, ImportVisitor())
    print 'parsed ' + fname

# some raw stats, for posterity
counts = count_by_module.items()
total = sum(n for module, n in counts)
print '%d/%d total/unique imports' % (total, len(counts))

# strip out non-stdlib modules
for module in count_by_module.keys():
    try:
        __import__(module)
    except (ImportError, ValueError):
        del count_by_module[module]
        
# post-stripped stats
counts = count_by_module.items()
total = sum(n for module, n in counts)
print '%d/%d total/unique imports in stdlib' % (total, len(counts))
counts.sort(key=lambda (module, n): n)

# results
subtotal = 0
for module, n in reversed(counts):
    subtotal += n
    print '%s\t%d' % (module, n)
    print '%f' % (float(subtotal) / total)

Tuesday, September 04, 2007

Merging two subversion repositories

Update: an anonymous commenter pointed out that yes, there is a (much!) better way to do this with svnadmin load --parent-dir, which is covered in the docs under "repository migration." All I can say in my defense is that it wasn't something google thought pertinent. So, for google's benefit: how to merge subversion repositories. Thanks for the pointer, anonymous!
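
For the googlers, the supported approach looks roughly like this (paths invented; untested by me, so check the svn book first):

svnadmin dump /path/to/repos-a > a.dump
svn mkdir -m "make room for team A" file:///path/to/repos-b/team-a
svnadmin load --parent-dir team-a /path/to/repos-b < a.dump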

I needed to merge team A's svn repository into team B's. I wanted to preserve the history of team A's commits as much as reasonably possible. I would have thought that someone had written a tool to do this, but I couldn't find one, so I wrote this. (Of course, now that I'm posting this, I fully expect someone to point me to a better, pre-existing implementation that I missed.)

The approach is to take a working copy of repository B, add a directory for A's code, and for each revision in A's repository, apply that to the working copy and commit it. This would be easy if svn merge would allow applying diffs from repository A into a working copy of repository B, but it does not. I can't think of a technical reason for this. (In fact, I seem to remember that early versions of the svn client did allow this, with dire warnings, but I could be mistaken and I don't have a 1.1 client around anymore.)

So I tried instead to use "svn diff |patch -p0", which worked great up until the first commit with a binary file. Oops. For the final version I ended up having to create a working copy for A, update to each revision there, then rsync to the right point in working copy B and call the "svnaddremove" script to mark files added or deleted. (This is suboptimal since we can get the exact changed paths from svn, and just copy those files over, but rsync is fast enough as long as your working copies stay in cache. The update and commit steps both consistently took longer than rsync in my timing.)

My script does not try to be intelligent about copies or moves that svn knows about. Team A did not use branches or tags much, so I didn't put in the effort to deal with those the "right" way (which would be to also issue a cp/mv on B's working copy to preserve history). It also commits each revision as the unix user with the same name as the original author; doing this obviously requires at least access to the repository server to add the right users. I used "svn log -q |grep ^r |awk '{print $3}' |sort |uniq |xargs -n 1 useradd".

Final note: the perl script in svnaddremove is a long way of writing "awk {print $2}", except that it preserves filenames with spaces in them. There is probably a much more clever way of doing this, too.

Here, then, is the merge script:

#!/usr/bin/python

# usage: svnimport <source wc path> <target wc path> <revstart> <revend>
# e.g. svnimport liberte-source trunk/liberte 1 2000

from subprocess import Popen, PIPE
try:
    from xml.etree import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET
import sys, time

def system(*args):
    p = Popen(args, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    if err:
        raise Exception(err)
    return out

# super-minimal log scraper
# for a better one see hgsvn's svnclient.py, http://cheeseshop.python.org/pypi/hgsvn
def parse_date(svn_date):
    date = svn_date.split('.', 2)[0]
    return time.strftime("%Y-%m-%d", time.strptime(date, "%Y-%m-%dT%H:%M:%S"))
def parse_svn_log_xml(xml):
    tree = ET.fromstring(xml)
    for entry in tree.findall('logentry'):
        d = {}
        d['revision'] = int(entry.get('revision'))
        author = entry.find('author')
        d['author'] = author is not None and author.text or None
        d['message'] = entry.find('msg').text or ""
        d['date'] = parse_date(entry.find('date').text)
        yield d
        
def edited_message(entry):
    msg = entry['message'].strip().replace('\r\n', '\n')
    addendum = '[original revision %s committed %s]' % (entry['revision'], entry['date'])
    if msg:
        return msg + '\n' + addendum
    return addendum

sourcepath, targetpath, revstart, revstop = sys.argv[1:]
# rsync foo bar and rsync foo/ bar/ are very different!
if not sourcepath.endswith('/'):
    sourcepath += '/'
if not targetpath.endswith('/'):
    targetpath += '/'

xml = system('svn', 'log', sourcepath, '--xml', '-r', '%s:%s' % (int(revstart) + 1, revstop))
for entry in parse_svn_log_xml(xml):
    revno = entry['revision']
    print 'merging revision %d by %s' % (revno, entry['author'])
    # merge in the revision
    system('svn', 'up', '-r', str(revno), sourcepath)
    print '\trsync'
    system('rsync', '-a', '--exclude=.svn', '--delete', sourcepath, targetpath)
    system('/tmp/svnaddremove', targetpath) # svn should have a built-in addremove; hg does
    # commit as the correct author, if available
    author = entry['author']
    print '\tchown'
    system('chown', '-R', author, targetpath)
    quoted_message = edited_message(entry).replace('"', "'")
    print '\tci'
    system('su', author, '-c', 'svn ci %s -m "%s"' % (targetpath, quoted_message))

And here is svnaddremove:

#!/bin/bash

# odd, xargs is invoking svn add/rm w/ no args when grep returns no lines.
# fix that with the ifs.
# (don't use grep -q or svn gets pissed about broken pipe.)

if svn st $1 | grep ^\? > /dev/null; then
  svn st $1 | perl -ne 'chomp; @Fld = split(q{ }, $_, -1); if (/^\?/) { shift @Fld; print join(q{ }, @Fld) . "\n"; }' | xargs -n 1 -i svn add "{}"
fi
if svn st $1 | grep ^\! > /dev/null; then
  svn st $1 | perl -ne 'chomp; @Fld = split(q{ }, $_, -1); if (/^\!/) { shift @Fld; print join(q{ }, @Fld) . "\n"; }' | xargs -n 1 -i svn rm "{}"
fi

Monday, July 30, 2007

A brief reaction to "Find the Bug"

I picked up a copy of Adam Barr's Find the Bug, which is a cool concept for a book. (5 languages, 50 programs, 50 bugs; see if you can spot them.)

I found the bug in the first program, in C, then skipped to the Python chapter. The first two programs were not too bad, as pedagogical exercises go (although the second's iterating through substrings instead of using a.startswith(b) was painful). The third, though, was "Alphabetize words," 25 sloc to perform the equivalent of

def alphabetize(buffer):
  L = buffer.split(' ')
  L.sort()
  return L

... doing everything in about the hardest way possible.

Now, it's pretty hard to introduce a non-obvious bug into my version of this function, so it wouldn't be appropriate for Mr. Barr's book when written this way. But the right thing to do is to make the task more difficult, not dumb Python down to the level of C! It's very very painful to read Python written like that.

(Actually it's painful to read any language written at such a low level of expressivity, which is why I prefer not to use languages that really can't do any better.)

Monday, July 23, 2007

Final version of OSCON SQLAlchemy slides

http://utahpython.org/jellis/sa-tutorial-oscon.pdf

Also the code snippets:

This is what I'll be using in my tutorial tomorrow.

Update: I forgot to "svn up" on my web server. So now the final version is up.

Monday, July 09, 2007

PEP rss feed is live

After I complained that python.org could use a PEP rss feed, David Goodger invited me to volunteer to write one. So I did. (With Martin v. Löwis doing the integration with the site build script. Thanks Martin!)

The feed is live at http://www.python.org/dev/peps/peps.rss.

Saturday, June 16, 2007

Opera 9.2 is a pretty good browser

I've been trying Opera 9.2 for a week, and I'm pleased with it enough that it's going to continue to be my main browser. The main selling points for me are

  • MDI weirdness is mostly hidden now, I hated earlier Opera UIs
  • 20-30% less memory use; even after poking about in the guts of about:config to force FF's memory cache to the same 10MB that I gave Opera (which exposes this option right in the UI), Opera consistently uses less memory for the same workload. (Without adding this option to FF, it would max out around 400MB instead of 150MB.)
  • feels snappier; opera seems quicker to start rendering something useful on slow-loading sites like 1up.com, although total render time is about the same. It's also instantaneous to open a new tab, which consistently takes around 1s on FF after I've been using it a while. I open and close tabs frequently.
  • UI takes up less space: I know it's possible to re-skin FF, but I'd have to google it to find out how. Opera makes it easy. I'm using the Fresh skin for Opera which condenses the File menu and nav bar to about half as much height as FF uses.
  • built-in AutoFill, so I save even more space by not needing Google Toolbar
  • javascript/DOM support is finally close enough to FF that most sites don't have to specifically code for Opera. Heavy ajax use is an exception of course; I still have to use FF for Google Docs. (Gmail and Maps work fine though, probably due to some effort by Google.)
  • Download manager goes in another tab by default instead of a separate window. I didn't realize how much FF's behavior annoyed me before.

Downsides:

  • Occasional rendering problems, such as when customizing a laptop on hp.com. Blogger.com doesn't redirect to my dashboard when I'm already logged in to my google account.
  • Very very slow navigating large pages, such as a slashdot comment thread with 400 comments. Isearch is even slower on large pages and can lock up the UI for minutes if you invoke it injudiciously.
  • Sometimes ignores a site's instructions to not cache a dynamic page
  • hard to tell which tab is active in the default theme. (Fresh fixes this.)

Thursday, June 14, 2007

A workaround for the sys.excepthook bug

About two years ago I reported the bug sys.excepthook doesn't work in threads. Then just recently someone asked in #utahpython if I had a workaround. Here it is (also added as a comment to the bug report) -- all we do is monkeypatch Thread.run to run the excepthook manually if there is an uncaught exception:
def install_thread_excepthook():
    """
    Workaround for sys.excepthook thread bug
    (https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1230540&group_id=5470).
    Call once from __main__ before creating any threads.
    If using psyco, call psyco.cannotcompile(threading.Thread.run)
    since this replaces a new-style class method.
    """
    import sys, threading
    run_old = threading.Thread.run
    def run(*args, **kwargs):
        try:
            run_old(*args, **kwargs)
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            sys.excepthook(*sys.exc_info())
    threading.Thread.run = run
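
A hypothetical usage sketch (the hook and thread names here are made up):

import sys, threading

def my_excepthook(exc_type, exc_value, exc_tb):
    print 'uncaught exception:', exc_value  # or log it properly

sys.excepthook = my_excepthook
install_thread_excepthook()

def boom():
    raise RuntimeError('from a thread')

t = threading.Thread(target=boom)
t.start()
t.join()  # 'uncaught exception: from a thread' is printed via our hook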

Thursday, May 31, 2007

How DOS 1.0 cost me an hour of scratching my head

A couple months ago, I migrated my text rpg Carnage Blender to a new server, with Ubuntu 6.06 on the new box.

For an unknown reason, ftstrpnm on the new box wouldn't generate the pngs I used in my captchas. It was easier to check the images from the old machine into my svn repository than to debug this, so I did.

The downside was that my working copy on my Windows laptop stopped being able to update from the repository. It would get to "words/con.png," and error out. Google, for once, didn't turn up anything useful.

Today I got motivated. I tried all kinds of ways to get this to work. A new checkout had the same problem on Windows, but on Linux worked fine. The svn command line client for windows didn't work any better than Tortoise -- instead of "Error: Can't open file '...words\.svn\text-base\con.png.svn-base': Access is denied", it barfed con.png to stdout, and died. This was a clue, but I didn't realize that until later.

Puzzled, I tried scp-ing con.png directly. No dice. Maybe it was a problem with my ssh server or client instead of subversion. So I tarred up my Linux working copy and untarred on Windows. Still it crapped out on con.png. I gzipped just con.png and tried to scp that over. That didn't work either.

I started experimenting with the filename itself. I could scp the .gz just fine if I renamed it to c.png.gz first. But "touch con.png" on windows failed, as did "touch con.txt." Finally I googled [windows filenames con] and found the answer at the top of the results, unfortunately from before I started reading oldnewthing or it might have jogged my memory: CON (like PRN, AUX, and friends) has been a reserved device name since DOS 1.0, and Windows still refuses to create a file by that name, extension or no.

It's time for python development to open up a little

I found out from Brett Cannon's blog that an abstract base class (ABC) PEP has been accepted.

I don't like this PEP. It's a very big (and more importantly, inelegant) change to Python's style. But my real complaint is that as big as this change is, and as much as I try to stay current with Python (I subscribe to 30+ blogs), I didn't have a chance to get involved in the discussion until after the PEP was already approved.

Python is big enough now that there should be some mechanism for feedback from the community before the priesthood of python-dev writes something in stone. Currently, if you want to know about PEPs before they are approved, you have to subscribe to both python-dev and python-3000 (which isn't linked from either the mailing lists page or the dev page, btw). I really don't care about the vast majority of these lists' traffic but PEPs, at least some of them, are important.

If the python-dev summaries ever got updated this might be a potential solution, but even at their best I don't remember them ever getting closer than a month behind or so. And two weeks is probably too coarse-grained anyway.

I think what python.org really needs is a PEP rss feed. A friend thought that they already had one, but neither he nor I could find it. So if it exists, it's well-hidden. If it doesn't exist, it should. Please?

(And if it's easier for whoever's in charge of such things to give me access to the server and repository than to do it himself, then yes, I'm volunteering.)

Wednesday, April 18, 2007

Best Python book for beginners

It's really surprisingly difficult for someone who has been programming for a long time to write about programming at a level appropriate for real beginners. The first time I taught a class full of beginners at Neumont, I tried to take things as slow as possible. Then I spent the next week covering the material from the first day even slower.

So when the UGIC asked me to recommend a book for the participants in the Introduction to Python workshop, I looked at all the ones I could find, but they all either assumed too much existing knowledge or covered material that would just confuse a beginner. Often both. But then Michael Bernstein pointed me to Python for Dummies.

If you're looking to teach beginners, or you're a beginner yourself, Python for Dummies is by far the best option. There are a few sections that are strikingly inappropriate for a book at its level (new-style classes!?), but it's still much, much better than any of the other books on the market in this respect. As a bonus, it's also one of the few that covers Python 2.5.

Introduction to Python slides

Here are the slides from my introduction to python at the UGIC conference today.

This presentation was meant for people with little to no programming experience. So I deliberately kept it pretty basic, and in fact in 90 minutes we only covered up to about slide 20 in the pdf. I also added an exercise before moving on to slide 10. ("Read 3 integers into a list, and print the sum.")
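
(One possible solution, at the level the slides cover:)

numbers = []
for i in range(3):
    numbers.append(int(raw_input('enter an integer: ')))
print sum(numbers)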

There were about 17 people there (the room's capacity), so it was very nice to have Kevin Bell also answering questions individually during the exercises.

Monday, April 16, 2007

Mercurial presentation slides

Thursday I presented on distributed source control and Mercurial to the utah python ug. Here are my slides.

Then on Friday, Mozilla announced that they're moving from CVS to Mercurial, joining OpenSolaris and Xen and others on hg.

It's exciting to see what is still a small and elegant tool gain traction like this, even though in some ways hg (and dscm in general really) is still in the early adopter stage.

Tuesday, April 10, 2007

Mozy code deathmatch

My employer, the creator of Mozy, is running a programming contest this Saturday. 9 languages are allowed. The first 2 rounds are online; the finals are in American Fork (Utah), but if you make it that far you're guaranteed to win some money.

(We did this last year too; this year the prize money is doubled to $20k. Not to mention how we are super-experienced contest organizers now!)

Monday, April 02, 2007

New mailing list for utah python user group

Since I neglected to archive the old list when moving utahpython.org to a new server, the utah python user group has a new mailing list courtesy of Google Groups. (At least this way we're not dependent anymore on my incompetent sysadminning.)

Thursday, March 29, 2007

One thing I don't hate about Python

Sure, some things about Python bug me. But that's not what this is about. I wanted to react to Jacob Kaplan-Moss's gripes instead of promulgating my own. Specifically, his problem with Python's interfaces, or lack thereof.

I think I can keep this brief: interfaces are a hack that Java uses because Gosling et al thought multiple inheritance was too confusing and/or dangerous. (I believe I've read something recently where Gosling said that this was one decision he might do differently if he were re-designing Java now with the benefit of hindsight, but I can't find the source. Anyone remember seeing that?)

Python has MI. It doesn't need interfaces. I'm a little baffled that someone on the django core team would cite this as a problem with Python.

Jacob's precise objection is,

I shouldn’t need to care about the difference between something that pretends to be a list and something that really is a list.

That's just it! You don't! But of course what Jacob really means is, "It should be easy to discover what methods a library expects to find on MY object that pretends to be a list." Which seems reasonable. And sure, good documentation is always welcome.

But when you cross the line to an Interface, at least the kind of Interface where Python itself would raise an error if I ignored the recommendation and left a method out (because I knew it wasn't necessary), that's bondage & discipline. That's not Python.

Friday, March 02, 2007

Introduction to Python at UGIC conference

I'll be giving a (very!) introductory Python workshop at the Utah Geographic Information Council conference in April. After my 90 minutes, Kevin Bell -- also of the utah python user group -- will present on specific GIS applications.

(Apparently Python is particularly big in GIS these days because one of the big vendors, ESRI, takes Python pretty seriously.)

Saturday, February 24, 2007

PyCon web frameworks panel notes

I represented Spyce for the web frameworks panel. It was pretty cool looking out at the standing-room-only crowd, even though let's face it, most people were not there because of Spyce. :)

James Bennett and Matt Harrison have notes posted online.

Friday, February 23, 2007

PyCon SqlSoup slides

I uploaded my slides for my Sunday SqlSoup talk, so they're linked in the schedule now. I also uploaded them here.

Update: the profile.py module I showed is here.

PyCon open-space talk on Spyce after the web frameworks panel

I'll demonstrate writing a simple app with Spyce, including the most painless Ajax you have ever seen. Come check it out at 3:40 in the Bent Tree II room.

Thursday, February 22, 2007

PyCon SQLAlchemy tutorial slides

My SQLAlchemy tutorial went pretty well for the most part. The pace was fast, but most people kept up. If I did it again I would add more of an intro to ORMs in general for people who have never used one, although over half the attendees had used SO or django's or had tried SA already. I would also paste more code from my slides into the samples download, to save people typing during the exercises (I had some, but I would do more next time).

I think most people liked it; the main exception was one fellow who was in way way over his head and visibly pissed about it. (I used a list comprehension at one point and he had no idea what it was.)

The slides are here. (The .py files referred to in the slides have also been moved to the jellis/ subdirectory.)

Tuesday, February 20, 2007

Spyce at PyCon

I'll be representing Spyce as a late addition to the Web Frameworks panel. I'm also planning a lightning talk on Ajax in Spyce 2.2 (which will be released as soon as I finish getting the docs in shape) and an open-space Introduction to Spyce.

See you there!

Thursday, February 15, 2007

Best Google Tech Talks?

My wife got me a PSP for Valentine's Day, so I'm looking for videos to put on it. Since Google makes it so easy to get a PSP-compatible version, I thought I'd start with theirs... Recommendations?

Wednesday, February 14, 2007

SQLAlchemy slides

I presented on SQLAlchemy at the Utah python user group last Thursday; slides are linked here.

In retrospect, for a shorter presentation like this I should probably spend more time talking about the ORM features, and less about the SQL layer. Although the SQL layer is useful on its own, and essential for doing advanced mapping, I don't think it has the sex appeal that the ORM has.

(Although I do think the first part, about why ORMs should allow you to take advantage of your database's strengths rather than being limited to a MySQL 3 feature set, was useful.)

Thursday, January 25, 2007

Komodo 4 released; new free version

ActiveState has released Komodo IDE 4. Perhaps more interesting, if you're not already a Komodo user, is the release of Komodo Edit, which is very similar to the old Komodo IDE Personal edition, only instead of costing around $30, Komodo Edit is free. The mental difference between "free" and "$30" is much more than the relatively small amount of money; it will be interesting to see what happens in the IDE space now.

After a brief evaluation I would say Edit is perhaps the strongest contender for "best free python IDE." The only serious alternative is PyDev, which on its Eclipse foundation provides features like svn integration that Edit doesn't. PyDev also includes a debugger, another feature ActiveState would like to see you upgrade to the full IDE for. But Komodo is stronger in other areas such as call tips and, well, not being based on Eclipse. I also think its code completion is better, although this impression is preliminary.

It's also worth noting that so far, Edit doesn't sport the "Non-commercial and educational use only" restrictions that Komodo Personal had.

Monday, January 22, 2007

Wednesday, January 17, 2007

Caution: upgrading to new version of blogger may increase spam

I was pretty happy with the old version of blogger, but I upgraded today so I can use the new API against my own blog. So far I have 4 spam comments (captcha is still on) versus about that number for the entire life of my blog under the old blogger. Bleh.

Could just be a coincidence. I hope so.

(Update Feb 26: A month later, I've had just one more spam comment. So it probably really was just coincidence.)

Monday, January 15, 2007

Abstract of "Advanced PostgreSQL, part 1"

In December, Fujitsu made available a video of Gavin Sherry speaking on Advanced PostgreSQL. (Where's part 2, guys?) Here are some of the topics Gavin addresses, and the approximate point in the video where each can be found.

[start] -- wal_buffers: "at least 64"; when it's ok to turn fsync off [not very often]; how hard disk rpm limits write-based transaction rate, even with WAL
00:12 -- wal_sync_method = fdatasync is worth checking out on Linux
00:13 -- FSM [free space map], MVCC, and vacuum; how to determine appropriate FSM size; why this is important to avoid VACUUM FULL
00:22 -- vacuum_cost_delay
00:26 -- background writer
00:30 -- history of buffer replacement strategies
00:37 -- scenarios where bgwriter is not useful
00:41 -- how random_page_cost affects the planner's use of indexes
00:47 -- effective_cache_size
00:49 -- logging; how to configure syslog to not hose your performance
00:52 -- linux file system configuration
00:58 -- solaris fs config
1:02 -- raid; reliability; sata/scsi; battery-backed cache ("for $100, you can triple the write throughput of your system")
1:08 -- tablespaces
1:12 -- increasing pgsql_tmp performance for queries that exceed work_mem, and how to tell if this is worth worrying about
1:15:40 -- cpu considerations

Friday, January 12, 2007

Why SQLAlchemy impresses me

One of the reasons ORM tools have a spotted reputation is that it's really, really easy to write a dumb ORM that works fine for simple queries but performs like molasses once you start throwing real data at it.

Let me give an example of a situation where, to my knowledge, only SQLAlchemy of the Python (or Ruby) ORMs is really able to handle things elegantly, without gross hacks like "piggy backing."

Often you'll see a one-to-many relationship where you're not always interested in all of the -many side. For instance, you might have a users table, with each user associated with many orders. In SA you'd first define the Table objects, then create a mapper that's responsible for doing The Right Thing when you write "user.orders."

(I'm skipping connecting to the database for the sake of brevity, but that's pretty simple. I'm also avoiding specifying columns for the Tables by assuming they're in the database already and telling SA to autoload them. Besides keeping this code shorter, that's the way I prefer to work in real projects.)

users = Table('users', metadata, autoload=True)
orders = Table('orders', metadata, autoload=True)

class User(object): pass
class Order(object): pass

mapper(User, users, 
       properties={
           'orders':relation(mapper(Order, orders), order_by=orders.c.id),
       })

That "properties" dict says that you want your User class to provide an "orders" attribute, mapped to the orders table. If you are using a sane database, SQLAlchemy will automatically use the foreign keys it finds in the relation; you don't need to explicitly specify that it needs to join on "orders.user_id = user.id."

We can thus write

for user in session.query(User).select():
    print user.orders

So far this is nothing special: most ORMs can do this much. Most can also specify whether to do eager loading for the orders -- where all the data is pulled out via joins in the first select() -- or lazy loading, where orders are loaded via a separate query each time the attribute is accessed. Either of these can be "the right way" for performance, depending on the use case.

The tricky part is, what if I want to generate a list of all users and the most recent order for each? The naive way is to write

class User:
    @property
    def max_order(self):
        return self.orders[-1]

for user in session.query(User).select():
    print user, user.max_order

This works, but it requires loading all the orders when we are really only interested in one. If we have a lot of orders, this can be painful.

One solution in SA is to create a new relation that knows how to load just the most recent order. Our new mapper will look like this:

mapper(User, users, 
       properties={
           'orders':relation(mapper(Order, orders), order_by=orders.c.id),
           'max_order':relation(mapper(Order, max_orders, non_primary=True), uselist=False, viewonly=True),
       })

("non_primary" means the second mapper does not define persistence for Orders; you can only have one primary mapper at a time. "viewonly" means you can't assign to this relation directly.)

Now we have to define "max_orders." To do this, we'll leverage SQLAlchemy's ability to map not just Tables, but any Selectable:

max_orders_by_user = select([func.max(orders.c.order_id).label('order_id')],
                            group_by=[orders.c.user_id]).alias('max_orders_by_user')
max_orders = orders.select(orders.c.order_id==max_orders_by_user.c.order_id).alias('max_orders')

"max_orders_by_user" is a subselect whose rows are the max order_id for each user_id. Then we use that to define max_orders as the entire order row joined to that subselect on user_id.

We could define this as eager-by-default in the mapper, but in this scenario we only want it eager on a per-query basis. That looks like this:

q = session.query(User).options(eagerload('max_order'))
for user in q.select():
    print user, user.max_order

For fun, here's the SQL generated:

SELECT users.user_name AS users_user_name, users.user_id AS users_user_id,
    anon_760c.order_id AS anon_760c_order_id, anon_760c.user_id AS anon_760c_user_id,
    anon_760c.description AS anon_760c_description, 
    anon_760c.isopen AS anon_760c_isopen
FROM users LEFT OUTER JOIN (
    SELECT orders.order_id AS order_id, orders.user_id AS user_id, 
        orders.description AS description, orders.isopen AS isopen
    FROM orders, (
        SELECT max(orders.order_id) AS order_id
        FROM orders GROUP BY orders.user_id) AS max_orders_by_user
    WHERE orders.order_id = max_orders_by_user.order_id) AS anon_760c 
ON users.user_id = anon_760c.user_id 
ORDER BY users.oid, anon_760c.oid

In SQLAlchemy, easy things are easy; hard things take some effort up-front, but once you have your relations defined, it's almost magical how it pulls complex queries together for you.

.................

I'm giving a tutorial on Advanced Databases with SQLAlchemy at PyCon in February. Feel free to let me know if there is anything you'd like me to cover specifically.

MySQL backend performance

Vadim Tkachenko posted an interesting benchmark of the MyISAM vs InnoDB vs Falcon storage engines. (Falcon is the new backend that MySQL started developing after Oracle bought InnoDB.) For me the interesting part is not the part with the alpha code -- Falcon is competitive for some queries but gets absolutely crushed on others -- but how InnoDB is around 30% faster than MyISAM. And these are pure selects, supposedly where MyISAM is best.

Of course this is a small benchmark and YMMV, but this is encouraging to me because it suggests that if I ever have to use MySQL, I can use a backend with transactions, real foreign key support, etc., without sucking too badly performance-wise.

(It also suggests that people who responded to the post on postgresql crushing mysql in a different benchmark by saying, "well, if they wanted speed they should have used MyISAM," might want to reconsider their advice.)

Wednesday, January 10, 2007

Fun with three-valued logic

I thought I was pretty used to SQL's three-valued logic by now, but this still caused me a minute of scratching my head:

# select count(*) from _t;
 count
-------
  1306
(1 row)

# select count(*) from _t2;
 count
-------
 19497
(1 row)
Both _t and _t2 are temporary tables of a single column I created with SELECT DISTINCT.
# select count(*) from _t where userhash in (select userhash from _t2);
 count
-------
   982
(1 row)

# select count(*) from _t where userhash not in (select userhash from _t2);
 count
-------
     0
(1 row)

Hmm, 982 + 0 != 1306...

Turns out there was a null in _t2: "X in (set containing a null)" evaluates to null rather than false when nothing matches, and negating null still gives null. (The rule of thumb is that almost any operation involving null yields null.)
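
You can check this at the psql prompt (output reproduced from memory):

# select (1 not in (2, null)) is null;
 ?column?
----------
 t
(1 row)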

.................

I'm giving a tutorial on Advanced Databases with SQLAlchemy at PyCon in February. Feel free to let me know if there is anything you'd like me to cover specifically.

Tuesday, January 02, 2007

Good advice for Tortoise SVN users

My thinkpad R52's screen died a couple days ago. I decided that this time I was going to be a man and install Linux on my new machine: all our servers run Debian, and "apt-get install" is just so convenient vs manual package installation on Windows. And it looks like qemu is a good enough "poor man's vmware" that I could still test stuff in IE when necessary.

Alas, it was not to be. My new laptop is an HP dv9005, and although ubuntu's livecd mode ran fine, when it actually installed itself to the HDD and loaded X it did strange and colorful things to the LCD. Things that didn't resemble an actual desktop. When I told it to start in recovery mode instead it didn't even finish booting.

That was all the time I had to screw around, so I reinstalled Windows to start getting work done again. Which brings me (finally!) to this advice on tortoisesvn: it really puts teh snappy back in the tortoise. Thanks, anonymous progblogger!