Thursday, January 25, 2007

Komodo 4 released; new free version

ActiveState has released Komodo IDE 4. Perhaps more interesting, if you're not already a Komodo user, is the release of Komodo Edit, which is very similar to the old Komodo IDE Personal edition, only instead of costing around $30, Komodo Edit is free. The mental difference between "free" and "$30" is much more than the relatively small amount of money; it will be interesting to see what happens in the IDE space now.

After a brief evaluation I would say Edit is perhaps the strongest contender for "best free python IDE." The only serious alternative is PyDev, which on its Eclipse foundation provides features like svn integration that Edit doesn't. PyDev also includes a debugger, another feature ActiveState would like to see you upgrade to the full IDE for. But Komodo is stronger in other areas such as call tips and, well, not being based on Eclipse. I also think its code completion is better, although this impression is preliminary.

It's also worth noting that so far, Edit doesn't sport the "Non-commercial and educational use only" restrictions that Komodo Personal had.

Monday, January 22, 2007

Wednesday, January 17, 2007

Caution: upgrading to new version of blogger may increase spam

I was pretty happy with the old version of blogger, but I upgraded today so I can use the new API against my own blog. So far I have 4 spam comments (captcha is still on) versus about that number for the entire life of my blog under the old blogger. Bleh.

Could just be a coincidence. I hope so.

(Update Feb 26: A month later, I've had just one more spam comment. So it probably really was just coincidence.)

Monday, January 15, 2007

Abstract of "Advanced PostgreSQL, part 1"

In December, Fujitsu made available a video of Gavin Sherry speaking on Advanced PostgreSQL. (Where's part 2, guys?) Here's some of the topics Gavin addresses, and the approximate point at which they can be found in the video.

[start]
wal_buffers: "at least 64"; when it's ok to turn fsync off [not very often]; how hard disk rpm limits write-based transaction rate, even with WAL
00:12:
wal_sync_method = fdatasync is worth checking out on Linux
00:13:
FSM [free space map], MVCC, and vacuum; how to determine appropriate FSM size; why this is important to avoid VACUUM FULL
00:22:
vaccum_cost_delay
00:26:
background writer
00:30:
history of buffer replacement strategies
00:37:
scenarios where bgwriter is not useful
00:41:
how random_page_cost affects planner's use of indexes
00:47:
effective_cache_size
00:49:
logging; how to configure syslog to not hose your performance
00:52:
linux file system configuration
00:58:
solaris fs config
1:02:
raid; reliability; sata/scsi; battery-backed cache ("for $100, you can triple the write throughput of your system")
1:08:
tablespaces
1:12:
increasing pgsql_tmp performance for queries that exceed work_mem and how to tell if this is worth worrying about
1:15:40
cpu considerations

Friday, January 12, 2007

Why SQLAlchemy impresses me

One of the reasons ORM tools have a spotted reputation is that it's really, really easy to write a dumb ORM that works fine for simple queries but performs like molasses once you start throwing real data at it.

Let me give an example of a situation where, to my knowledge, only SQLAlchemy of the Python (or Ruby) ORMs is really able to handle things elegantly, without gross hacks like "piggy backing."

Often you'll see a one-to-many relationship where you're not always interested in all of the -many side. For instance, you might have a users table, each associated with many orders. In SA you'd first define the Table objects, then create a mapper that's responsible for doing The Right Thing when you write "user.orders."

(I'm skipping connecting to the database for the sake of brevity, but that's pretty simple. I'm also avoiding specifying columns for the Tables by assuming they're in the database already and telling SA to autoload them. Besides keeping this code shorter, that's the way I prefer to work in real projects.)

users = Table('users', metadata, autoload=True)
orders = Table('orders', metadata, autoload=True)

class User(object): pass
class Order(object): pass

mapper(User, users, 
       properties={
           'orders':relation(mapper(Order, orders), order_by=orders.c.id),
       })

That "properties" dict says that you want your User class to provide an "orders" attribute, mapped to the orders table. If you are using a sane database, SQLAlchemy will automatically use the foreign keys it finds in the relation; you don't need to explicitly specify that it needs to join on "orders.user_id = user.id."

We can thus write

for user in session.query(User).select():
    print user.orders

So far this is nothing special: most ORMs can do this much. Most can also specify whether to do eager loading for the orders -- where all the data is pulled out via joins in the first select() -- or lazy loading, where orders are loaded via a separate query each time the attribute is accessed. Either of these can be "the right way" for performance, depending on the use case.

The tricky part is, what if I want to generate a list of all users and the most recent order for each? The naive way is to write

class User:
    @property
    def max_order(self):
        return self.orders[-1]

for user in session.query(User).select():
    print user, user.max_order

This works, but it requires loading all the orders when we are really only interested in one. If we have a lot of orders, this can be painful.

One solution in SA is to create a new relation that knows how to load just the most recent order. Our new mapper will look like this:

mapper(User, users, 
       properties={
           'orders':relation(mapper(Order, orders), order_by=orders.c.id),
           'max_order':relation(mapper(Order, max_orders, non_primary=True), uselist=False, viewonly=True),
       })

("non_primary" means the second mapper does not define persistence for Orders; you can only have one primary mapper at a time. "viewonly" means you can't assign to this relation directly.)

Now we have to define "max_orders." To do this, we'll leverage SQLAlchemy's ability to map not just Tables, but any Selectable:

max_orders_by_user = select([func.max(orders.c.order_id).label('order_id')],
                            group_by=[orders.c.user_id]).alias('max_orders_by_user')
max_orders = orders.select(orders.c.order_id==max_orders_by_user.c.order_id).alias('max_orders')

"max_orders_by_user" is a subselect whose rows are the max order_id for each user_id. Then we use that to define max_orders as the entire order row joined to that subselect on user_id.

We could define this as eager-by-default in the mapper, but in this scenario we only want it eager on a per-query basis. That looks like this:

q = session.query(User).options(eagerload('max_order'))
for user in q.select():
    print user, user.max_order
For fun, here's the sql generated:
SELECT users.user_name AS users_user_name, users.user_id AS users_user_id,
    anon_760c.order_id AS anon_760c_order_id, anon_760c.user_id AS anon_760c_user_id,
    anon_760c.description AS anon_760c_description, 
    anon_760c.isopen AS anon_760c_isopen
FROM users LEFT OUTER JOIN (
    SELECT orders.order_id AS order_id, orders.user_id AS user_id, 
        orders.description AS description, orders.isopen AS isopen
    FROM orders, (
        SELECT max(orders.order_id) AS order_id
        FROM orders GROUP BY orders.user_id) AS max_orders_by_user
    WHERE orders.order_id = max_orders_by_user.order_id) AS anon_760c 
ON users.user_id = anon_760c.user_id 
ORDER BY users.oid, anon_760c.oid

In SQLAlchemy, easy things are easy; hard things take some effort up-front, but once you have your relations defined, it's almost magical how it pulls complex queries together for you.

.................

I'm giving a tutorial on Advanced Databases with SQLAlchemy at PyCon in February. Feel free to let me know if there is anything you'd like me to cover specifically.

MySQL backend performance

Vadim Tkachenko posted an interesting benchmark of MyISAM vs InnoDB vs Falcon datatypes. (Falcon is the new backend that MySQL started developing after Oracle bought InnoDB.) For me the interesting part is not the part with the alpha code -- Falcon is competitive for some queries but gets absolutely crushed on others -- but how InnoDB is around 30% faster than MyISAM. And these are pure selects, supposedly where MyISAM is best.

Of course this is a small benchmark and YMMV, but this is encouraging to me because it suggests that if I ever have to use MySQL, I can use a backend with transactions, real foreign key support, etc., without sucking too badly performance-wise.

(It also suggests that people who responded to the post on postgresql crushing mysql in a different benchmark by saying, "well, if they wanted speed they should have used MyISAM," might want to reconsider their advice.)

Wednesday, January 10, 2007

Fun with three-valued logic

I thought I was pretty used to SQL's three-valued logic by now, but this still caused me a minute of scratching my head:

# select count(*) from _t;
 count
-------
  1306
(1 row)

# select count(*) from _t2;
 count
-------
 19497
(1 row)
Both _t and _t2 are temporary tables of a single column I created with SELECT DISTINCT.
# select count(*) from _t where userhash in (select userhash from _t2);
 count
-------
   982
(1 row)

# select count(*) from _t where userhash not in (select userhash from _t2);
 count
-------
     0
(1 row)

Hmm, 982 + 0 != 1306...

Turns out there was a null in _t2; X in {set containing null} evaluates to null, not false, and negating null still gives null. (The rule of thumb is, any operation on null is still null.)

.................

I'm giving a tutorial on Advanced Databases with SQLAlchemy at PyCon in February. Feel free to let me know if there is anything you'd like me to cover specifically.

Tuesday, January 02, 2007

Good advice for Tortoise SVN users

My thinkpad R52's screen died a couple days ago. I decided that this time I was going to be a man and install Linux on my new machine: all our servers run Debian, and "apt-get install" is just so convenient vs manual package installation on Windows. And it looks like qemu is a good enough "poor man's vmware" that I could still test stuff in IE when necessary.

Alas, it was not to be. My new laptop is an HP dv9005, and although ubuntu's livecd mode ran fine, when it actually installed itself to the HDD and loaded X it did strange and colorful things to the LCD. Things that didn't resemble an actual desktop. When I told it to start in recovery mode instead it didn't even finish booting.

That was all the time I had to screw around, so I reinstalled Windows to start getting work done again. Which brings me (finally!) to this advice on tortoisesvn: it really puts teh snappy back in the tortoise. Thanks annonymous progblogger!