Friday, January 12, 2007

Why SQLAlchemy impresses me

One of the reasons ORM tools have a spotted reputation is that it's really, really easy to write a dumb ORM that works fine for simple queries but performs like molasses once you start throwing real data at it.

Let me give an example of a situation where, to my knowledge, only SQLAlchemy of the Python (or Ruby) ORMs is really able to handle things elegantly, without gross hacks like "piggy backing."

Often you'll see a one-to-many relationship where you're not always interested in all of the -many side. For instance, you might have a users table, each associated with many orders. In SA you'd first define the Table objects, then create a mapper that's responsible for doing The Right Thing when you write "user.orders."

(I'm skipping connecting to the database for the sake of brevity, but that's pretty simple. I'm also avoiding specifying columns for the Tables by assuming they're in the database already and telling SA to autoload them. Besides keeping this code shorter, that's the way I prefer to work in real projects.)

users = Table('users', metadata, autoload=True)
orders = Table('orders', metadata, autoload=True)

class User(object): pass
class Order(object): pass

mapper(User, users, 
       properties={
           'orders':relation(mapper(Order, orders), order_by=orders.c.id),
       })

That "properties" dict says that you want your User class to provide an "orders" attribute, mapped to the orders table. If you are using a sane database, SQLAlchemy will automatically use the foreign keys it finds in the relation; you don't need to explicitly specify that it needs to join on "orders.user_id = user.id."

We can thus write

for user in session.query(User).select():
    print user.orders

So far this is nothing special: most ORMs can do this much. Most can also specify whether to do eager loading for the orders -- where all the data is pulled out via joins in the first select() -- or lazy loading, where orders are loaded via a separate query each time the attribute is accessed. Either of these can be "the right way" for performance, depending on the use case.

The tricky part is, what if I want to generate a list of all users and the most recent order for each? The naive way is to write

class User:
    @property
    def max_order(self):
        return self.orders[-1]

for user in session.query(User).select():
    print user, user.max_order

This works, but it requires loading all the orders when we are really only interested in one. If we have a lot of orders, this can be painful.

One solution in SA is to create a new relation that knows how to load just the most recent order. Our new mapper will look like this:

mapper(User, users, 
       properties={
           'orders':relation(mapper(Order, orders), order_by=orders.c.id),
           'max_order':relation(mapper(Order, max_orders, non_primary=True), uselist=False, viewonly=True),
       })

("non_primary" means the second mapper does not define persistence for Orders; you can only have one primary mapper at a time. "viewonly" means you can't assign to this relation directly.)

Now we have to define "max_orders." To do this, we'll leverage SQLAlchemy's ability to map not just Tables, but any Selectable:

max_orders_by_user = select([func.max(orders.c.order_id).label('order_id')],
                            group_by=[orders.c.user_id]).alias('max_orders_by_user')
max_orders = orders.select(orders.c.order_id==max_orders_by_user.c.order_id).alias('max_orders')

"max_orders_by_user" is a subselect whose rows are the max order_id for each user_id. Then we use that to define max_orders as the entire order row joined to that subselect on user_id.

We could define this as eager-by-default in the mapper, but in this scenario we only want it eager on a per-query basis. That looks like this:

q = session.query(User).options(eagerload('max_order'))
for user in q.select():
    print user, user.max_order
For fun, here's the sql generated:
SELECT users.user_name AS users_user_name, users.user_id AS users_user_id,
    anon_760c.order_id AS anon_760c_order_id, anon_760c.user_id AS anon_760c_user_id,
    anon_760c.description AS anon_760c_description, 
    anon_760c.isopen AS anon_760c_isopen
FROM users LEFT OUTER JOIN (
    SELECT orders.order_id AS order_id, orders.user_id AS user_id, 
        orders.description AS description, orders.isopen AS isopen
    FROM orders, (
        SELECT max(orders.order_id) AS order_id
        FROM orders GROUP BY orders.user_id) AS max_orders_by_user
    WHERE orders.order_id = max_orders_by_user.order_id) AS anon_760c 
ON users.user_id = anon_760c.user_id 
ORDER BY users.oid, anon_760c.oid

In SQLAlchemy, easy things are easy; hard things take some effort up-front, but once you have your relations defined, it's almost magical how it pulls complex queries together for you.

.................

I'm giving a tutorial on Advanced Databases with SQLAlchemy at PyCon in February. Feel free to let me know if there is anything you'd like me to cover specifically.

18 comments:

Anonymous said...

Your "that's the way I prefer to work" link isn't quite right...

Jonathan Ellis said...

Fixed.

Anonymous said...

"In SA you'd first define the Table objects, then create a mapper that's responsible for doing The Right Then when you write "user.orders."

Should that be "The Right Thing"? Nice post.

Jonathan Ellis said...

Quite right. Fixed that too. :)

Anonymous said...

Ugh, three levels of selects, that's some ugly SQL. I know MySQL loves to choke on that kind of query.

You are using a sub-sub-select to get all users and their last order, then joining that to a sub-select to get the order information for that last order per user, and then again an outer select to join the users to get their information.

Why not use a single sub-select as a "relational" table to link the users and order tables. Like so:

SELECT u.*, o.*
FROM users AS u
LEFT JOIN
(SELECT user_id, max(order_id) AS order_id
FROM orders
GROUP BY user_id) AS last_order
ON u.user_id = last_order.user_id
INNER JOIN orders AS o ON last_order.order_id = o.order_id
ORDER BY u.user_id

Jonathan Ellis said...

I found my way more intuitive, but I'm pretty sure a database with a decent optimizer will generate the same plan for both. (This may not include MySQL, as you say, but I don't see a point in writing baroque code to accommodate a database that I never use. :)

Anonymous said...

It doesn't impress me.

I think the generic SQL that ORMs tend to generate might be useful only part of the time. That's why for really caring ORMs, they make it easy for you to customize the SQL to your unique needs.

Maybe iBATIS is one of the most famous and impressive ones in doing that?

I have my own as well, which is "impressive" in its own way. hehe. It follows more of the way iBATIS works, without the XML need and with lots of conventions. :-)

Anyway, just ask yourself how could your ORM become more popular and solid at the same time? Is it easy to get there?

Best of luck to you.

Jonathan Ellis said...

SQLAlchemy allows custom SQL for several meanings of the phrase, but the point is, most of the time you don't need it. Which is the best of both worlds IMO.

(SA is not "my" ORM; I just recognize good software when I see it.)

Anonymous said...

Python and Ruby ORMs? Dunno. The main perl one handles this fine. DBIx::Class will happily prefetch all sorts of complex things in a single relation, and it's also trivial to select parts of a one-many rel to prefetch on the fly if you know that's all you're going to need.

Jonathan Ellis said...

I do not see anything in the DBIx::Class documentation that implies that it can do this cleanly (and I see some suggestions that it cannot), but if you are correct, then it's good that people who have to use Perl have that option. :)

Timur Izhbulatov said...

Good job! Thanks! I've just started playing with SA and your post is really helpful.

zzzeek said...

my impression is that iBatis is strictly a rowset-mapper. that in itself is maybe 10% of SQLAlchemy, which if you prefer allows completely straight textual SQL that can be turned into objects at the result set level just like iBatis. But theres no need for that level of tedium.

the eager loading queries Jonathan illustrates work just great with MySQL. SA has several hundred unit tests, dozens of which contain far more exotic queries than that, all of which pass 100% on MySQL and are in use on many production systems today.

also, SQLAlchemy has been adopted by virtually every sub-category of the Python community in less than a year after it was first released so we are doing OK in the popularity department.

Raven said...

Does SQLAlchemy do any kind of internal caching?

For example, if you ask for the same data twice (or an obvious subset of the initially requested data) will the database be hit once or twice?

I recently wrote a caching database abstraction layer for an application and (while fun) it was a fair bit of work to get it to a minimally functional state. If SQLAlchemy did that I would seriously consider jumping on the bandwagon.

I've found things in the docs that imply something like this might be going on, but nothing explicit.

Jonathan Ellis said...

No; the author of SA [rightly, IMO] considers caching a separate concern.

What you saw in the docs is probably the SA identity map, which makes it so if you load an instance in two different places, they will refer to the same object. But the database will still be queried twice, so it is not a cache in the sense you mean.

zzzeek said...

what youre talking about is a second level cache, which is distinct from SA's "session" which is cache-like, but thats not its main purpose (identity mapping is its main purpose).

SA doesnt have an integrated second-level cache at the moment. i think theyre a little bit overrated and also its quite easy to write your own that is closely tailored to how your app functions (since theyre usually just dictionaries).

to design a generic one that does what everybody wants without being impossible to configure/understand is a large undertaking, but we should eventually get one. i need to make a huge pass through myghtyutils first and clean it up....that would probably be the backend.

Matt said...

Jonathan,

In this line:
'max_order':relation(mapper(Order, max_orders, non_primary=True), uselist=False, viewonly=True),

How does SA know to join on the user_id column? You haven't set up any foreign keys, nor have you specified where to join.

I've got a tricky problem with an association table -- and I think this technique (of using selects) would work, but I'm having trouble connecting all the pieces.

Jonathan Ellis said...

I did have FKs in the database, so autoload picked them up. Sorry that wasn't clear.

love encounter flow said...

i am impressed by sqlalchemy too but i recently dumped it in favor of storm. i've been finding myself far too often in the jungle of documentation, trying to figure out how to do this, how to do that. frequently i was unable to come up with a solution and had to write sql anyway. sqlalchemy is just too complicated, and i feel i have to jump thru too many hoops to get things done. also i got a lot of strange errors i was unable to resolve with that relations things, when you tie several records from several tables together. i write a lot more raw sql now which i do not particularly like, but at least the level of complexity has decreased. using a db backend should not be so hard. my suspicion is that the current way to use orms to access db data is ultimately not the way to go, that there is a better solution waiting to be discovered.