One of the reasons ORM tools have a spotted reputation is that it's really, really easy to write a dumb ORM that works fine for simple queries but performs like molasses once you start throwing real data at it.
Let me give an example of a situation where, to my knowledge, only SQLAlchemy of the Python (or Ruby) ORMs is really able to handle things elegantly, without gross hacks like "piggy backing."
Often you'll see a one-to-many relationship where you're not always interested in all of the -many side. For instance, you might have a users table, each associated with many orders. In SA you'd first define the Table objects, then create a mapper that's responsible for doing The Right Thing when you write "user.orders."
(I'm skipping connecting to the database for the sake of brevity, but that's pretty simple. I'm also avoiding specifying columns for the Tables by assuming they're in the database already and telling SA to autoload them. Besides keeping this code shorter, that's the way I prefer to work in real projects.)
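For anyone following along at home, the skipped setup is roughly this (the connection URL is a placeholder, and this assumes the 0.3-era API used in the rest of the post):

from sqlalchemy import *

# connect and bind the metadata; substitute your own database URL
engine = create_engine('postgres://scott:tiger@localhost/test')
metadata = BoundMetaData(engine)
session = create_session()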
users = Table('users', metadata, autoload=True)
orders = Table('orders', metadata, autoload=True)

class User(object): pass
class Order(object): pass

mapper(User, users, properties={
    'orders': relation(mapper(Order, orders), order_by=orders.c.order_id),
})
That "properties" dict says that you want your User class to provide an "orders" attribute, mapped to the orders table. If you are using a sane database, SQLAlchemy will automatically use the foreign keys it finds in the relation; you don't need to explicitly specify that it needs to join on "orders.user_id = user.id."
We can thus write
for user in session.query(User).select():
    print user.orders
So far this is nothing special: most ORMs can do this much. Most can also specify whether to do eager loading for the orders -- where all the data is pulled out via joins in the first select() -- or lazy loading, where orders are loaded via a separate query each time the attribute is accessed. Either of these can be "the right way" for performance, depending on the use case.
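In SA that choice is a single keyword on the relation, and it can also be overridden per-query. A sketch (lazy=True is the default):

mapper(User, users, properties={
    # eager: pull the orders in the same statement via a LEFT OUTER JOIN
    'orders': relation(mapper(Order, orders), lazy=False),
})

# ...or leave the relation lazy and decide at query time:
q = session.query(User).options(eagerload('orders'))
users_with_orders = q.select()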
The tricky part is, what if I want to generate a list of all users and the most recent order for each? The naive way is to write
class User(object):
    @property
    def max_order(self):
        return self.orders[-1]

for user in session.query(User).select():
    print user, user.max_order
This works, but it requires loading all the orders when we are really only interested in one. If we have a lot of orders, this can be painful.
One solution in SA is to create a new relation that knows how to load just the most recent order. Our new mapper will look like this:
mapper(User, users, properties={
    'orders': relation(mapper(Order, orders), order_by=orders.c.order_id),
    'max_order': relation(mapper(Order, max_orders, non_primary=True),
                          uselist=False, viewonly=True),
})
("non_primary" means the second mapper does not define persistence for Orders; you can only have one primary mapper at a time. "viewonly" means you can't assign to this relation directly.)
Now we have to define "max_orders." To do this, we'll leverage SQLAlchemy's ability to map not just Tables, but any Selectable:
max_orders_by_user = select([func.max(orders.c.order_id).label('order_id')],
                            group_by=[orders.c.user_id]).alias('max_orders_by_user')

max_orders = orders.select(
    orders.c.order_id == max_orders_by_user.c.order_id).alias('max_orders')
"max_orders_by_user" is a subselect whose rows are the max order_id for each user_id. Then we use that to define max_orders as the entire order row joined to that subselect on user_id.
We could define this as eager-by-default in the mapper, but in this scenario we only want it eager on a per-query basis. That looks like this:
q = session.query(User).options(eagerload('max_order'))
for user in q.select():
    print user, user.max_order

For fun, here's the SQL generated:
SELECT users.user_name AS users_user_name, users.user_id AS users_user_id,
       anon_760c.order_id AS anon_760c_order_id,
       anon_760c.user_id AS anon_760c_user_id,
       anon_760c.description AS anon_760c_description,
       anon_760c.isopen AS anon_760c_isopen
FROM users
LEFT OUTER JOIN (
    SELECT orders.order_id AS order_id, orders.user_id AS user_id,
           orders.description AS description, orders.isopen AS isopen
    FROM orders,
        (SELECT max(orders.order_id) AS order_id
         FROM orders
         GROUP BY orders.user_id) AS max_orders_by_user
    WHERE orders.order_id = max_orders_by_user.order_id
) AS anon_760c ON users.user_id = anon_760c.user_id
ORDER BY users.oid, anon_760c.oid
In SQLAlchemy, easy things are easy; hard things take some effort up-front, but once you have your relations defined, it's almost magical how it pulls complex queries together for you.
.................
I'm giving a tutorial on Advanced Databases with SQLAlchemy at PyCon in February. Feel free to let me know if there is anything you'd like me to cover specifically.
Comments
Should that be "The Right Thing"? Nice post.
You are using a sub-sub-select to get each user's last order, joining that to a sub-select to get the order information for that last order, and then an outer select to join in the users' information.
Why not use a single sub-select as a "relational" table to link the users and orders tables? Like so:
SELECT u.*, o.*
FROM users AS u
LEFT JOIN
(SELECT user_id, max(order_id) AS order_id
FROM orders
GROUP BY user_id) AS last_order
ON u.user_id = last_order.user_id
INNER JOIN orders AS o ON last_order.order_id = o.order_id
ORDER BY u.user_id
I think the generic SQL that ORMs tend to generate might be useful only part of the time. That's why ORMs that really care make it easy for you to customize the SQL to your unique needs.
Maybe iBATIS is one of the most famous and impressive ones at doing that?
I have my own as well, which is "impressive" in its own way. hehe. It follows more of the way iBATIS works, without the need for XML and with lots of conventions. :-)
Anyway, just ask yourself: how could your ORM become more popular and solid at the same time? Is it easy to get there?
Best of luck to you.
(SA is not "my" ORM; I just recognize good software when I see it.)
the eager loading queries Jonathan illustrates work just great with MySQL. SA has several hundred unit tests, dozens of which contain far more exotic queries than that, all of which pass 100% on MySQL and are in use on many production systems today.
also, SQLAlchemy has been adopted by virtually every sub-category of the Python community in less than a year after it was first released so we are doing OK in the popularity department.
For example, if you ask for the same data twice (or an obvious subset of the initially requested data), will the database be hit once or twice?
I recently wrote a caching database abstraction layer for an application and (while fun) it was a fair bit of work to get it to a minimally functional state. If SQLAlchemy did that I would seriously consider jumping on the bandwagon.
I've found things in the docs that imply something like this might be going on, but nothing explicit.
What you saw in the docs is probably the SA identity map, which makes it so if you load an instance in two different places, they will refer to the same object. But the database will still be queried twice, so it is not a cache in the sense you mean.
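To illustrate with the query API from this post (a sketch: two SELECTs are issued, but only one instance comes back):

u1 = session.query(User).select(users.c.user_id == 1)[0]
u2 = session.query(User).select(users.c.user_id == 1)[0]
assert u1 is u2  # identity map: the same object, but the database was hit twice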
SA doesn't have an integrated second-level cache at the moment. I think they're a little bit overrated, and it's also quite easy to write your own that is closely tailored to how your app functions (since they're usually just dictionaries).
To design a generic one that does what everybody wants, without being impossible to configure or understand, is a large undertaking, but we should eventually get one. I need to make a huge pass through myghtyutils first and clean it up... that would probably be the backend.
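Something like this dictionary approach is all a hand-rolled cache needs to start with (a sketch; _user_cache and get_user are hypothetical application code, not SA API):

_user_cache = {}  # hypothetical app-level cache: just a dict, keyed on the query parameter

def get_user(session, user_id):
    # hit the database only on a cache miss
    if user_id not in _user_cache:
        _user_cache[user_id] = session.query(User).select(users.c.user_id == user_id)[0]
    return _user_cache[user_id]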
In this line:
'max_order':relation(mapper(Order, max_orders, non_primary=True), uselist=False, viewonly=True),
How does SA know to join on the user_id column? You haven't set up any foreign keys, nor have you specified where to join.
I've got a tricky problem with an association table -- and I think this technique (of using selects) would work, but I'm having trouble connecting all the pieces.