Monday, January 02, 2006

ORM design part 2

Glyph Lefkowitz cites me as an inspiration to write Why Axiom Doesn't Expose SQL. Alas, I disagree with most of what he says.

My post was about how if you're writing a tool that presupposes the use of a relational database, it's stupid to try to protect your users from having to know SQL. (This also means I think projects that bend over backwards to pretend ALTER TABLE is too hard are misguided, as well. But that's another subject.)

Glyph's first argument is that any form of SQL is an invitation to sql injection attacks. This particular form of scare mongering isn't appreciated. Come on: this is 2005. It's ridiculously easy to write injection-proof SQL, even by hand. Arguing that allowing SQL allows injection attacts is like arguing that coding in python allows "shutil.rmtree('/')": correct, but irrelevant.

Glyph further claims that "interfaces should be complete things," and that this justifies trying to obliterate any trace of SQL from your ORM. Because if you need SQL, your interface just wasn't complete enough.

The thing is, an ORM isn't just an interface: it's an extra layer of abstraction. Abstraction that obscures or prevents access to its internals smacks of the kind of we-know-better-than-you thinking that infects most statically typed language libraries. Don't do this.

Trying to design for "everything the user could possibly want, or at least everything we think he should want," while easier than providing for the unexpected, is almost always doomed to failure. Glyph gives the example of the Python os module -- why use C when os gives you everything you want? -- and yet, my current work project involves multiple C modules. If the Python designers had taken the attitude of "we'll give you all the everything we think you need, and if we leave out something you want, too bad," Python would have been a non-candidate for the job, and I'd be working somewhere else.

Fundamentally, tool designers more concerned about providing power and usability know that sometimes you need to go to a lower level of abstraction, and design for this up-front. Just like Python.

I do agree completely with Glyph on one point:

[M]ost ORMs are really, really awful. They are heavy on the "object" and not so much on the "database". There are a lot of features that SQL provides which they don't expose.

It sometimes seems like we're rapidly approaching the point where there will be as many half-assed Python ORM tools as there are half-assed Python web frameworks. Don't build another half-assed tool. Pick an existing project that sucks less than the others and improve it. One ORM that doesn't suck much (I mean that as a compliment) is PyDO2. Another is SQLAlchemy. I've been trying SQLAlchemy out during the last few days, and I really like the direction it's going in. (Don't be too intimidated by the introduction: it looks overcomplicated, but it's actually exceptionally well thought-out.)

18 comments:

Glyph Lefkowitz said...

Hi, Jonathan :)

As far as it's "ridiculously easy" to write injection-proof SQL - it's almost as easy to write buffer-overflow-proof C. Yet, here we are, it's 2006 now, and otherwise bright people are still writing C code with buffer overflows in it and web-based relational database access code with tons of SQL injection vulnerabilities.

In the article you cite about "we-know-better-than-you thinking", you say that code generation is "fragile at best". Why is SQL code generation less fragile than Python or C# code generation? If you were writing a program that would generate Python code (however strict its quoting rules), I would, rightly, object that it opened the door to shutil.rmtree attacks. Similarly, if you are actually writing your code in SQL, rather than generating SQL from some other language, I suspect it would not be so vulnerable to attacks.

All I'm saying is that you should anticipate the unexpected by writing a non-crappy ORM, not writing a crappy one and letting your users patch it up by executing their own SQL. The C modules that you're using, I assume, aren't interfaces to POSIX functionality, but new functionality in their own right. The equivalent in an ORM is providing graceful access to stored procedures. That doesn't mean that you should be generating the stored procedures (or for that matter, C functions) from your application code.

Jonathan Ellis said...

There are significantly more ways to shoot yourself in the foot with buffer overflows in C than there are to expose yourself to an SQL injection, and the defense mechanisms require much more state to be kept in a programmer's head. Equating the two is just silly. Frankly it sounds like you're saying "I got burned by sql injection so it must be a Hard Problem." It's okay to screw up; now you know better. But it's a solved problem for the majority of the world!

As to code generation, you're conflating two issues:

1) runtime generation is much less fragile than compile-time for some areas, such, as, well, database introspection

2) generation of code that doesn't need to be hand-modified afterwards (most SQL) is inherently more robust than generation of code that does (most Python or C#)

Finally, for C modules, we did indeed have to write our own wrapper for some POSIX pthread functions in order to get decent condition variable performance. Sometimes that 1% of code you leave out of your tool that does 99% of what users need is the dealbreaker... but every user will have a different 1%.

Jonathan Ellis said...

... I should perhaps note that picking on code generation is something of a straw man. The main point of that earlier post was, as I quoted Sion,

"[D]enying derived classes full access to your inner workings just leads to clumsier (less readable and more bug-prone) implementations derivations... [I]t's based on the hubris that you are a better programmer than anyone who might want to extend your class and can forsee all circumstances in which it might be subclassed."

The application to ORM tools that try to keep you away from SQL is, I think, clear.

Jacob S. said...

Different ORMs try to play quite different roles. Everyone agrees that it is ideal for an application to provide a high-level interface to its database layer which conceals its implementation, and to which one sends very simple messages rather than querying directly with SQL. An ORM library may aim to be a complete wrapper and enforce this separation itself, possibly at the cost of forcing you to use the database in particular way, whether good or bad, or it may merely facilitate your (the software developer's) enforcement of that separation in the creation of your own abstraction layer. The choice here is a matter of taste, probably; PyDO for one prefers the second approach. No matter how pure your ORM is, you can write impure code with it; this being the case, I'd personally rather have a good compromise between convenience and flexibility. Simplicity is nice, too.

The comparison with C and memory management isn't entirely apropos, I think. With C, you are forced to manage memory everywhere; it pervades everything you do, and is usually irrelevant to everything you really want to be thinking about while writing an application. An ORM that permits you to use SQL directly doesn't force you to use it everywhere; and when you do use it, the reason for doing so is likely to be relevant to your purpose in writing the application, so you'll think about it as carefully as you know how.

Ian Bicking said...

I'm unclear what the criticism of sqlobject-admin is. That it doesn't generate any ALTER statements for you, forcing you to write them by hand?

Jonathan Ellis said...

Briefly, the criticism of sqlobject-admin is, it's a solution in search of a problem. There already exists a well-defined, standard, supported-by-many-tools method of interacting with your database in DDL, and sqlobject-admin wants to discourage in favor of more verbose, nonstandard, unsupported-by-other-tools sqlobject class definitions... why?

Anonymous said...

(troll)
"everything the user could possibly want, or at least everything we think he should want" should be part of the Twisted motto. The rest could be:

* in the core package (serial line support)
* with some bizzare, non-standard-compliant implementation (https)
* while reimplementing half the Python standard library b/c it's synchronous and we're smarter (logging)

What did you expect from Glyph?
(/troll)

That said, Twisted's a great library.

Ken Kinder said...

Using an ORM primarily to prevent sql-injection attacks is, to say the least, overkill. SQL injection attacks are quite rare in Python because the DBAPI gives you such a great interface for escaping values.

That said, I've found the most compelling reason for using an ORM is to automate tasks where you need don't know ahead of time what your query will look like.

I've found that in many of those instances, simple methods that map dictionaries to queries and back are quite sufficient, like dict2insert or dict2update.

But I still love SQLObject. :)

Ian Bicking said...

I think you misunderstand the role of sqlobject-admin. Code almost always expects a certain schema. In SQLObject that is made explicit, because the class definition also is (generally) sufficient to determine exactly what schema you expect. But whether the expectation is explicit or implicit, it is always there.

sqlobject-admin manages the differences between the code's expectations and the actual database. So you can record versions, upgrade a database (via hand-written SQL), create a database from scratch, and see some of the differences between the database and the expected schema.

This isn't reinventing the database DDL, but simply a tool that addresses fairly universal issues. You could write the same thing with DDL parsers and whatnot, or the databases could build these sorts of tools in. But DDL parsers are hard and uncommon, and AFAIK databases don't include these kinds of tools.

As for SQL injection, the blame lays heavy on the shoulders of PHP. They just got it horribly wrong, so that doing the right thing looks *almost* like doing the very wrong thing, and so that the wrong thing is easier. That's just dumb design, I don't think it really implies much about larger concepts in SQL and database abstraction. It does imply a lot about what your failure cases should be, but that's a different discussion.

Glyph Lefkowitz said...

Well, I'm happy to have gotten such a lively discussion going :-).

I'll stop trolling for just a moment here to point out another vector of agreement that's probably non-obvious from the way I've been talking about this. Of course your ORM should let you get at SQL, just like Python should let you get at C.

If you've written a crappy ORM that needs patching up, the last thing your users are going to want to do is to crack open your source code to start diddling things around so they can get to the SQL, already. As the python community motto goes, "we're all adults here". Axiom does have a way to get at your SQL, "Store.executeSQL". (And by the way, that method will actually raise an exception if you attempt to insert SQL strings by hand rather than using bind parameters ;-))

My main point here is that every time a user has to go and call that method, I should recognize it as a failure of my code and attempt to fix it (as I believe other ORM authors should), rather than just accepting "Oh, all my users should know SQL, so if you want to access feature XYZ of course you have to generate some SQL strings yourself".

Marius Gedminas said...

I just wanted to note that the notional link to "Why Axiom Doesn't Expose SQL" actually points to the front page of blogger.com

Shalabh said...

Python cannot beat domain specific languages, specially good domain specific languages such as SQL, in expressivity. That is why most people familiar with SQL will always feel constrained with any wrapper/mapper in Python. It doesn't matter if an SQL 'feature' is provided, the fact that it's in Python not SQL has already taken it away from the relational model. Unless you have a relational wrapper in Python (which is different from an OR mapper, btw).

From the little I've seen SQLAlchemy does a good job of exposing both - a relational wrapper as well as an OR mapper. Otherwise if you have a pure OR mapper, the users will always have to compromise on control, expressivity and/or performance. Of course, the benefit of gaining persistent objects may justify the use in many cases.

Florian said...

I do know SQL fairly enough and yet I do not wish to use it directly, since ORM semantics may not yield the full expressiveness of SQL, but they tie nicely in with my code, which SQL simply doesn't.

I like sqlobject-admin for beeing an easy tool to make me worry less about how to setup the correct database from my ORM modell, +1.

One of a the nice things to have would be a tool that helps you convert one schema of a database to another mostly automated. Would quite take a bit of the edge out of changing schemas.

Anonymous said...

Real-life SQL Injection vulnerabilities:

http://www.securityfocus.com/bid/15912
http://www.securityfocus.com/bid/13620
http://www.securityfocus.com/bid/13513
http://www.securityfocus.com/bid/16020
http://www.securityfocus.com/bid/16039
http://www.securityfocus.com/bid/13030
http://www.securityfocus.com/bid/16038
http://www.securityfocus.com/bid/16097
http://www.securityfocus.com/bid/15088
http://www.securityfocus.com/bid/15676
http://www.securityfocus.com/bid/14618
http://www.securityfocus.com/bid/15598
http://www.securityfocus.com/bid/13649
http://www.securityfocus.com/bid/13643
http://www.securityfocus.com/bid/15330
http://www.securityfocus.com/bid/13355
http://www.securityfocus.com/bid/15787

Mostly from 2005. There's lots more. I got tired of pasting links.

Solved problem? No. No it isn't.

mike bayer said...

Anonymous said...

Real-life SQL Injection vulnerabilities:
http://www.securityfocus.com/bid/15912
http://www.securityfocus.com/bid/13620
http://www.securityfocus.com/bid/13513
...
Solved problem? No. No it isn't.

OK, now paste one that isnt solved by using bind parameters properly.

Solved problem ? Yes. Yes it is.

Jayson Vantuyl said...

Actually, I would go so far as to say that Axiom uses SQL in a few ways that exposing SQL would make impractical.

First, any ORM must be able to know what it can cache, what it can't, and how everything stands in the face of transactions. Exposing SQL can make this difficult.

Second, Axiom does a lot of automatic table munging on the fly (schema upgrades, entity references, List objects) that would be almost impossible with exposing of SQL.

In the end, I think that Glyph probably missed the mark a bit. There are interesting objecty things that can be done with Axiom. There is enough of an interface to the relational model in the Axiom's queries and references that you can still take advantage of the underlying SQLs speed.

That said, implementing something in Axiom generally would make it almost impossible to properly munge directly through SQL because of how complex its data management is (especially across versions).

There is also serious responsibility that comes with keeping both the coding API and SQL storage implementation solid across versions. This is because Axiom uses SQL very differently from most ORMs. Look at powerups, for example. Nobody else does this as elegantly.

This is not a problem with Axiom, its a strength of it. It implements a ODB with some of the speed and power of a RDB. It couldn't do it if you were mucking about with the internals.

One other nicety of not exposing SQL is that it makes it fairly simple to port to another database. If someone dropped the right amount of money, I bet you could see Axiom running on MySQL, PostgreSQL, or Oracle in a matter of days. Interestingly, not a single Axiom derived application would break.

Anyone who has watched the efforts to port any large SQL-based system to another database knows the trouble this saves. I, for one, believe that anything that can be expressed in vanilla SQL can be represented with an appropriate API under any database system. SQL is really too simple to make this difficult. It's one of SQL's strengths.

Axiom doesn't much expose SQL for a these very good reasons. Preventing the underskilled from creating SQL injection attacks is a useful as a side effect. Tts focus of this discussion is likely an unnecessary distraction. The real reason that Axiom doesn't expose it is simply because it makes Axiom too complex to be usefully maintained and it makes a number of very useful things far too difficult.

Jonathan Ellis said...

Ian: I don't think I'm comfortable with this idea that "if a database conforms to schema A, and my code expects schema B, then the right thing to do is to automatically change A to B." Because A is different for a reason. Sometimes that reason is just that A is an older version but not always.

But you are right that this can be a useful tool.

Max Belugin said...

I think the problem is in th following: ORM is about mapping object model to relational model. While object is a good enough structural element to be mapped to the database, there is not good enough thing to be mapped to query language. We have imperative language, but SQL is declarative and though better to do operations with large amount of data.

We need a declarative model to operate with objects.
And Bruce Eckel agree with me :).

And once it would be created we get object advantages from these queries so for example querying
[x for x in employee if x.age>25]
will not depend on whether age is a stored field or a property (today()-self.birthdate)

SQL in ORM is a kind of embedded assembler in the high level languages: the current compromise between comfort and performance