Monday, February 06, 2006

Why schema definition belongs in the database

Earlier, I wrote about how ORM developers shouldn't try to re-invent SQL. It doesn't need to be done, and you're not likely to end up with an actual improvement. SQL may be designed by committee, but it's also been refined through thousands, if not millions, of man-years of database experience.

The same applies to DDL. (Data Definition Language -- the part of the SQL standard that deals with CREATE and ALTER.) Unfortunately, a number of Python ORMs are trying to replace DDL with a homegrown Python API. This is a Bad Thing. There are at least four reasons why:

  • Standards compliance
  • Completeness
  • Maintainability
  • Beauty

Standards compliance

SQL DDL is a standard. That means if you want something more sophisticated than Emacs, you can choose any of half a dozen modeling tools like ERwin or ER/Studio to generate and edit your DDL.

The Python data definition APIs, by contrast, aren't even compatible with other Python tools. You can't take a table definition from SQLObject, PyDO2, Django, or SQLAlchemy, and use it with any of the others.

A quote from the Django site:

Our philosophy is that the model (in Python code) is the definitive place to store all model logic.

Yes, application logic belongs in application code. But the definitive source for the schema definition should be the database, unless you're using an object database like Durus or ZODB. Of course, the reason those (and OODBs in general) haven't taken off is that, except for very simple applications, people want to access their data from more than just their Python code. So respect that. (Or be honest and require your OODB of choice.) Encourage the use of standards instead of a proprietary API. Let the database be the source of truth for the schema, and use standard tools to define it.

Completeness

Another strength of relational databases is the ability to define data integrity rules in a declarative fashion. It's well understood by now that declarative code is much less error-prone (and quicker to write) than the procedural equivalents. So if you're going to re-invent DDL, you need to support ON DELETE and ON UPDATE clauses. You need to support REFERENCES. You need to support CHECK constraints. You need to support defining triggers.
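
For concreteness, here is a minimal sketch of those declarative features. The table and column names are hypothetical, and the stdlib sqlite3 module is used only to keep the example self-contained; how strictly each clause is enforced varies by backend and version:

    # Hypothetical schema exercising the declarative features named above.
    # Note that SQLite only enforces REFERENCES clauses when the
    # foreign_keys pragma is switched on.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
    CREATE TABLE parent (
        id          INTEGER PRIMARY KEY,
        name        VARCHAR(20) NOT NULL,
        child_count INTEGER NOT NULL DEFAULT 0
                    CHECK (child_count >= 0)   -- CHECK constraint
    );
    CREATE TABLE child (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL
                  REFERENCES parent(id)        -- REFERENCES
                  ON DELETE CASCADE            -- ON DELETE clause
                  ON UPDATE CASCADE,           -- ON UPDATE clause
        name      VARCHAR(20) NOT NULL
    );
    -- A trigger, keeping child_count current declaratively:
    CREATE TRIGGER bump_child_count AFTER INSERT ON child
    BEGIN
        UPDATE parent SET child_count = child_count + 1
        WHERE id = NEW.parent_id;
    END;
    """)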

I don't know of any ORM tool that allows doing all this from Python. So selling your tool as "you don't need to know DDL, just use our API" isn't doing users any favors.

It's okay to say, "here's our API; it lets you do a subset of what your database can handle, so you can jump right in and get started without learning DDL." There's no harm in this. But when your tool is designed such that it doesn't expose the power of your database, yet doesn't really work with externally defined schemas either, that's a Bad Thing.

Maintainability

Even the best-designed schema will need to be updated eventually. If your project lives long enough, this is 100% certain. So if you're going to replace DDL, you'd better have an upgrade strategy.

But an automatic approach that only knows that you want to update from schema A to schema B is doomed to failure. [Update: I thought that sqlobject-admin took this approach, but Ian Bicking corrected me.] Say you add a column to a table: what should the value be for existing rows? Sometimes NULL. Sometimes some default value. Sometimes the value should be derived from data in other tables. There's no way to automate this for all situations, and usually sooner rather than later, you're back to DDL again.
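
To illustrate with the add-a-column case (table names hypothetical): the hand-written upgrade below is trivial, but the backfill UPDATE encodes a business decision that no automatic A-to-B differ can infer:

    # Hypothetical upgrade script: add a column, then backfill it from
    # data in another table. A schema differ could emit the ALTER, but
    # not the UPDATE.
    import sqlite3

    conn = sqlite3.connect("app.db")
    cur = conn.cursor()
    cur.execute("ALTER TABLE orders ADD COLUMN total_cents INTEGER")
    cur.execute("""
        UPDATE orders
        SET total_cents = (SELECT COALESCE(SUM(price_cents * quantity), 0)
                           FROM order_items
                           WHERE order_items.order_id = orders.id)
    """)
    conn.commit()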

Instead of spending effort on a fundamentally flawed approach, better to encourage standard best practices. The "right way" to maintain databases, which everyone who works with them long enough settles on eventually, is DDL scripts, checked into source control. Old-fashioned, but if you stick to it, you'll never have a situation where you start an upgrade on your live server and run into a problem halfway through, because you've already run the exact same scripts on your test server. A good ORM design accommodates this, rather than making it difficult.
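
As a minimal sketch of that practice, assuming a migrations/ directory of numbered .sql files (the layout and the bookkeeping table are this example's convention, not any particular tool's):

    # Apply numbered DDL scripts in order, recording which have already
    # run, so the identical sequence executes on test and live servers.
    import os, sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = set(row[0] for row in conn.execute("SELECT name FROM schema_migrations"))

    for name in sorted(os.listdir("migrations")):  # e.g. 001_init.sql, 002_totals.sql
        if name.endswith(".sql") and name not in applied:
            with open(os.path.join("migrations", name)) as f:
                conn.executescript(f.read())
            conn.execute("INSERT INTO schema_migrations VALUES (?)", (name,))
            conn.commit()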

Beauty

Okay, maybe DDL isn't the most beautiful creature ever birthed by a standards committee. But a lot of things are less beautiful, and those are what you get when you try to design DDL out.

  • Re-inventing the wheel is not beautiful. Like the Django guys said (about templates), "don't invent a programming language." Right idea. Spend that energy doing something useful instead.
  • Violating DRY isn't beautiful. As described above, your users will need DDL at some point. When that happens, are you going to make their lives harder by forcing them to update their DDL-ish model in a separate .py file as well (with all the attendant possibilities for mistakes), or are you going to make their lives easier with an option to simply introspect the changes?

    (It's true that an ORM tool can't divine everything you want to say about your model on the Python side from the database. This is particularly true for SQLAlchemy, which lets you go beyond the simple "one table, one class" paradigm. But that's no reason to force the programmer to duplicate the parts that an ORM can and should introspect: column types, foreign keys, uniqueness and other constraints, etc. A sketch of this kind of introspection follows after this list.)

  • Treating the database like a dumb object store is not beautiful. Even MySQL supports (simple) triggers and most constraint types these days. Allow users to realize the power afforded by a modern RDBMS. If your ORM encourages users to avoid features that, say, MySQL 3 doesn't have, you're doing something wrong.
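
The introspection sketch promised above, using SQLAlchemy table reflection. (This is present-day reflection syntax, which differs from the API as it existed when this post was written; the connection URL is hypothetical.)

    # Reflect an existing table instead of restating it in Python: column
    # types, foreign keys, and constraints all come from the database.
    from sqlalchemy import MetaData, Table, create_engine

    engine = create_engine("sqlite:///app.db")
    metadata = MetaData()

    users = Table("users", metadata, autoload_with=engine)
    print([column.name for column in users.columns])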

Conclusion

Avoid the temptation to re-invent the wheel. Respect your users, and allow them to continue to use industry-standard schema specification tools. Encourage using the strengths of modern relational databases instead of ignoring them. Don't require behavior that will cause your users pain down the road.

I mentioned four ORM tools near the beginning of this post. Here's how I see them with respect to this discussion:

  • PyDO2: The purest "let the database specify the schema" approach of the four. Supports runtime introspection. Does not generate DDL from its models; if you manually specify column types, it's probably because you only want a subset of the table's columns to show up in your class.
  • SQLAlchemy: Allows generation of DDL from .py code, but neither requires it nor (in my reading) encourages it for production use. Robust runtime introspection.
  • SQLObject: Supports runtime introspection (buggy when I used it, but may be better now). Its Python-based API does not support modern database features. (In a deleted comment here -- still viewable at this writing through Google Cache -- Ian Bicking says that SQLObject prefers what Martin Fowler calls an Application Database, which, as near as I can tell, means that SQLObject is fine if you would be better off using an OODB; otherwise, it may be a poor choice. Perhaps the deletion indicates he's had second thoughts on this.)
  • Django: The most clearly problematic. No runtime introspection support; the schema must be specified in its Python-based API, which does not support modern database features. (Apparently their approach -- paraphrased -- is, "if sucky database XYZ doesn't support a feature, we won't support it for anyone.") Django's ORM does have an external tool for generating .py models from an existing database, but once you start making changes, you're on your own: if you don't mind clearing data, you can pipe the output of the appropriate django-admin.py sqlreset command into your database's command-line utility. Otherwise, you get to write an alter script, then manually sync up your .py model with the result.

[Dec 14 note for redditors: this post was written in Feb 2006. Some of the commentary here on specific tools is out of date.]

20 comments:

Ian Bicking said...

I've been having a problem with some of my spam deletes catching other blog comments. Damn blog. Anyway, if it went missing that doesn't really mean anything.

You might be better off using an OODB, if OODBs were transparent and robust and well understood, and they are none of those.

sqlobject-admin relies on hand-written ALTER statements, and any other SQL you want to write. This disappoints some people who want it to be more automatic. But it is not particularly magic, and gives you the flexibility to manage most aspects of an update. Certainly it should (but does not) support Python upgrade code too (which alone might be enough for the people who complain about it now).

Jonathan Ellis said...

Ah, thanks for clearing up how sqlobject-admin works for me. Perhaps now my comments on the other post make more sense, in a misguided way. I'll add a note to the original post. Sorry about that.

(I think the Durus and ZODB guys might disagree about how robust etc. their products are, though. :)

mike bayer said...

SQLAlchemy assumes two general kinds of applications:

1. the "application-in-a-box". these are all the apps that we find on freshmeat and download, like the blogs, photo galleries, bulletin boards, etc., where you want to download it, connect it to mysql/postgres/sqlite, and it creates its tables. basically, small-scale "distributed" apps. For this application, the ORM does create the tables.

2. the "we are building an app for our company/startup/client/etc." application. For this, you get a DBA or similarly-skilled person to manage the database which is done via DDL. The ORM is generally not used to create any tables. Database tools that demand you only create database schemas that correspond to some least-common-denominator are strictly ruled out.

In case #2, you might use table reflection so that the DDL drives your app. SQLAlchemy has support for table reflection (still a TODO for Oracle) strictly because there seems to be an overwhelmingly huge desire for this feature.

However, I am not a fan of this feature, since the whole point of an ORM is decoupling. The DBA can add a column "foo", and all of a sudden your reflected tables throw a "foo" attribute into your classes and blow away some other usage of "foo" that was already there, creating a very mysterious long-chained cause-and-effect that is certain to ruin your night when discovered at 2 AM. That's just the kind of tight coupling that you don't want in an enterprise environment.

Jonathan Ellis said...

Right, you don't always want all columns from table_a to show up in ClassA. I don't think I implied otherwise, but if I did, that's not what I meant. :)

The virtue of DRY in this example is that, when you DO want those columns, the programmer shouldn't have to repeat "foo CharField(20)" to mimic "foo varchar(20)". If you want column foo, the ORM should introspect it: less of a PITA, and more importantly, less chance of changes being made in one place but not the other.
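
(A sketch of that, again with present-day SQLAlchemy reflection syntax: include_columns limits reflection to an explicit subset, which also answers Mike's coupling worry, since a column the DBA adds later can't sneak into the class.)

    # Reflect only the columns the class actually wants; anything added
    # to the table later is ignored rather than injected into the model.
    from sqlalchemy import MetaData, Table, create_engine

    engine = create_engine("sqlite:///app.db")  # hypothetical database
    metadata = MetaData()

    users = Table("users", metadata,
                  autoload_with=engine,
                  include_columns=["id", "name"])  # explicit subset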

Tim Lesher said...

"SQL DDL is a standard."

That would be true if the RDBMS world used the word "standard" the way the rest of us do. Unfortunately, every RDBMS vendor implements a slightly different flavor of SQL, including DDL.

For example: how many ways are there to define a sequentially-increasing integer value in a schema?

IMHO, what DB vendors have done to the SQL standard makes the Microsoft/Netscape HTML wars look like a pillow fight.

Jonathan Ellis said...

While you can find examples of incompatibility -- fewer all the time; the trend is towards more compatibility, not less -- arguing that it's okay to invent incompatible Python APIs because not all databases adhere to the ANSI standard reminds me strongly of the Perl zealot's favorite defense: "you can write bad code in any language, so Perl is really no different."

While this is literally true, practically speaking it's a hell of a lot easier to write crap Perl than to write crap Python. And SQL DDL is a hell of a lot more standard than whatever the flavor-of-the-month Python ORM has come up with.

(BTW, CREATE SEQUENCE is part of the ANSI standard these days. I believe at least PostgreSQL, Oracle, and DB2 comply. That's at least 3 times the interoperability you'll find with the Python APIs.)

Adrian Holovaty said...

Hey man,

Coupla responses to your points on Django:

1. The lack of runtime introspection is intentional because, in my opinion, the overhead is completely unnecessary. I cannot see justifying database introspection every time the Web server starts. (Not to mention the auto-pluralization voodoo that Ruby on Rails does *every* time the server starts!) Yuck, yuck, yuck.

A fair analogy: If a section on a high-traffic Web site's home page only changes every month, it'd be silly to generate that section of the page each time the page is requested. Cache that, man. Cache it.

2. Having used Django in production for more than two years, I've never had a big problem with writing ALTER TABLE statements and updating Python model code. It's a non-issue. Personally, I agree with you that generating ALTER TABLE statements is too complex a problem to automate fully, but there's still a bit of demand for that in the Django community, so we might come up with some sort of half-automated solution.

Also, you didn't mention an advantage of defining data schemas entirely in Python: being able to recreate the schema in multiple database backends automatically. For example, with Django models I've set up quick test environments in SQLite in-memory databases.

Adrian Holovaty said...

One more quick comment: Clearly beauty is subjective. In my opinion, it's ugly to keep *some* of your data definition in SQL and *some* of it in Python. I'd much rather it be in one place -- Python.

In the eye of this beholder, keeping it all in one place is more beautiful. :)

Jonathan Ellis said...

Adrian, thanks for the response.

Performance is a fair issue to address, but is this real, customer-driven optimization? Seems to me that if your web server is restarting every couple of minutes, something is seriously wrong. OTOH, if the server stays up for even just hours, the amortized introspection cost is basically zero. Scanning the system catalogs is FAST if done correctly.

Even if you did have a db so slow, or with so many hundreds of tables, that this was a practical problem rather than a theoretical one, caching should be the ORM's job, not the programmer's. I.e., give the server an option to cache on disk and use the introspection data from the last run, rather than putting the burden of maintaining the cache on the programmer in the form of manually maintained .py models.
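
(Something like this sketch; introspect_schema() is a stand-in for whatever introspection routine the ORM provides, and the cache location is arbitrary:)

    # Amortize introspection across restarts: cache the catalog scan on
    # disk and reuse it until it goes stale.
    import os, pickle, time

    CACHE_PATH = "schema.cache"

    def load_schema(conn, introspect_schema, max_age=3600):
        if (os.path.exists(CACHE_PATH)
                and time.time() - os.path.getmtime(CACHE_PATH) < max_age):
            with open(CACHE_PATH, "rb") as f:
                return pickle.load(f)
        schema = introspect_schema(conn)  # the (relatively) slow part
        with open(CACHE_PATH, "wb") as f:
            pickle.dump(schema, f)
        return schema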

I'll concede that "easy db independence" has its uses. I think Mike's "application-in-a-box" classification is spot-on here. But these apps are probably better served by just sticking with an embedded db like SQLite or Durus, so it seems wrong to design an ORM around this principle instead of making things easier for larger apps that are pushing the limits of the db.

Josh Berkus posted a brief presentation on db independence a while ago. It's worth a read if you haven't seen it: www.powerpostgresql.com/Downloads/database_depends_public.sxi

Jonathan Ellis said...

Well, before I address the point of beauty, it's probably worth making sure we're using the same terminology.

I've tried to use "schema" to refer to "relational model of the data" and "model" to refer to the Python classes representing it.

It sounds like you're calling both of these "data definition," and saying that you can increase beauty even more by generating the schema from the Python model.

But this is a false promise of beauty. You can't get away from DDL, not if you're going to use any of the features that make a relational database worth using. So you might as well embrace it instead of reinventing it poorly. And by poorly I mean that it does the job less well than DDL in the other areas I posted about.

Ian Bicking said...

"""(I think the Durus and ZODB guys might disagree about how robust etc. their products are, though. :)"""

I'm not sure they really would. I think they might feel their products would compare well to an RDBMS-backed object database, sure -- for instance, ZODB compared to APE with an RDBMS backend. I don't think I'd argue that's substantially more robust, transparent, or better understood than ZODB (probably the opposite).

But if you are willing to constrain your application data to features that map well to an RDBMS -- and that's what all these tools encourage -- then I doubt OODB developers would disagree that the more traditional relational database beats them in terms of transparency and robust operation in the face of all the possible (and likely) errors, including application programming and deployment errors.

Ian Bicking said...

Another note: while many people think database independence isn't a big deal, or isn't a realistic goal, these people aren't writing open source software. If I didn't have a database DDL abstraction layer, I'd have to write one anyway just to handle the tests. So why not offer it?

Anyway, you can use Col() in SQLObject (instead of StringCol(), etc.) if you want to enumerate the columns without specifying their types. This generally works fine, and is nearly equivalent in most cases. Some columns, like PickleCol, do extra work on the Python side. Others, like BoolCol, convert Python boolean semantics to the database -- using BOOLEAN types on Postgres is a total pain in the butt otherwise, because Postgres is so picky about its booleans (it's a little bizarre, the choices they made there).

Jonathan Ellis said...

"While many people think database independence isn't a big deal, or isn't a realistic goal, these people aren't writing open source software."

Wow, that strikes me as having absolutely no basis in reality.

The more a developer knows how to use a database's strengths, the less likely he is to throw that away to satisfy the lowest common denominator. Open source-ness or lack thereof has nothing to do with that.

Jonathan Ellis said...

... if anything, commercial software is *more* motivated to code for db independence, because if a customer insists on database XYZ -- and some IT departments most certainly do -- that's one less potential sale.

Ian Bicking said...

Sorry, I'll clarify -- if they think database portability doesn't matter, then they probably aren't writing viable open source libraries or applications that are likely to garner a substantial base of users. Unless they are writing to MySQL (or maybe SQLite), in which case they have a chance, because those represent close to the lowest common denominator, and so database portability can be applied later.

And maybe the same is true of commercial products (though often not, from what I've seen, but it depends on the market and how the product is likely to be deployed). Many of the people who think database portability is a waste of time are people who are doing custom development, where the target database is known and fixed. To them, database portability means that someone is trying to hedge their bets in case there is a future move to another database. That's probably not a good hedge to make, but that's only one use case for portability.

Also, in the past few years, SQLite has opened up an opportunity to embed your database, in the way non-portable commercial products often embed a database (either literally embedding it, or presenting a unified product and deployment process that psychologically embeds it). I've seen that cause some problems for open source projects when they later try to move to a better-performing backend; but it doesn't stop them from getting off the ground like other database dependencies can.

Wade Leftwich said...

I'm using SQLAlchemy to build a slightly complicated (13 tables and counting) application. At some point I expect to be writing DDL, but for now I'm getting a productivity boost from defining my tables in the same language as my classes. When it comes time to tweak the constraints and define the indexes, I'll have a nice skeleton to work with.

So I don't think it's either DDL or the Python tool. DRY is a worthy aspiration, but some repetition is unavoidable, since at some level your application has to know something about your database structure. SQLObject and SQLAlchemy (the only two I have looked at) help you keep all that database knowledge together in one place, in an organized way.

Jacob Smullyan said...

A couple of points.

1) Writing database-independent apps is really a special, if important and growing, niche in the open-source application market. I think it is still true that most apps are private affairs written for clients with particular needs; database independence is not usually one of those needs, as Ian pointed out, while permitting other applications, possibly written in other languages, to access the same database in a safe and meaningful manner often is (although perhaps less often than it was). Also, many applications need to be written against a pre-existing database schema. Some ORMs aim to work in such projects (PyDO does); others are better suited for new projects (SQLObject, I believe).

2) Although I don't myself use the introspective capabilities of PyDO -- I prefer explicit to DRY in this area -- when I implemented it for PyDO, I did precisely what Jonathan mentions here: the results are cached by default, so multiple runs/processes won't be bogged down in repetitive introspection queries. That makes introspection more tolerable in practice.

3) The religious war about whether, in the abstract, logic should live in the db, and whether apps should be written for specific databases or against a higher-level abstraction, is hereby declared won by (n)either side; anyone who perpetuates this pointless argument should be pickled, inserted into a blob, and rolled back. (Lucky for you, it was an ISAM table!)

Simon Willison said...

There's a lot to be said for the convenience factor of having your database tables derived from your Python models. Django makes it really easy to work with large numbers of database tables -- sites like www.lawrence.com can have upwards of 200 tables due to the immense amount of detailed data and metadata they contain. Not having to hand-craft your DDL and then map it to your models can have a dramatic effect on your productivity.

Of course, what's true for a relatively straightforward, data-driven website isn't necessarily true for a deeply relational reporting application -- which is why Django doesn't force you to use its ORM. If you need detailed control over your schema, you can drop in something like SQLAlchemy or code against a low-level DB API.

Jonathan Ellis said...

For the reasons I gave, it's better to implement DRY by introspecting the database than by generating DDL from Python.

James Bennett said...

While I sympathize with a lot of what you're saying here -- I've harped for a while on the fact that ORMs aren't and shouldn't be considered replacements for SQL -- I think there are some things which need to be pointed out in response to your claims about Django:

1. Resetting the database to do migrations: it's one technique, not the only technique. I use it all the time when I'm mocking things up (though usually I'm doing that on SQLite, so I just delete the DB file; once I've stabilized my schema, I move to a real database and write ALTER TABLE statements if I need to change things). I go back and forth on whether it's a bad idea for frameworks to try to make schema changes easy -- in my experience, that encourages bad database design (or a lack of any database design at all).

2. "Modern database features": it's a tradeoff. You can have compatibility and portability, or you can have the latest features from FooRDBMS. Unfortunately, this is the reality on the ground right now; yes, things are improving, yes we have hope for the future, but at the present moment the idea of "compatible DDL" is still just an idea. We've been forced into the nasty situation of having to save some DBs from themselves by moving bits of referential integrity checking into the framework layer because of this -- MySQL in particular is problematic, since the default settings on most low- and mid-level hosting setups rule out having the DB check foreign-key constraints for us. Meanwhile, nothing stops you, right now, from using various advanced features with Django -- the framework will be more than happy to spit out a DDL file for you, tailored to the RDBMS you've specified, which you can tweak to your heart's content before applying it. Triggers, constraints, you name it -- you can have it. I know of people who are doing that right now.

Defining the data model -- as much of it as possible, at least -- in Python is another tradeoff; you may not think it's as pure or as elegant as writing DDL, but it makes Django applications much more portable and much easier to distribute. And, in the Python world, practicality famously trumps purity.

3. "No runtime introspection": runtime introspection to get the schema is going to be problematic at best, thanks to MySQL. And, unfortunately, MySQL has the lion's share of an important market for frameworks like Django (this is, I suspect, why Rails will introspect to get the list of columns and data types, but needs you to manually annotate model classes with data about relationships). So, as far as I can make out, introspection would only be useful for detecting, at runtime, that the schema has changed. And that should never happen; if the schema doesn't match what the application expects, the application should be throwing errors. Result: Django offers the ability to introspect up-front as you write the application, which is where introspection is actually useful.

4. The "Application Database": like it or not, most ORMs in popular web frameworks today are largely just persistence mechanisms, not the full-fledged first-class relational calculus SQL people have come to know and love. Pretty much all of them run into impedance mismatch sooner or later because of this (most notoriously when it comes to persisting hierarchies of derived classes), but it's really about the best option available right now. Object databases aren't there yet in terms of availability (Dreamhost and other hosts in its tier aren't offering object databases, and probably won't for a long time, if ever), even if they are ready in terms of functionality and ease of use. Django's ORM goes out of its way not to be too tied to the idea of SQL as the backend (see some of the recent discussion on proposals for aggregates for hints of this), but for now an RDBMS makes a pretty decent object store.