Monday, February 06, 2006

Why schema definition belongs in the database

Earlier, I wrote about how ORM developers shouldn't try to re-invent SQL. It doesn't need to be done, and you're not likely to end up with an actual improvement. SQL may be designed by committee, but it's also been refined from thousands if not millions of man-years of database experience.

The same applies to DDL. (Data Definition Langage -- the part of the SQL standard that deals with CREATE and ALTER.) Unfortunately, a number of Python ORMs are trying to replace DDL with a homegrown Python API. This is a Bad Thing. There are at least four reasons why:

  • Standards compliance
  • Completeness
  • Maintainability
  • Beauty

Standards compliance

SQL DDL is a standard. That means if you want something more sophisticated than Emacs, you can choose any of half a dozen modeling tools like ERwin or ER/Studio to generate and edit your DDL.

The Python data definition APIs, by contrast, aren't even compatibile with other Python tools. You can't take a table definition from SQLObject, PyDO2, Django, or SQLAlchemy, and use it with any of the others.

A quote from the django site:

Our philosophy is that the model (in Python code) is the definitive place to store all model logic.

Yes, application logic belongs in application code. But the definitive source for the schema definition should be the database, unless you're using an object database like Durus or ZODB. Of course, the reason those (and OODBs in general) haven't taken off is that except for very simple applications, people want to access their data from more than just their Python code. So respect that. (Or be honest and require your OODB of choice.) Encourage standards use instead of a proprietary API. Let the database be the source of truth for the schema, and use standard tools to define it.

Completeness

Another strength of relational databases is the ability to define data integrity rules in a declarative fashion. It's well-understood by now that declarative code is much less error prone (and quicker to write) than the procedural equivalents. So if you're going to re-invent DDL, you need to support ON DELETE and ON UPDATE clauses. You need to support REFERENCES. You need to support CHECK constraints. You need to support defining triggers.

I don't know any ORM tool that allows doing all this from Python. So selling your tool as "you don't need to know DDL, just use our API" isn't doing users any favors.

It's okay to say, "here's our API; it lets you do a subset of what your database can handle so you can jump right in and get started without learning DDL." There's no harm in this. But when your tool is designed such that it doesn't expose the power of your database, but it doesn't really work with externally defined schemas either, that's a Bad Thing.

Maintainability

Even the best-designed schema will need to be updated eventually. If your project lives long enough, this is 100% certain. So if you're going to replace DDL, you'd better have an upgrade strategy.

But an automatic approach that only knows that you want to update from schema A to schema B is doomed to failure. [Update: I thought that sqlobject-admin took this approach, but Ian Bicking corrected me.] Say you add a column to a table: what should the value be for existing rows? Sometimes NULL. Sometimes some default value. Sometimes the value should be derived from data in other tables. There's no way to automate this for all situations, and usually sooner than later, you're back to DDL again.

Instead of spending effort on a fundamentally flawed approach, better to encourage standard best practices: the "right way" to maintain databases, that everyone who works on them enough settles on eventually, is DDL scripts, checked into source control. Old-fashioned, but if you stick to it, you'll never have a situation where you start an upgrade on your live server and run into a problem halfway through, because you've already run the exact scripts on your test server. A good ORM design accommodates this, rather than making it difficult.

Beauty

Okay, maybe DDL isn't the most beautiful creature ever birthed by a standards committee. But a lot of things are less beautiful, and those are what you get when you try to design DDL out.

  • Re-inventing the wheel is not beautiful. Like the django guys said (about templates), "don't invent a programming language." Right idea. Spend that energy doing something useful instead.
  • Violating DRY isn't beautiful. As decribed above, your users will need DDL at some point. When that happens, are you going to make their lives harder by forcing them to update their DDL-ish model in a separate .py file as well (with all the attendant possibilities for mistakes), or are you going to make them easier with an option to simply introspect the changes?

    (It's true that an ORM tool can't divine everything you want to say about your model on the Python side from the database. This is particularly true for SQLAlchemy, which lets you go beyond the simple "one table, one class" paradigm. But that's no reason to force the programmer to duplicate the parts that an ORM can and should introspect: column types, foreign keys, uniqueness and other constraints, etc.)

  • Treating the database like a slightly retarded object store is not beautiful. Even MySQL supports (simple) triggers and most constraint types these days. Allow users to realize the power afforded by a modern RDBMS. If your ORM encourages users to avoid features that, say, MySQL3 doesn't have, you're doing something wrong.

Conclusion

Avoid the temptation to re-invent the wheel. Respect your users, and allow them to continue to use industry-standard schema specification tools. Encourage using the strengths of modern relational databases instead of ignoring them. Don't require behavior that will cause your users pain down the road.

I mentioned 4 ORM tools near the beginning of this post. Here's how I see them with respect to this discussion:

  • PyDO2: The purest "let the database specify the schema" approach of the four. Supports runtime introspection. Does not generate DDL from its models; if you manually specify column types, it's probably because you only want a subset of the table's columns to show in your class.
  • SQLAlchemy: Allows generation of DDL from .py code but does not require it nor (in my reading) encourage this for production use. Robust runtime introspection.
  • SQLObject: supports runtime introspection (buggy when I used it, but may be better now). Python-based API does not support modern database features. (In a deleted comment here -- still viewable at this writing through Google Cache -- Ian Bicking says that SQLObject prefers what Martin Fowler calls an Application Database. Which as near as I can tell means that SQLObject is fine if you would be better off using an OODB; otherwise, it may be a poor choice. Perhaps the deletion indicates he's had second thoughts on this.)
  • Django: the most clearly problematic. No runtime introspection support; schema must be specified in its python-based API, which does not support modern database features. (Apparently their approach -- paraphrased -- is, "if sucky database XXY doesn't support a feature, we won't support it for anyone.") Django's ORM does have an external tool for generating .py models from an existing database, but once you start making changes, well, if you don't mind clearing data, just pipe the output of the appropriate django-admin.py sqlreset command into your database's command-line utility. Otherwise, you get to write an alter script, then manually sync up your .py model with the result.

[Dec 14 note for redditors: this post was written in Feb 2006. Some of the commentary here on specific tools is out of date.]