Skip to main content

Why schema definition belongs in the database

Earlier, I wrote about how ORM developers shouldn't try to re-invent SQL. It doesn't need to be done, and you're not likely to end up with an actual improvement. SQL may be designed by committee, but it's also been refined from thousands if not millions of man-years of database experience.

The same applies to DDL. (Data Definition Langage -- the part of the SQL standard that deals with CREATE and ALTER.) Unfortunately, a number of Python ORMs are trying to replace DDL with a homegrown Python API. This is a Bad Thing. There are at least four reasons why:

  • Standards compliance
  • Completeness
  • Maintainability
  • Beauty

Standards compliance

SQL DDL is a standard. That means if you want something more sophisticated than Emacs, you can choose any of half a dozen modeling tools like ERwin or ER/Studio to generate and edit your DDL.

The Python data definition APIs, by contrast, aren't even compatibile with other Python tools. You can't take a table definition from SQLObject, PyDO2, Django, or SQLAlchemy, and use it with any of the others.

A quote from the django site:

Our philosophy is that the model (in Python code) is the definitive place to store all model logic.

Yes, application logic belongs in application code. But the definitive source for the schema definition should be the database, unless you're using an object database like Durus or ZODB. Of course, the reason those (and OODBs in general) haven't taken off is that except for very simple applications, people want to access their data from more than just their Python code. So respect that. (Or be honest and require your OODB of choice.) Encourage standards use instead of a proprietary API. Let the database be the source of truth for the schema, and use standard tools to define it.

Completeness

Another strength of relational databases is the ability to define data integrity rules in a declarative fashion. It's well-understood by now that declarative code is much less error prone (and quicker to write) than the procedural equivalents. So if you're going to re-invent DDL, you need to support ON DELETE and ON UPDATE clauses. You need to support REFERENCES. You need to support CHECK constraints. You need to support defining triggers.

I don't know any ORM tool that allows doing all this from Python. So selling your tool as "you don't need to know DDL, just use our API" isn't doing users any favors.

It's okay to say, "here's our API; it lets you do a subset of what your database can handle so you can jump right in and get started without learning DDL." There's no harm in this. But when your tool is designed such that it doesn't expose the power of your database, but it doesn't really work with externally defined schemas either, that's a Bad Thing.

Maintainability

Even the best-designed schema will need to be updated eventually. If your project lives long enough, this is 100% certain. So if you're going to replace DDL, you'd better have an upgrade strategy.

But an automatic approach that only knows that you want to update from schema A to schema B is doomed to failure. [Update: I thought that sqlobject-admin took this approach, but Ian Bicking corrected me.] Say you add a column to a table: what should the value be for existing rows? Sometimes NULL. Sometimes some default value. Sometimes the value should be derived from data in other tables. There's no way to automate this for all situations, and usually sooner than later, you're back to DDL again.

Instead of spending effort on a fundamentally flawed approach, better to encourage standard best practices: the "right way" to maintain databases, that everyone who works on them enough settles on eventually, is DDL scripts, checked into source control. Old-fashioned, but if you stick to it, you'll never have a situation where you start an upgrade on your live server and run into a problem halfway through, because you've already run the exact scripts on your test server. A good ORM design accommodates this, rather than making it difficult.

Beauty

Okay, maybe DDL isn't the most beautiful creature ever birthed by a standards committee. But a lot of things are less beautiful, and those are what you get when you try to design DDL out.

  • Re-inventing the wheel is not beautiful. Like the django guys said (about templates), "don't invent a programming language." Right idea. Spend that energy doing something useful instead.
  • Violating DRY isn't beautiful. As decribed above, your users will need DDL at some point. When that happens, are you going to make their lives harder by forcing them to update their DDL-ish model in a separate .py file as well (with all the attendant possibilities for mistakes), or are you going to make them easier with an option to simply introspect the changes?

    (It's true that an ORM tool can't divine everything you want to say about your model on the Python side from the database. This is particularly true for SQLAlchemy, which lets you go beyond the simple "one table, one class" paradigm. But that's no reason to force the programmer to duplicate the parts that an ORM can and should introspect: column types, foreign keys, uniqueness and other constraints, etc.)

  • Treating the database like a slightly retarded object store is not beautiful. Even MySQL supports (simple) triggers and most constraint types these days. Allow users to realize the power afforded by a modern RDBMS. If your ORM encourages users to avoid features that, say, MySQL3 doesn't have, you're doing something wrong.

Conclusion

Avoid the temptation to re-invent the wheel. Respect your users, and allow them to continue to use industry-standard schema specification tools. Encourage using the strengths of modern relational databases instead of ignoring them. Don't require behavior that will cause your users pain down the road.

I mentioned 4 ORM tools near the beginning of this post. Here's how I see them with respect to this discussion:

  • PyDO2: The purest "let the database specify the schema" approach of the four. Supports runtime introspection. Does not generate DDL from its models; if you manually specify column types, it's probably because you only want a subset of the table's columns to show in your class.
  • SQLAlchemy: Allows generation of DDL from .py code but does not require it nor (in my reading) encourage this for production use. Robust runtime introspection.
  • SQLObject: supports runtime introspection (buggy when I used it, but may be better now). Python-based API does not support modern database features. (In a deleted comment here -- still viewable at this writing through Google Cache -- Ian Bicking says that SQLObject prefers what Martin Fowler calls an Application Database. Which as near as I can tell means that SQLObject is fine if you would be better off using an OODB; otherwise, it may be a poor choice. Perhaps the deletion indicates he's had second thoughts on this.)
  • Django: the most clearly problematic. No runtime introspection support; schema must be specified in its python-based API, which does not support modern database features. (Apparently their approach -- paraphrased -- is, "if sucky database XXY doesn't support a feature, we won't support it for anyone.") Django's ORM does have an external tool for generating .py models from an existing database, but once you start making changes, well, if you don't mind clearing data, just pipe the output of the appropriate django-admin.py sqlreset command into your database's command-line utility. Otherwise, you get to write an alter script, then manually sync up your .py model with the result.

[Dec 14 note for redditors: this post was written in Feb 2006. Some of the commentary here on specific tools is out of date.]

Popular posts from this blog

A week of Windows Subsystem for Linux

I first experimented with WSL2 as a daily development environment two years ago. Things were still pretty rough around the edges, especially with JetBrains' IDEs, and I ended up buying a dedicated Linux workstation so I wouldn't have to deal with the pain.  Unfortunately, the Linux box developed a heat management problem, and simultaneously I found myself needing a beefier GPU than it had for working on multi-vector encoding , so I decided to give WSL2 another try. Here's some of the highlights and lowlights. TLDR, it's working well enough that I'm probably going to continue using it as my primary development machine going forward. The Good NVIDIA CUDA drivers just work. I was blown away that I ran conda install cuda -c nvidia and it worked the first try. No farting around with Linux kernel header versions or arcane errors from nvidia-smi. It just worked, including with PyTorch. JetBrains products work a lot better now in remote development mod...

Python at Mozy.com

At my day job, I write code for a company called Berkeley Data Systems. (They found me through this blog, actually. It's been a good place to work.) Our first product is free online backup at mozy.com . Our second beta release was yesterday; the obvious problems have been fixed, so I feel reasonably good about blogging about it. Our back end, which is the most algorithmically complex part -- as opposed to fighting-Microsoft-APIs complex, as we have to in our desktop client -- is 90% in python with one C extension for speed. We (well, they, since I wasn't at the company at that point) initially chose Python for speed of development, and it's definitely fulfilled that expectation. (It's also lived up to its reputation for readability, in that the Python code has had 3 different developers -- in serial -- with very quick ramp-ups in each case. Python's succinctness and and one-obvious-way-to-do-it philosophy played a big part in this.) If you try it out, pleas...

Why PHP sucks

(July 8 2005) Apparently I got linked by some PHP sites, and while there were a few well-reasoned comments here I mostly just got people who only knew PHP reacting like I told them their firstborn was ugly. These people tended to give variants on one or more themes: All environments have warts, so PHP is no worse than anything else in this respect I can work around PHP's problems, ergo they are not really problems You aren't experienced enough in PHP to judge it yet As to the first, it is true that PHP is not alone in having warts. However, the lack of qualitative difference does not mean that the quantitative difference is insignificant. Similarly, problems can be worked around, but languages/environments designed by people with more foresight and, to put it bluntly, clue, simply don't make the kind of really boneheaded architecture mistakes that you can't help but run into on a daily baisis in PHP. Finally, as I noted in my original introduction, with PHP, ...