Tuesday, April 12, 2005

plpython intro

My current employer keeps a lot of application metadata as xml in the database, and I needed to make a simple update to some of that. Writing a client program to do this would be immense overkill, as well as not playing nicely with my database auto-upgrade script. The update could be easily handled with a regular expression substitution, but although postgresql has a decent regular expression implementation built-in, it has no function to do replacement with them. [Update: regex_replace was added in version 8.1, six months after this post.)

An underused feature of PostgreSQL is its ability to define functions in just about any language you like. Java, Perl, Python, TCL, among others, as well as PL/PGSQL. Most developers, unfortunately, are a bit leery of defining custom functions in the database. Whether this is because of the learning curve (which turns out to be quite shallow), or because they are too used to inferior databases that don't allow such things, I couldn't say.

Anyway, creating a function that did what I needed was trivial.

-- pattern, original, replacement, flags
    import re
    flags = 0
    for flagchar in args[3]:
        flags |= getattr(re, flagchar)
    p = re.compile(args[0], flags)
    return re.sub(p, args[2], args[1])
' LANGUAGE 'plpythonu'

-- pattern, original, replacement
    select regsub($1, $2, $3, '''');
' LANGUAGE 'sql'
Some things to note:
  • Postgresql 8.0 supports giving function variables names in the function definition statement, but the plpython handler doesn't know about this. (Anyone want to fix this? I just don't have the time.) So I'm stuck with the old-style args list.
  • Callers can't access the re namespace to grab the contants out as ints, so flags are passed as a concatenated string ('IS' = IGNORECASE + DOTALL, etc.) and turned into an appropriate int with getattr.
  • Speaking of that, this is the only time I have ever used the string iterator in python. I figured that after getting burned by that so many times, I might as well get some use out of it for once!
  • Postgresql 8.0 added the "dollar-quoting" feature to mitigate quoting headaches when defining functions. Unfortunately our production server will be running 7.4 for the forseeable future, so I stuck with escaping my quotes in the second function the old-fashioned way.
  • The second function isn't plpython: only SQL and PL/PGSQL functions can call other database functions directly; everything else is restricted to their handler's namespace. Although plpython gives you a way to dip back into the database, the overhead of doing this isn't negligible, and for something simple like this it's easier to just declare an SQL function anyway.
  • If you don't have plpython installed in your master template, chances are you'll need to run createlang from the commandline (or CREATE LANGUAGE from psql) before trying this at home.


Farce Pest said...

Inferior, eh?

MySQL Reference Manual :: 27.2 Adding New Functions to MySQL

This has been around since the 4.0 series, so it's not exactly new.

Jonathan Ellis said...

If you find writing MySQL UDF in C/C++ user-friendly and convenient, then have at it. :)

Jonathan Ellis said...

... please feel free to rewrite the regsub I posted as a MySQL UDF as a counterpoint. I think that would be most instructive. (It's only fair to use Boost regex or whatever your weapon of choice is, since python has that built in.)

Tim Lesher said...

On the other hand, were you both a MySQL fanatic and a Python fanatic, I suppose you could attempt a SWIG or boost::python hack. But why?

Hamish said...

Might I suggest using meaningful names in the code rather than indexing args?:

pattern, original, replacement, flags = args

Jonathan Ellis said...

You're right, of course -- but how much urgency would fixing the plpython handler have then? :)

Anonymous said...

Some developers are leery because of a philosophy of separating code from data. There should be one place for procedures (python code, usually in a directory full of .py files) and one place for SQL commands (usually some other directory of .sql files, or *well-documented* places inside the .py files)

Mixing is Bad.

Jonathan Ellis said...

When you get more experienced, you prefer to have a separation along the lines of: data manipulations belongs in the db; app logic belongs in your external process.

Anonymous said...

That is so great!
It's also possible to run python in parallel on SMP: Parallel Python

Echo said...

This is exactly what I was looking for! Thanks so much.


Anonymous said...

Rock on, Ellis. I've been evaluating web application frameworks for ages and am not happy with the way any of them separate the model, view, and controller.

IMHO, if the model involves a DBMS then all model related work should be handled by the DBMS.

Xual said...

Great example.

While separating code from data is a good philosophy I have to say stuff like this is very relevant. You simply do not get to choose some of the projects you inherit and its never as straight forward as you want it to be.

If we all used SQL alchemy or the likes we certainly could separate our code and data very easily and perform all kinds of magic when we want.

Well done Mr Ellis.

Anonymous said...

This was EXACTLY what I needed for figuring out how to call a plpython function from another function.

Pretty great this is still a relevant post 7 years later!