My current employer keeps a lot of application metadata as xml in the database, and I needed to make a simple update to some of that. Writing a client program to do this would be immense overkill, as well as not playing nicely with my database auto-upgrade script. The update could be easily handled with a regular expression substitution, but although postgresql has a decent regular expression implementation built-in, it has no function to do replacement with them. [Update: regex_replace was added in version 8.1, six months after this post.)
An underused feature of PostgreSQL is its ability to define functions in just about any language you like. Java, Perl, Python, TCL, among others, as well as PL/PGSQL. Most developers, unfortunately, are a bit leery of defining custom functions in the database. Whether this is because of the learning curve (which turns out to be quite shallow), or because they are too used to inferior databases that don't allow such things, I couldn't say.
Anyway, creating a function that did what I needed was trivial.
-- pattern, original, replacement, flags
CREATE OR REPLACE FUNCTION regsub (VARCHAR, VARCHAR, VARCHAR, VARCHAR)
RETURNS VARCHAR AS '
import re
flags = 0
for flagchar in args[3]:
flags |= getattr(re, flagchar)
p = re.compile(args[0], flags)
return re.sub(p, args[2], args[1])
' LANGUAGE 'plpythonu'
WITH (ISCACHABLE);
-- pattern, original, replacement
CREATE OR REPLACE FUNCTION regsub (VARCHAR, VARCHAR, VARCHAR) RETURNS VARCHAR AS '
select regsub($1, $2, $3, '''');
' LANGUAGE 'sql'
WITH (ISCACHABLE);
Some things to note:
- Postgresql 8.0 supports giving function variables names in the function definition statement, but the plpython handler doesn't know about this. (Anyone want to fix this? I just don't have the time.) So I'm stuck with the old-style args list.
- Callers can't access the re namespace to grab the contants out as ints, so flags are passed as a concatenated string ('IS' = IGNORECASE + DOTALL, etc.) and turned into an appropriate int with getattr.
- Speaking of that, this is the only time I have ever used the string iterator in python. I figured that after getting burned by that so many times, I might as well get some use out of it for once!
- Postgresql 8.0 added the "dollar-quoting" feature to mitigate quoting headaches when defining functions. Unfortunately our production server will be running 7.4 for the forseeable future, so I stuck with escaping my quotes in the second function the old-fashioned way.
- The second function isn't plpython: only SQL and PL/PGSQL functions can call other database functions directly; everything else is restricted to their handler's namespace. Although plpython gives you a way to dip back into the database, the overhead of doing this isn't negligible, and for something simple like this it's easier to just declare an SQL function anyway.
- If you don't have plpython installed in your master template, chances are you'll need to run createlang from the commandline (or CREATE LANGUAGE from psql) before trying this at home.

9 comments:
Inferior, eh?
MySQL Reference Manual :: 27.2 Adding New Functions to MySQL
This has been around since the 4.0 series, so it's not exactly new.
If you find writing MySQL UDF in C/C++ user-friendly and convenient, then have at it. :)
... please feel free to rewrite the regsub I posted as a MySQL UDF as a counterpoint. I think that would be most instructive. (It's only fair to use Boost regex or whatever your weapon of choice is, since python has that built in.)
On the other hand, were you both a MySQL fanatic and a Python fanatic, I suppose you could attempt a SWIG or boost::python hack. But why?
Might I suggest using meaningful names in the code rather than indexing args?:
pattern, original, replacement, flags = args
You're right, of course -- but how much urgency would fixing the plpython handler have then? :)
Some developers are leery because of a philosophy of separating code from data. There should be one place for procedures (python code, usually in a directory full of .py files) and one place for SQL commands (usually some other directory of .sql files, or *well-documented* places inside the .py files)
Mixing is Bad.
When you get more experienced, you prefer to have a separation along the lines of: data manipulations belongs in the db; app logic belongs in your external process.
That is so great!
It's also possible to run python in parallel on SMP: Parallel Python
Post a Comment