
An unusual approach to log parsing

I saw an interesting article about logging today on reddit, and it struck a nerve with me -- specifically, the point that most text logs are not designed for easy parsing. (I don't agree with the second point, though -- sometimes logging is tracing, or perhaps more accurately, sometimes tracing is logging.)

We had a lot of log and trace data at Mozy, several GB per day.  A traditional parsing approach would have been tedious and prone to regressions when the messages generated by the server changed.  So Paul Cannon, a frustrated lisp programmer, designed a system where the log API looked something like this:

self.log('command_reply(%r, %r)' % (command, arg))

Then the log processor would define the vocabulary (command_reply, etc.) and, instead of parsing the log messages, eval them! This is an approach that wouldn't have occurred to me, nor would I have thought of using partial function application to simplify passing state (from the log processor and/or previous log entries) to these functions. (E.g., the entry for command_reply in the eval namespace might be 'command_reply': partial(self.command_reply, db_cursor, thread_id).)
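A minimal sketch of the scheme, to make the idea concrete. The names here (LogProcessor, worker-1, the single-argument state) are hypothetical, invented for illustration -- not Mozy's actual code:

```python
from functools import partial

class LogProcessor:
    def __init__(self):
        self.replies = []

    def command_reply(self, source, command, arg):
        # 'source' arrives via partial; 'command' and 'arg' come from the log line
        self.replies.append((source, command, arg))

    def process(self, lines):
        # define the vocabulary: each name in the namespace is a handler,
        # with processor-side state pre-bound via partial
        namespace = {'command_reply': partial(self.command_reply, 'worker-1')}
        for line in lines:
            # the log line is a valid Python expression, so eval *is* the parser
            eval(line, {'__builtins__': {}}, namespace)

# the writer side emits expressions, with %r handling all quoting:
log_lines = ['command_reply(%r, %r)' % ('GET', '/index.html')]
proc = LogProcessor()
proc.process(log_lines)
```

Since %r produces repr output on the writer side, the reader side never has to worry about embedded quotes or escapes in string arguments.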

There are drawbacks to this approach; perhaps the largest is that it works best in homogeneous systems. Python's repr function (invoked by the %r format specifier) is great at taking care of any quoting issues necessary when dealing with Python primitives, as well as custom objects with some help from the programmer. But when we started having a C++ system also log messages to this system, it took them several tries to fix all the corner cases involved in generating messages that were valid Python code.

On balance, I think this un-parsing approach was a huge win, and as the first application of "code is data" that made more than theoretical sense to me, it was a real "eureka!" moment.

Comments

Anonymous said…
Not to start a flame war, but this is interesting (I have a patch for Windows logs):

http://freshmeat.net/projects/epylog/

kb@penguinpackets.com
Anonymous said…
Have you considered using OLAP for logging? It is optimized for writing data.
hitesh said…
... don't forget to consider malicious code injection.
Jonathan Ellis said…
hitesh: using repr prevents any such problems.
Paddy3118 said…
Hi Jonathan,
A problem when you write 'eval'-able logs is that you never know, at the time of writing, what program/language/script is going to have to read your log in the future.

On Unix environments, where I tend to do such work, I have found that useful tests are to think:

* "Is this log easily manipulated by an AWK script".

* "Is the record and field structure of this log easily identifiable without the source to the generator or other documentation".

Easy AWKability is a good test (if you know AWK), as AWK is a fairly simple scripting language, one step up from good Unix shells such as bash. If you learn AWK you will know that it likes one record equals one line, space separated fields within a line, and first field(s) showing the type of the record. Most other programming languages find it easy to handle logs of this form too.

To satisfy my second point I sometimes see, and use myself, comment records. Comment records usually start with a distinctive field name such as '#' and I usually restrict them to the head of a file. Just because it is a comment record, it might still NOT be a free-form line of text after the initial field specifier, as such comment fields can be used for parse-able header information.

- Paddy.
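Paddy's format -- one record per line, space-separated fields, record type first, '#' comment records at the head -- might be read like this (a sketch; the sample record types are invented):

```python
def parse_awkable(lines):
    # one record per line; first field names the record type,
    # the rest are that record's fields (the format AWK handles natively)
    records = []
    for line in lines:
        if line.startswith('#'):
            continue  # comment/header record, restricted to the head of the file
        fields = line.split()
        records.append((fields[0], fields[1:]))
    return records

log = [
    '# version 1',
    'command_reply GET /index.html',
]
records = parse_awkable(log)
```

The same split-on-whitespace, dispatch-on-first-field shape works in essentially any language, which is the point of the AWKability test.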
Gregg Lind said…
I've become a big fan of json for things like this as well.

Usually, the output message is some sort of dict-like thing anyway (with keys: TYPE, MESSAGE, TIME and the like), and json makes it easy to print that out in a cross-program-readable way (since we have perl, python, and god knows what-else looking at them).

This fails the AWK-able test, but it's still grep-able. It certainly is more adaptable than any other delimited solution I came up with, and much easier to handle strings with than many other delimited formats.
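The JSON-per-line approach Gregg describes might look like this (a sketch; the field names are just the examples from his comment):

```python
import json

def log_line(**fields):
    # one JSON object per line: not AWK-able, but grep-able, and readable
    # from any language with a JSON parser
    return json.dumps(fields, sort_keys=True)

line = log_line(TYPE='command_reply', MESSAGE='GET /index.html', TIME=0)
record = json.loads(line)  # the reader side is a one-liner, no custom parsing
```

Like the eval approach, this pushes all quoting and escaping onto a serializer instead of ad-hoc string formatting, so embedded spaces and quotes in values are never a problem.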
