Monday, November 03, 2008

An unusual approach to log parsing

I saw an interesting article about logging today on reddit, and it struck a nerve with me, specifically how most text logs are not designed for easy parsing.  (I don't agree with the second point, though -- sometimes logging is tracing, or perhaps more accurately, sometimes tracing is logging.)

We had a lot of log and trace data at Mozy, several GB per day.  A traditional parsing approach would have been tedious and prone to regressions when the messages generated by the server changed.  So Paul Cannon, a frustrated lisp programmer, designed a system where the log API looked something like this:

self.log('command_reply(%r, %r)' % (command, arg))

Then the log processor would define the vocabulary (command_reply, etc.) and, instead of parsing the log messages, eval them!   This is an approach that wouldn't have occurred to me, nor would I have thought of using partial function application to simplify passing state (from the log processor and/or previous log entries) to these functions.  (e.g., the entry for command_reply in the eval namespace might be 'command_reply': partial(self.command_reply, db_cursor, thread_id))

There are drawbacks to this approach; perhaps the largest is that this works best in homogeneous systems.  Python's repr function (invoked by the %r formatting character) is great at taking care of any quoting issues necessary when dealing with Python primitives, as well as custom objects with some help from the programmer.  But when we started having a C++ system also log messages to this system, it took them several tries to fix all the corner cases involved in generating messages that were valid python code.

On balance, I think this un-parsing approach was a huge win, and as the first application of "code is data" that made more than theoretical sense to me it was a real "eureka!" moment.

6 comments:

Anonymous said...

Not to start a flame war, but this is interesting, (I have a patch for Windows logs):

http://freshmeat.net/projects/epylog/

kb@penguinpackets.com

Anonymous said...

Have you considered using OLAP for logging? It is optimized for writing data.

hitesh said...

... don't forget to consider malicious code injection.

Jonathan Ellis said...

hitesh: using repr prevents any such problems.

Paddy3118 said...

Hi Jonathan,
A problem when you write 'eval'-able logs is that I find that you never know at time of writing, what program/language/script is going to have to read your log in the future.

On Unix environments, where I tend to do such work, I have found that useful tests are to think:

* "Is this log easily manipulated by an AWK script".

* "Is the record and field structure of this log easily identifiable without the source to the generator or other documentation".

Easy AWKability is a good test (if you know AWK), as AWK is a fairly simple scripting language, one step up from good Unix shells such as bash. If you learn AWK you will know that it likes one record equals one line, space separated fields within a line, and first field(s) showing the type of the record. Most other programming languages find it easy to handle logs of this form too.

To satisfy my second point I sometimes see, and use myself, comment records. Comment records usually start with a distinctive field name such as '#' and I usually restrict them to the head of a file. Just because it is a comment record, it might still NOT be a free-form line of text after the initial field specifier, as such comment fields can be used for parse-able header information.

- Paddy.

Gregg Lind said...

I've become a big fan of json for things like this as well.

Usually, the output message is some sort of dict-like thing anyway (with keys: TYPE, MESSAGE, TIME and the like), and json makes it easy to print that out in an cross-program-readable way (since we have perl, python, and god knows what-else looking at them).

This fails the AWK-able test, but it's still grep-able. It certainly is more adaptable than any other delimited solution I came up with, and much easier to handle strings with than many other delimited formats.