Skip to main content

An unusual approach to log parsing

I saw an interesting article about logging today on reddit, and it struck a nerve with me, specifically how most text logs are not designed for easy parsing.  (I don't agree with the second point, though -- sometimes logging is tracing, or perhaps more accurately, sometimes tracing is logging.)

We had a lot of log and trace data at Mozy, several GB per day.  A traditional parsing approach would have been tedious and prone to regressions when the messages generated by the server changed.  So Paul Cannon, a frustrated lisp programmer, designed a system where the log API looked something like this:

self.log('command_reply(%r, %r)' % (command, arg))

Then the log processor would define the vocabulary (command_reply, etc.) and, instead of parsing the log messages, eval them!   This is an approach that wouldn't have occurred to me, nor would I have thought of using partial function application to simplify passing state (from the log processor and/or previous log entries) to these functions.  (e.g., the entry for command_reply in the eval namespace might be 'command_reply': partial(self.command_reply, db_cursor, thread_id))

There are drawbacks to this approach; perhaps the largest is that this works best in homogeneous systems.  Python's repr function (invoked by the %r formatting character) is great at taking care of any quoting issues necessary when dealing with Python primitives, as well as custom objects with some help from the programmer.  But when we started having a C++ system also log messages to this system, it took them several tries to fix all the corner cases involved in generating messages that were valid python code.

On balance, I think this un-parsing approach was a huge win, and as the first application of "code is data" that made more than theoretical sense to me it was a real "eureka!" moment.

Comments

Anonymous said…
Not to start a flame war, but this is interesting, (I have a patch for Windows logs):

http://freshmeat.net/projects/epylog/

kb@penguinpackets.com
Anonymous said…
Have you considered using OLAP for logging? It is optimized for writing data.
hitesh said…
... don't forget to consider malicious code injection.
Jonathan Ellis said…
hitesh: using repr prevents any such problems.
Paddy3118 said…
Hi Jonathan,
A problem when you write 'eval'-able logs is that I find that you never know at time of writing, what program/language/script is going to have to read your log in the future.

On Unix environments, where I tend to do such work, I have found that useful tests are to think:

* "Is this log easily manipulated by an AWK script".

* "Is the record and field structure of this log easily identifiable without the source to the generator or other documentation".

Easy AWKability is a good test (if you know AWK), as AWK is a fairly simple scripting language, one step up from good Unix shells such as bash. If you learn AWK you will know that it likes one record equals one line, space separated fields within a line, and first field(s) showing the type of the record. Most other programming languages find it easy to handle logs of this form too.

To satisfy my second point I sometimes see, and use myself, comment records. Comment records usually start with a distinctive field name such as '#' and I usually restrict them to the head of a file. Just because it is a comment record, it might still NOT be a free-form line of text after the initial field specifier, as such comment fields can be used for parse-able header information.

- Paddy.
Gregg Lind said…
I've become a big fan of json for things like this as well.

Usually, the output message is some sort of dict-like thing anyway (with keys: TYPE, MESSAGE, TIME and the like), and json makes it easy to print that out in an cross-program-readable way (since we have perl, python, and god knows what-else looking at them).

This fails the AWK-able test, but it's still grep-able. It certainly is more adaptable than any other delimited solution I came up with, and much easier to handle strings with than many other delimited formats.
Kendra Dolan said…
I find the idea of using eval for log processing very innovative.

Popular posts from this blog

A week of Windows Subsystem for Linux

I first experimented with WSL2 as a daily development environment two years ago. Things were still pretty rough around the edges, especially with JetBrains' IDEs, and I ended up buying a dedicated Linux workstation so I wouldn't have to deal with the pain.  Unfortunately, the Linux box developed a heat management problem, and simultaneously I found myself needing a beefier GPU than it had for working on multi-vector encoding , so I decided to give WSL2 another try. Here's some of the highlights and lowlights. TLDR, it's working well enough that I'm probably going to continue using it as my primary development machine going forward. The Good NVIDIA CUDA drivers just work. I was blown away that I ran conda install cuda -c nvidia and it worked the first try. No farting around with Linux kernel header versions or arcane errors from nvidia-smi. It just worked, including with PyTorch. JetBrains products work a lot better now in remote development mod...

Python at Mozy.com

At my day job, I write code for a company called Berkeley Data Systems. (They found me through this blog, actually. It's been a good place to work.) Our first product is free online backup at mozy.com . Our second beta release was yesterday; the obvious problems have been fixed, so I feel reasonably good about blogging about it. Our back end, which is the most algorithmically complex part -- as opposed to fighting-Microsoft-APIs complex, as we have to in our desktop client -- is 90% in python with one C extension for speed. We (well, they, since I wasn't at the company at that point) initially chose Python for speed of development, and it's definitely fulfilled that expectation. (It's also lived up to its reputation for readability, in that the Python code has had 3 different developers -- in serial -- with very quick ramp-ups in each case. Python's succinctness and and one-obvious-way-to-do-it philosophy played a big part in this.) If you try it out, pleas...

Why PHP sucks

(July 8 2005) Apparently I got linked by some PHP sites, and while there were a few well-reasoned comments here I mostly just got people who only knew PHP reacting like I told them their firstborn was ugly. These people tended to give variants on one or more themes: All environments have warts, so PHP is no worse than anything else in this respect I can work around PHP's problems, ergo they are not really problems You aren't experienced enough in PHP to judge it yet As to the first, it is true that PHP is not alone in having warts. However, the lack of qualitative difference does not mean that the quantitative difference is insignificant. Similarly, problems can be worked around, but languages/environments designed by people with more foresight and, to put it bluntly, clue, simply don't make the kind of really boneheaded architecture mistakes that you can't help but run into on a daily baisis in PHP. Finally, as I noted in my original introduction, with PHP, ...