I needed to scan some pretty massive gzipped text files, so my first try was the obvious "for line in gzip.open(...)." This worked but seemed way slower than expected. So I wrote "pyzcat" as a test and ran it against a file with 100k lines:
#!/usr/bin/python
import sys, gzip
for fname in sys.argv[1:]:
    for line in gzip.open(fname):
        print line,
Results:
$ time zcat testzcat.gz > /dev/null
real 0m0.329s
$ time ./pyzcat testzcat.gz > /dev/null
real 0m3.792s
More than 10x slower -- ouch! Well, if zcat is so much better, let's try using zcat to do the reads for us:
def gziplines(fname):
    from subprocess import Popen, PIPE
    f = Popen(['zcat', fname], stdout=PIPE)
    for line in f.stdout:
        yield line

for fname in sys.argv[1:]:
    for line in gziplines(fname):
        print line,
Results:
$ time ./pyzcat2 testzcat.gz |wc
real 0m0.750s
So, reading from a zcat subprocess is 5x faster than using the gzip module. cGzipFile anyone?
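For anyone trying the same trick on Python 3, here is a minimal sketch of the subprocess reader. It makes a few assumptions that are not in the code above: it calls `gzip -dc` rather than `zcat` (the two are equivalent for .gz files, and some systems ship a `zcat` that only handles .Z files), it yields bytes rather than str, and it checks the child's exit status so a corrupt file does not pass silently.

```python
import subprocess


def gziplines(fname):
    """Yield decompressed lines (as bytes) from fname via an external gzip process."""
    # Assumption: a `gzip` binary is on PATH; `gzip -dc` writes the
    # decompressed data to stdout, like zcat does.
    with subprocess.Popen(["gzip", "-dc", fname], stdout=subprocess.PIPE) as proc:
        for line in proc.stdout:
            yield line
    # Leaving the with-block closes the pipe and waits for the child,
    # so returncode is set by the time we get here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, proc.args)
```

Usage is the same as before: `for line in gziplines(fname): ...`. I have not timed this variant, so the numbers above apply only to the original script.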

8 comments:
Nice to know someone else has found the same issue. I got someone to write me a C library to deal with the gzip files I am working with, but it is targeted at ARC files (as used by the Internet Archive) and is not general enough for a cGzipFile. Would be nice to see this fixed in the core library, though.
gzip was made faster for Python 2.5 (something like a 30%-40% speed improvement in gzip.readline).
Note that the gzip module is already implemented in C and it calls libz for the actual work.
Just increase the block size in which you do I/O to 1 MB and the performance will be close.
#pyzcat-large-block:
#!/usr/bin/env python
import sys
import gzip
BLOCK_SIZE = 2**20
f = gzip.open(sys.argv[1])
for i in iter(lambda: f.read(BLOCK_SIZE), ''):
    sys.stdout.write(i)
# timing results on a 150 MB gzip file
$ /usr/bin/time zcat t.gz > /dev/null
2.12user 0.11system 0:02.68elapsed 83%CPU (0avgtext+0avgdata 0maxresident)k
$ /usr/bin/time ./pyzcat t.gz > /dev/null
2.40user 0.57system 0:03.02elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+90189minor)pagefaults 0swaps
libz is C, but the gzip module is Python.
Feeding your 1MB chunks to a cStringIO to turn them into individual lines might speed things up vs the Python code in GzipFile, but it's going to take some work when lines span your chunks.
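To make the "lines span your chunks" problem concrete, here is a rough sketch of one way to handle it. It is not the cStringIO approach; it swaps in a plain split-and-carry-over loop, assumes Python 3 (so everything is bytes), and the name `gzip_lines_chunked` is made up for illustration. Whether it actually beats GzipFile's own line iteration would need to be measured.

```python
import gzip

BLOCK_SIZE = 2 ** 20  # 1 MB, same as the large-block example


def gzip_lines_chunked(fname, block_size=BLOCK_SIZE):
    """Yield lines (bytes, newline included) from a gzip file read in large blocks."""
    tail = b""  # partial line carried over from the previous block
    with gzip.open(fname, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            pieces = (tail + block).split(b"\n")
            # The last piece has not seen its newline yet (it is empty if
            # the block ended exactly on a newline), so hold it back.
            tail = pieces.pop()
            for piece in pieces:
                yield piece + b"\n"
    if tail:
        # The file did not end with a newline; emit the final partial line.
        yield tail
```

The only state that has to survive between reads is `tail`, the unfinished line at the end of each block.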
Your comparison of zcat and pyzcat is testing completely different things. If you make pyzcat do what zcat does, you'll find the timing isn't that different.
zcat doesn't give you a line-oriented interface, but your pyzcat is doing buffering and scanning to make that happen.
#!/usr/bin/python
import sys, gzip, shutil
for fname in sys.argv[1:]:
    shutil.copyfileobj(gzip.open(fname), sys.stdout)
That gives me 0.369s for zcat and 0.530s for a more similar pyzcat.
Still, the point that GzipFile's readline is hella slow vs zcat via subprocess remains.
Of course, zcat does NOT search for newlines and split lines. You can't really compare apples and oranges.
I'm comparing code to generate lines via GzipFile to code to generate lines via subprocess + zcat. That is certainly apples to apples.
I guess most people commenting here didn't read past the first block of code.