I needed to scan some pretty massive gzipped text files, so my first try was the obvious "for line in gzip.open(...)." This worked but seemed way slower than expected. So I wrote "pyzcat" as a test and ran it against a file with 100k lines:
#!/usr/bin/python
import sys, gzip

for fname in sys.argv[1:]:
    for line in gzip.open(fname):
        print line,
$ time zcat testzcat.gz > /dev/null
real    0m0.329s

$ time ./pyzcat testzcat.gz > /dev/null
real    0m3.792s
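For anyone on a current interpreter, here is a rough Python 3 port of that script (the helper name cat_gzip is mine, not from the original):

```python
# A rough Python 3 port of the pyzcat script above
# (the helper name cat_gzip is mine, not from the original post).
import gzip
import sys

def cat_gzip(fname):
    # "rt" mode makes gzip.open decode bytes and split lines for us.
    with gzip.open(fname, "rt") as f:
        for line in f:
            sys.stdout.write(line)

# usage: for fname in sys.argv[1:]: cat_gzip(fname)
```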
More than 10x slower -- ouch! Well, if zcat is that much faster, let's have zcat do the reads for us:
def gziplines(fname):
    from subprocess import Popen, PIPE
    f = Popen(['zcat', fname], stdout=PIPE)
    for line in f.stdout:
        yield line

for fname in sys.argv[1:]:
    for line in gziplines(fname):
        print line,
$ time ./pyzcat2 testzcat.gz | wc
real    0m0.750s
So, reading from a zcat subprocess is about 5x faster than using the gzip module directly. cGzipFile, anyone?
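For completeness, a Python 3 sketch of the same subprocess trick; the extra cmd parameter is my own addition (not in the original pyzcat2) so the external decompressor can be swapped out, and it assumes zcat is on the PATH by default:

```python
# Python 3 sketch of the zcat-subprocess reader.
# The `cmd` parameter is an assumption of this sketch (not in the
# original pyzcat2) so the external decompressor can be swapped out.
import subprocess

def gziplines(fname, cmd=("zcat",)):
    # zcat decompresses in a separate process; we just iterate the pipe.
    # Lines come back as bytes, exactly as the pipe delivers them.
    with subprocess.Popen([*cmd, fname], stdout=subprocess.PIPE) as proc:
        yield from proc.stdout
```

Using Popen as a context manager means the pipe is closed and the child process reaped once iteration finishes.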