
Wow, the gzip module kinda sucks

I needed to scan some pretty massive gzipped text files, so my first try was the obvious "for line in gzip.open(...)." This worked but seemed way slower than expected. So I wrote "pyzcat" as a test and ran it against a file with 100k lines:

#!/usr/bin/python

import sys, gzip

for fname in sys.argv[1:]:
    for line in gzip.open(fname):
        print line,

Results:

$ time zcat testzcat.gz > /dev/null
real    0m0.329s

$ time ./pyzcat testzcat.gz > /dev/null
real    0m3.792s

10x slower -- ouch! Well, if zcat is so much better, let's try using zcat to do the reads for us:

#!/usr/bin/python

import sys
from subprocess import Popen, PIPE

def gziplines(fname):
    # let an external zcat process handle decompression;
    # we just iterate over the lines it writes to stdout
    f = Popen(['zcat', fname], stdout=PIPE)
    for line in f.stdout:
        yield line

for fname in sys.argv[1:]:
    for line in gziplines(fname):
        print line,

Results:

$ time ./pyzcat2 testzcat.gz |wc
real    0m0.750s

So, reading from a zcat subprocess is 5x faster than using the gzip module. cGzipFile anyone?

Comments

junklight said…
Nice to know someone else has found the same issue. I got someone to write me a C library to deal with the gzip files I am working with, but it is targeted at arc files (as used by the Internet Archive) and is not general enough for a cGzipFile. It would be nice to see this fixed in the core library, though.
Anonymous said…
gzip was made faster for Python 2.5 (something like a 30%-40% speed improvement in gzip.readline).
Anonymous said…
Note that the gzip module is already implemented in C and it calls libz for the actual work.

Just increase the block size in which you do I/O to 1 MB and the performance will be close.

#pyzcat-large-block:
#!/usr/bin/env python

import sys
import gzip

BLOCK_SIZE = 2**20

f = gzip.open(sys.argv[1])
for i in iter(lambda: f.read(BLOCK_SIZE), ''):
    sys.stdout.write(i)

# timing results on a 150 MB gzip file
$ /usr/bin/time zcat t.gz > /dev/null
2.12user 0.11system 0:02.68elapsed 83%CPU (0avgtext+0avgdata 0maxresident)k

$ /usr/bin/time ./pyzcat t.gz > /dev/null
2.40user 0.57system 0:03.02elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+90189minor)pagefaults 0swaps
Jonathan Ellis said…
libz is C, but the gzip module is python.

Feeding your 1 MB chunks to a cStringIO to turn them into individual lines might speed things up vs the Python code in GzipFile, but it's going to take some work when lines span your chunks.
Anonymous said…
Your comparison of zcat and pyzcat is testing completely different things. If you make pyzcat do what zcat does, you'll find the timing isn't that different.

zcat doesn't give you a line-oriented interface but your pyzcat is doing buffering and scanning to make that happen.

#!/usr/bin/python
import sys, gzip, shutil
for fname in sys.argv[1:]:
    shutil.copyfileobj(gzip.open(fname), sys.stdout)

That gives me 0.369s for zcat and 0.530s for a more similar pyzcat.
Jonathan Ellis said…
Still, the point that GzipFile's readline is hella slow vs zcat via subprocess remains.
Anonymous said…
Of course, zcat does NOT search for newlines and split lines. You can't really compare apples and oranges.
Jonathan Ellis said…
I'm comparing code to generate lines via GzipFile to code to generate lines via subprocess + zcat. That is certainly apples to apples.

I guess most people commenting here didn't read past the first block of code.
