Split equivalent of gzip files in python

Question

I'm trying to replicate this bash command in Python. It splits the gzipped file into parts of 50MB each.

split -b 50m "file.dat.gz" "file.dat.gz.part-"

My attempt at the python equivalent

import gzip

infile_name = "file.dat.gz"

chunk = 50*1024*1024 # 50MB

with gzip.open(infile_name, 'rb') as infile:
    for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
        print(n, chunk)
        with gzip.open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
            outfile.write(raw_bytes)

This produces gzipped parts of about 15MB each. When I gunzip them, they are 50MB each.

How do I split the gzipped file in Python so that each part is 50MB before gunzipping?

Answer 1

I don't believe that split works the way you think it does. It doesn't split the gzip file into smaller gzip files, i.e. you can't call gunzip on the individual files it creates. It literally cuts the compressed bytes into smaller chunks, and if you want to gunzip the data, you have to concatenate all the chunks back together first. So, to emulate the actual behavior with Python, we'd do something like:

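infile_name = "file.dat.gz"

chunk = 50*1024*1024 # 50MB

# Use plain open(), not gzip.open(), so the compressed bytes are read as-is
with open(infile_name, 'rb') as infile:
    # iter() keeps calling read() until it returns b'' at end of file
    for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
        print(n, chunk)
        # Keep .gz in the name: each part is a slice of one gzip stream
        with open('{}.part-{}'.format(infile_name, n), 'wb') as outfile:
            outfile.write(raw_bytes)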

In practice, we'd build each output chunk from several smaller reads so that we never hold a full 50MB in memory.
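As a rough sketch of that lower-memory variant: the function below copies the compressed file in small blocks and starts a new part once the current one is full. The names split_file, join_parts, part_size and block_size are made up for this illustration, and the numeric part suffix is an assumption (split itself uses aa, ab, ...).

```python
import shutil

def split_file(infile_name, part_size=50 * 1024 * 1024, block_size=1024 * 1024):
    """Cut a file into raw byte slices of at most part_size bytes each."""
    part_names = []
    with open(infile_name, 'rb') as infile:
        while True:
            # Read the first block of the next part; stop at end of file
            block = infile.read(min(block_size, part_size))
            if not block:
                break
            part_name = '{}.part-{}'.format(infile_name, len(part_names))
            with open(part_name, 'wb') as outfile:
                written = 0
                while block:
                    outfile.write(block)
                    written += len(block)
                    if written >= part_size:
                        break
                    # Never read past the end of the current part
                    block = infile.read(min(block_size, part_size - written))
            part_names.append(part_name)
    return part_names

def join_parts(part_names, outfile_name):
    """Concatenate the parts in order to rebuild the original file."""
    with open(outfile_name, 'wb') as outfile:
        for part_name in part_names:
            with open(part_name, 'rb') as part:
                shutil.copyfileobj(part, outfile)
```

As with split, the parts are not valid gzip files on their own. Only the result of join_parts (the equivalent of cat file.dat.gz.part-* > file.dat.gz) can be gunzipped.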

We might also be able to break the file into smaller files that can each be gunzipped individually and still hit the target size. Using something like an io.BytesIO stream, we could gunzip the input and gzip it into that in-memory stream until it reaches the target size, then write it out and start a new BytesIO stream.

With compressed data, you have to measure the size of the output, not the size of the input, as we can't predict how well the data will compress.
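A minimal sketch of that idea, not a tuned implementation: it decompresses the input in small blocks, recompresses into a BytesIO, and checks how many compressed bytes have reached the buffer so far. The names split_gzip, target_size and block_size are invented for this example, and it assumes the input name ends in .gz. Because the compressor buffers data internally, the measured size lags behind, so each part can overshoot the target somewhat.

```python
import gzip
import io

def _flush_part(infile_name, n, buf, gz):
    """Finish the gzip stream held in buf and write it out as part n."""
    gz.close()  # writes the remaining compressed data and the gzip trailer
    # Assumes infile_name ends in '.gz'; the suffix is moved to the end
    part_name = '{}.part-{}.gz'.format(infile_name[:-3], n)
    with open(part_name, 'wb') as outfile:
        outfile.write(buf.getvalue())
    return part_name

def split_gzip(infile_name, target_size=50 * 1024 * 1024, block_size=64 * 1024):
    """Re-compress a gzip file into parts that can each be gunzipped alone."""
    part_names = []
    buf = gz = None
    with gzip.open(infile_name, 'rb') as infile:
        for block in iter(lambda: infile.read(block_size), b''):
            if gz is None:
                # Start a fresh in-memory gzip stream for the next part
                buf = io.BytesIO()
                gz = gzip.GzipFile(fileobj=buf, mode='wb')
            gz.write(block)
            # Measure the compressed output, not the decompressed input
            if buf.tell() >= target_size:
                part_names.append(_flush_part(infile_name, len(part_names), buf, gz))
                gz = None
    if gz is not None:
        # Whatever is left becomes the last, smaller part
        part_names.append(_flush_part(infile_name, len(part_names), buf, gz))
    return part_names
```

Calling split_gzip('file.dat.gz') would then produce file.dat.part-0.gz, file.dat.part-1.gz, and so on, each of which gunzips on its own.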

Answer 2

Here's a solution for emulating something like the split -l (split on lines) command option that will allow you to open each individual file with gunzip.

import io
import os
import shutil
from xopen import xopen

def split(infile_name, num_lines):
    # first part of the file name, e.g. 'myfile' for 'path/to/myfile.csv.gz'
    infile_name_fp = os.path.basename(infile_name).split('.')[0]
    # create the output folder next to the original .csv.gz file
    out_dir = os.path.join(os.path.dirname(infile_name), f'{infile_name_fp}_split')
    if os.path.exists(out_dir):
        shutil.rmtree(out_dir)
    os.makedirs(out_dir)

    def write_part(text, part):
        # each part is written as a complete gzip file of its own
        out_name = os.path.join(out_dir, f'{infile_name_fp}.part-{part:05d}.csv.gz')
        with xopen(out_name, mode='wt', compresslevel=6) as outfile:
            outfile.write(text)

    m = 0
    part = 0
    buf = io.StringIO()  # initialize buffer
    with xopen(infile_name, 'rt') as infile:
        for line in infile:
            buf.write(line)  # fill up buffer
            m += 1
            if m == num_lines:  # buffer is full: write it to a file
                write_part(buf.getvalue(), part)
                m = 0
                part += 1
                buf = io.StringIO()  # new buffer -> faster than seek(0); truncate(0)

    # write whatever is left in the buffer to a final file
    if m:
        write_part(buf.getvalue(), part)
    buf.close()

Usage:

split('path/to/myfile.csv.gz', num_lines=100000)

Outputs a folder with the split files at path/to/myfile_split.

Discussion: I've used xopen here for additional speed, but you may choose to use gzip.open if you want to stay within the Python standard library. Performance-wise, I've benchmarked this to take about twice as long as a solution combining pigz and split. It's not bad, but could be better. The bottleneck is the for loop and the buffer, so rewriting this to work asynchronously might be more performant.
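For reference, a standard-library-only variant could look roughly like the sketch below. It is not the benchmarked code above: the function name split_lines, the out_dir parameter and the part naming are assumptions made for this example, and it uses itertools.islice to pull num_lines lines at a time instead of a StringIO buffer.

```python
import gzip
import itertools
import os

def split_lines(infile_name, num_lines, out_dir):
    """Write num_lines lines per part; each part is a complete gzip file."""
    os.makedirs(out_dir, exist_ok=True)
    part_names = []
    # newline='' keeps the original line endings untouched on read and write
    with gzip.open(infile_name, 'rt', newline='') as infile:
        while True:
            # Take the next num_lines lines (fewer at the end of the file)
            lines = list(itertools.islice(infile, num_lines))
            if not lines:
                break
            part_name = os.path.join(
                out_dir, 'part-{:05d}.csv.gz'.format(len(part_names)))
            with gzip.open(part_name, 'wt', compresslevel=6, newline='') as outfile:
                outfile.writelines(lines)
            part_names.append(part_name)
    return part_names
```

Usage would be along the lines of split_lines('path/to/myfile.csv.gz', 100000, 'path/to/myfile_split'). This holds num_lines lines in memory at once, the same trade-off as the buffer-based version.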

solution1
4 ACCEPTED 2017-07-22 04:49:50

solution2
0 2020-12-13 01:00:27
