
How to write a parquet bytes object as zipfile to disk

I start with a pandas DataFrame and want to save it as a zipped Parquet file, entirely in memory and without intermediate steps on disk. I have the following:

from io import BytesIO
from zipfile import ZipFile

bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer)
bytes_value = bytes_buffer.getvalue()

with ZipFile('example.zip', 'w') as zip_obj:
    zip_obj.write(bytes_buffer.getvalue())

But I get this error: ValueError: stat: embedded null character in path. I got my information from the only link I found on creating zip files in memory: https://www.neilgrogan.com/py-bin-zip/

Thank you for your help :)

ZipFile.write() expects a path to an existing file on disk, so passing the raw Parquet bytes makes it try to treat the data as a path, which is why you get the embedded null character error. Use writestr(), which takes an archive name and the bytes to store. The correct way to do this is:

from io import BytesIO
from zipfile import ZipFile

bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer)

with ZipFile('example.zip', 'w') as zip_obj:
    # writestr() takes an archive name and the bytes to store under it
    zip_obj.writestr('file.parquet', bytes_buffer.getvalue())
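
If you need the DataFrame back later, you can also read it from the archive in memory. A minimal sketch, assuming pandas with a Parquet engine such as pyarrow is installed, and using the file names from the snippet above:

from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# Read the Parquet bytes back out of the archive, still without touching disk
with ZipFile('example.zip', 'r') as zip_obj:
    parquet_bytes = zip_obj.read('file.parquet')

# Reconstruct the DataFrame from the in-memory bytes
df_roundtrip = pd.read_parquet(BytesIO(parquet_bytes))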

But you should note that storing Parquet files in a ZIP just for compression removes a lot of the benefits of the Parquet format itself. By default, Parquet is already compressed with the Snappy compression codec (but you can also use GZip, ZStandard, and others). The compression does not happen at the file level but at the column-chunk level. That means when you access the file, only the parts you actually want to read have to be decompressed. In contrast, when you put the Parquet file into a ZIP, the whole file needs to be decompressed even if you only wanted to read a selection of columns.
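
If the goal is simply better compression, here is a sketch of the Parquet-native alternative (assuming pandas with pyarrow; the example DataFrame and column names are made up for illustration):

from io import BytesIO

import pandas as pd

# Illustrative data; in the question, df is the DataFrame to be saved
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Let Parquet compress its own column chunks instead of zipping the file;
# 'gzip' or 'zstd' typically compress better than the default 'snappy'
bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer, compression='gzip')

# Column-chunk compression pays off on read: only the requested columns
# need to be decompressed
bytes_buffer.seek(0)
subset = pd.read_parquet(bytes_buffer, columns=['a'])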
