
How to write a parquet bytes object as zipfile to disk

I start with a pandas DataFrame and want to save it as a zipped Parquet file, entirely in memory and without intermediate steps on disk. I have the following:

from io import BytesIO
from zipfile import ZipFile

bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer)
bytes_value = bytes_buffer.getvalue()

with ZipFile('example.zip', 'w') as zip_obj:
    zip_obj.write(bytes_buffer.getvalue())

But I get this error: ValueError: stat: embedded null character in path. I got my information from the only link I found on creating zip files in memory: https://www.neilgrogan.com/py-bin-zip/

Thank you for your help :)

ZipFile.write() expects a path to an existing file on disk, so passing the raw Parquet bytes makes it try to treat the data as a path, which is why you get the embedded null character error. Use writestr(), which takes an archive name and the bytes to store. The correct way to do this is:

from io import BytesIO
from zipfile import ZipFile

bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer)

with ZipFile('example.zip', 'w') as zip_obj:
    # writestr() takes an archive name and the bytes to store under it
    zip_obj.writestr('file.parquet', bytes_buffer.getvalue())
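
If you need the DataFrame back later, you can also read it from the archive in memory. A minimal sketch, assuming pandas with a Parquet engine such as pyarrow is installed, and using the file names from the snippet above:

from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# Read the Parquet bytes back out of the archive, still without touching disk
with ZipFile('example.zip', 'r') as zip_obj:
    parquet_bytes = zip_obj.read('file.parquet')

# Reconstruct the DataFrame from the in-memory bytes
df_roundtrip = pd.read_parquet(BytesIO(parquet_bytes))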

But you should note that storing Parquet files in a ZIP just for compression removes a lot of the benefits of the Parquet format itself. By default, Parquet is already compressed with the Snappy compression codec (but you can also use GZip, ZStandard, and others). The compression does not happen at the file level but at the column-chunk level. That means when you access the file, only the parts you actually want to read have to be decompressed. In contrast, when you put the Parquet file into a ZIP, the whole file needs to be decompressed even if you only wanted to read a selection of columns.
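
If the goal is simply better compression, here is a sketch of the Parquet-native alternative (assuming pandas with pyarrow; the example DataFrame and column names are made up for illustration):

from io import BytesIO

import pandas as pd

# Illustrative data; in the question, df is the DataFrame to be saved
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Let Parquet compress its own column chunks instead of zipping the file;
# 'gzip' or 'zstd' typically compress better than the default 'snappy'
bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer, compression='gzip')

# Column-chunk compression pays off on read: only the requested columns
# need to be decompressed
bytes_buffer.seek(0)
subset = pd.read_parquet(bytes_buffer, columns=['a'])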
