
How to write a parquet bytes object as zipfile to disk

I start with a pandas DataFrame and want to save it as a zipped Parquet file, entirely in memory, without intermediate steps on disk. I have the following:

from io import BytesIO
from zipfile import ZipFile

bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer)
bytes_value = bytes_buffer.getvalue()

with ZipFile('example.zip', 'w') as zip_obj:
    zip_obj.write(bytes_buffer.getvalue())

But I get this encoding error: ValueError: stat: embedded null character in path. I got my information from the only link I found on creating zip files in memory: https://www.neilgrogan.com/py-bin-zip/

Thank you for your help :)

The correct way to do this is:

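from io import BytesIO
from zipfile import ZipFile

bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer)
bytes_value = bytes_buffer.getvalue()

with ZipFile('example.zip', 'w') as zip_obj:
    # writestr takes the name of the archive member and the data to store in it
    zip_obj.writestr('file.parquet', bytes_value)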
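The error in the question most likely comes from ZipFile.write treating its first argument as a path to a file on disk, so the raw Parquet bytes end up being passed to a stat call. ZipFile.writestr instead takes an archive member name plus the data itself. Below is a minimal sketch of the round trip using only the standard library; the payload is a made-up stand-in for real Parquet bytes and the member name file.parquet is just an example.

```python
import io
import zipfile

# Hypothetical stand-in for the bytes produced by df.to_parquet(...);
# any bytes payload behaves the same way for this purpose.
payload = b"PAR1\x00\x01\x02example-data\x00PAR1"

# Build the ZIP archive in memory instead of on disk.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as zip_obj:
    # First argument: name of the member inside the archive.
    # Second argument: the data to store under that name.
    zip_obj.writestr("file.parquet", payload)

# Read the archive back to confirm the member round-trips unchanged.
with zipfile.ZipFile(io.BytesIO(zip_buffer.getvalue())) as zip_obj:
    names = zip_obj.namelist()
    restored = zip_obj.read("file.parquet")
```

Passing a file name such as 'example.zip' instead of zip_buffer writes the archive straight to disk, as in the answer above.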

But you should note that storing Parquet files in a ZIP just for compression removes many of the benefits of the Parquet format itself. By default, Parquet is already compressed with the Snappy compression codec (you can also use GZip, ZStandard, and others). The compression does not happen at the file level but at the column-chunk level. That means that when you access the file, only the parts you want to read have to be decompressed. In contrast, when you put the Parquet file into a ZIP, the whole file has to be decompressed even if you only want to read a selection of columns.
