How to compress parquet file with zstandard using pandas
I'm using pandas to convert DataFrames to .parquet files using this command:
df.to_parquet(file_name, engine='pyarrow', compression='gzip')
I need to use zstandard as the compression algorithm, but the function above accepts only gzip, snappy, and brotli. Is there a way to include zstd in this function? If not, how can I do that with other packages? I tried with zstandard, but it seems to accept only bytes-like objects.
I usually use zstandard as the compression algorithm for my DataFrames.
This is the code I use (a bit simplified) to write those parquet files:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
parquetFilename = "test.parquet"
df = pd.DataFrame(
    {
        "num_legs": [2, 4, 8, 0],
        "num_wings": [2, 0, 0, 0],
        "num_specimen_seen": [10, 2, 1, 8],
    },
    index=["falcon", "dog", "spider", "fish"],
)
# Convert the DataFrame to an Arrow table, then write it with zstd compression.
table = pa.Table.from_pandas(df)
pq.write_table(table, parquetFilename, compression="zstd")
And to read these parquet files:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
parquetFilename = "test.parquet"
df = pq.read_table(parquetFilename)
df = df.to_pandas()
For more details, see these sites:
Finally, a shameless plug for a blog post I wrote. It is about the speed vs. space balance of zstandard and snappy compression in parquet files using pyarrow. It is relevant to your question and includes some more "real world" code examples of reading and writing parquet files in zstandard. I will actually be writing a follow-up soon too. If you're interested, let me know.
It seems it is not supported yet:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'
Name of the compression to use. Use None for no compression.