
How to write a partitioned Parquet file using Pandas

I'm trying to write a Pandas dataframe to a partitioned file:

df.to_parquet('output.parquet', engine='pyarrow', partition_cols = ['partone', 'partwo'])

TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'

From the documentation I expected that partition_cols would be passed as kwargs to the pyarrow library. How can a partitioned file be written to local disk using pandas?

Pandas DataFrame.to_parquet is a thin wrapper over table = pa.Table.from_pandas(...) and pq.write_table(table, ...) (see pandas.parquet.py#L120), and pq.write_table does not support writing partitioned datasets. You should use pq.write_to_dataset instead.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(yourData)
table = pa.Table.from_pandas(df)

pq.write_to_dataset(
    table,
    root_path='output.parquet',
    partition_cols=['partone', 'parttwo'],
)
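Here pq.write_to_dataset writes one parquet file per unique combination of the partition columns, under Hive-style directories such as output.parquet/partone=.../parttwo=.../.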

For more info, see the pyarrow documentation.

In general, I would always use the PyArrow API directly when reading / writing parquet files, since the Pandas wrapper is rather limited in what it can do.
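For example, the partitioned dataset written above can be read back with PyArrow directly (a minimal sketch; the filters argument and the 'a' value are illustrative assumptions about your partition values):

import pyarrow.parquet as pq

# read every partition under output.parquet/ back into one table
table = pq.read_table('output.parquet')
df = table.to_pandas()

# optionally push a partition filter down so only matching directories are read
subset = pq.read_table('output.parquet', filters=[('partone', '=', 'a')]).to_pandas()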

You need to update to Pandas version 0.24 or above. Support for partition_cols was added in that version.

First make sure that you have a reasonably recent version of pandas and pyarrow:

pyenv shell 3.8.2
python -m venv venv
source venv/bin/activate
pip install pandas pyarrow
pip freeze | grep pandas # pandas==1.2.3
pip freeze | grep pyarrow # pyarrow==3.0.0
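You can also confirm the versions from inside Python (a minimal sketch using the standard __version__ attributes):

import pandas as pd
import pyarrow as pa

# DataFrame.to_parquet only accepts partition_cols from pandas 0.24 onwards
print(pd.__version__)   # e.g. 1.2.3
print(pa.__version__)   # e.g. 3.0.0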

Then you can use partition_cols to produce the partitioned parquet files:

import pandas as pd

# example dataframe with 3 rows and columns year,month,day,value
df = pd.DataFrame(data={'year':  [2020, 2020, 2021],
                        'month': [1,12,2], 
                        'day':   [1,31,28], 
                        'value': [1000,2000,3000]})

df.to_parquet('./mydf', partition_cols=['year', 'month', 'day'])

This produces:

mydf/year=2020/month=1/day=1/6f0258e6c48a48dbb56cae0494adf659.parquet
mydf/year=2020/month=12/day=31/cf8a45116d8441668c3a397b816cd5f3.parquet
mydf/year=2021/month=2/day=28/7f9ba3f37cb9417a8689290d3f5f9e6e.parquet
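As a quick sanity check (a minimal sketch; your hash-based file names will differ), the whole partitioned dataset can be read back into a single dataframe:

import pandas as pd

df2 = pd.read_parquet('./mydf')   # reads all partitions under mydf/
print(df2)
# the partition columns (year, month, day) are reconstructed from the
# directory names and may come back as categorical dtypes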
