
Write Pandas DataFrame Parquet metadata with partition columns

I am able to write a Parquet dataset with partition_cols, but not the corresponding metadata files. There seems to be a schema mismatch between the table and the collected metadata, caused by my partition columns.

Need some help sorting out what I'm doing wrong -

The code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(dictReprForDf)
table = pa.Table.from_pandas(df)

metadata_collector = []

# write the partitioned dataset and collect per-file metadata
pq.write_to_dataset(table, outputFilePath, metadata_collector=metadata_collector, partition_cols=['A', 'B', 'C'])

# write the shared schema and the combined metadata
pq.write_metadata(table.schema, outputFilePath + '/_common_metadata')
pq.write_metadata(table.schema, outputFilePath + '/_metadata', metadata_collector=metadata_collector)

Error:

File "pyarrow\_parquet.pyx", line 616, in pyarrow._parquet.FileMetaData.append_row_groups
RuntimeError: AppendRowGroups require equal schema

It's worth noting that this code runs with no errors if I don't set partition_cols on pq.write_to_dataset.
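
For reference, the mismatch can be inspected directly. A minimal sketch, assuming table and metadata_collector come from the snippet above:

# the in-memory table still contains the partition columns A, B and C
print(table.schema.names)

# each collected FileMetaData describes a file written without the partition
# columns (they are encoded in the directory structure instead), which is why
# appending row groups against table.schema fails
print(metadata_collector[0].schema.to_arrow_schema().names)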

Found a solution by checking how they do this in dask.

from pathlib import Path

root_path = Path("partitioned_data")
metadata_collector = []
partition_cols = ["partition_col1", "partition_col2"]

# drop the partition columns from the schema so it matches the schema of the
# files that write_to_dataset actually writes
subschema = table.schema
for col in partition_cols:
    subschema = subschema.remove(subschema.get_field_index(col))

pq.write_to_dataset(
    table, root_path=root_path, partition_cols=partition_cols,
    metadata_collector=metadata_collector,
)

pq.write_metadata(subschema, root_path / "_common_metadata")
pq.write_metadata(subschema, root_path / "_metadata", metadata_collector=metadata_collector)
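
To check that the written metadata lines up with the dataset, a quick verification sketch, assuming the snippet above has run:

import pyarrow.parquet as pq

# the combined footer metadata now covers every partitioned file
meta = pq.read_metadata(root_path / "_metadata")
print(meta.num_row_groups, meta.num_rows)

# read the dataset back; the partition columns are reconstructed from the
# directory names
print(pq.ParquetDataset(root_path).read().schema)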
