I can write a parquet dataset with partition_cols, but not the corresponding metadata files. There seems to be a schema mismatch between the table and the collected metadata, caused by the columns I partition on.
I need some help working out what I'm doing wrong.
The code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(dictReprForDf)
table = pa.Table.from_pandas(df)
metadata_collector = []
pq.write_to_dataset(table, outputFilePath, metadata_collector=metadata_collector, partition_cols=['A', 'B', 'C'])
pq.write_metadata(table.schema, outputFilePath + '/_common_metadata')
pq.write_metadata(table.schema, outputFilePath + '/_metadata', metadata_collector=metadata_collector)
Error:
File "pyarrow\_parquet.pyx", line 616, in pyarrow._parquet.FileMetaData.append_row_groups
RuntimeError: AppendRowGroups require equal schema
Note that this code runs without errors if I don't set partition_cols in the pq.write_to_dataset call.
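Found a solution by checking how dask does this. The partition columns are encoded in the directory names rather than stored in the data files, so they have to be removed from the schema that is passed to pq.write_metadata: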
from pathlib import Path

import pyarrow.parquet as pq

# `table` is the pyarrow Table to write (e.g. from pa.Table.from_pandas(df))
root_path = Path("partitioned_data")
metadata_collector = []
partition_cols = ["partition_col1", "partition_col2"]

# Build a schema without the partition columns, matching the written files
subschema = table.schema
for col in partition_cols:
    subschema = subschema.remove(subschema.get_field_index(col))

pq.write_to_dataset(
    table, root_path=root_path, partition_cols=partition_cols,
    metadata_collector=metadata_collector,
)

# Use the reduced schema for both metadata files
pq.write_metadata(subschema, root_path / "_common_metadata")
pq.write_metadata(subschema, root_path / "_metadata", metadata_collector=metadata_collector)