
BigQuery treats a Parquet list<string> field as list<int32> when an empty array is passed

I have large, nested, terabyte-sized JSONL files which I am converting to Parquet files and writing to a partitioned Google Cloud Storage bucket.

The issue is as follows. One of the nested fields is a list of strings, so the schema I expect for this field is billing_code_modifier: list<item: string>. However, in the rare case where the list is empty for all records, pandas writes billing_code_modifier: list<item: null> instead.
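For illustration, a minimal reproduction of the inference problem (the column name is from my data; the rest of the schema is omitted):

```python
import pandas as pd
import pyarrow as pa

# All-empty edge case: pyarrow cannot infer the element type from
# empty lists, so the column comes out as list<item: null>.
df = pd.DataFrame({"billing_code_modifier": [[], [], []]})
table = pa.Table.from_pandas(df)
print(table.schema)  # billing_code_modifier: list<item: null>
```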

This causes an issue because the third-party tool reading these Parquet files (BigQuery) fails on the inconsistent schema, expecting list<string> but finding list<int32> (it defaults empty arrays to int32; blame Google, not me).

How does one get around this? Is there a way to specify the schema while writing Parquet files? Since I am dealing with a bucket, I cannot write an empty Parquet file and then add the data to it in two separate write operations, as GCS does not allow you to modify files, only overwrite them.

For pandas you can specify an Arrow schema as a kwarg to to_parquet(), which will produce the correct schema. See "Pyarrow apply schema when using pandas to_parquet()" for details.
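A minimal sketch of that approach, assuming the pyarrow engine (the bucket path is a placeholder; in real data the schema must list every column you want written, not just the one field from the question):

```python
import pandas as pd
import pyarrow as pa

# Declare the element type explicitly so empty arrays still come out
# as list<item: string> rather than list<item: null>.
schema = pa.schema([
    ("billing_code_modifier", pa.list_(pa.string())),
])

df = pd.DataFrame({"billing_code_modifier": [[], [], []]})

# pandas forwards the `schema` kwarg to pyarrow.Table.from_pandas,
# so the file is written with the declared schema even when every
# array in the column is empty.
df.to_parquet(
    "gs://your-bucket/part-00000.parquet",  # placeholder path; gs:// requires gcsfs
    engine="pyarrow",
    schema=schema,
)
```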
