I have large, deeply nested, terabyte-sized JSONL files which I am converting to Parquet files and writing to a partitioned Google Cloud Storage bucket.
The issue is as follows: one of the nested fields is a list of strings. Ideally the schema I expect for this field is billing_code_modifier: list<item: string>. But in a rare case the list has length 0 for all records, in which case pandas writes billing_code_modifier: list<item: null>.
This causes a problem because the third-party tool reading these Parquet files (BigQuery) fails on the inconsistent schema, expecting list<string> rather than list<null> (it defaults empty arrays to int32; blame Google, not me).
How does one get around this? Is there a way to specify the schema while writing Parquet files? Since I am writing to a bucket, I cannot write an empty Parquet file and then add the data in two separate write operations, as GCS does not allow you to modify files, only overwrite them.
With pandas you can pass an Arrow schema as a kwarg to to_parquet, which pins the field types regardless of the data. See "Pyarrow apply schema when using pandas to_parquet()" for details.