I have large, deeply nested, terabyte-sized JSONL files which I am converting to Parquet files and writing to a partitioned Google Cloud Storage bucket.
The issue is as follows: one of the nested fields is a list of strings. Ideally the schema I expect for this field is billing_code_modifier: list<item: string>. But in a rare case the list has length 0 for all records, in which case pandas writes billing_code_modifier: list<item: null>.
This causes a problem because the third-party tool reading these Parquet files (BigQuery) fails on the inconsistent schema, expecting list<string> rather than list<null> (it defaults empty arrays to int32; blame Google, not me).
How does one get around this? Is there a way to specify the schema while writing Parquet files? Since I am writing to a bucket, I cannot write an empty Parquet file and then add the data in two separate write operations, as GCS does not allow you to modify files, only overwrite them.
With pandas you can pass an Arrow schema as a kwarg to to_parquet, which pins the field types regardless of the data. See "Pyarrow apply schema when using pandas to_parquet()" for details.