Spark-written Parquet array<string> converts to a different datatype when loaded into BigQuery

Spark Dataframe Schema:

    StructType([
        StructField("a", StringType(), False),
        StructField("b", StringType(), True),
        StructField("c", BinaryType(), False),
        StructField("d", ArrayType(StringType(), False), True),
        StructField("e", TimestampType(), True)
    ])

When I write the DataFrame to Parquet and load it into BigQuery, BigQuery interprets the schema differently. The job is a simple read from JSON and write to Parquet with a Spark DataFrame, roughly as sketched below.
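
A minimal sketch of that pipeline, assuming placeholder paths and a default Spark session (not the original job):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   BinaryType, ArrayType, TimestampType)

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("a", StringType(), False),
        StructField("b", StringType(), True),
        StructField("c", BinaryType(), False),
        StructField("d", ArrayType(StringType(), False), True),
        StructField("e", TimestampType(), True)
    ])

    # Read the JSON input with the explicit schema and write it out as Parquet.
    # Both paths are placeholders.
    df = spark.read.schema(schema).json("gs://my-bucket/input/")
    df.write.mode("overwrite").parquet("gs://my-bucket/output/")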

BigQuery Schema:

    [
        {
            "type": "STRING",
            "name": "a",
            "mode": "REQUIRED"
        },
        {
            "type": "STRING",
            "name": "b",
            "mode": "NULLABLE"
        },
        {
            "type": "BYTES",
            "name": "c",
            "mode": "REQUIRED"
        },
        {
            "fields": [
                {
                    "fields": [
                        {
                            "type": "STRING",
                            "name": "element",
                            "mode": "NULLABLE"
                        }
                    ],
                    "type": "RECORD",
                    "name": "list",
                    "mode": "REPEATED"
                }
            ],
            "type": "RECORD",
            "name": "d",
            "mode": "NULLABLE"
        },
        {
            "type": "TIMESTAMP",
            "name": "e",
            "mode": "NULLABLE"
        }
    ]

Is this something to do with the way Spark writes Parquet or the way BigQuery reads it? Any idea how I can fix this so that d loads as a plain REPEATED STRING column?

This is due to the intermediate file format (Parquet by default) that the spark-bigquery connector uses.

The connector first writes the data to Parquet files, then loads them into BigQuery using the BigQuery Insert API.

If you check the intermediate Parquet schema using parquet-tools, you will find something like this for the field d (ArrayType(StringType) in Spark):

    optional group d (LIST) {
      repeated group list {
        optional binary element (STRING);
      }
    }
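
If parquet-tools is not at hand, a quick check with pyarrow on one of the intermediate files shows the same nested LIST structure (the file path below is a placeholder):

    import pyarrow.parquet as pq

    # Print the raw Parquet schema of one intermediate file (path is a placeholder).
    # The array column d appears as the three-level LIST group quoted above.
    print(pq.ParquetFile("part-00000.parquet").schema)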

Now, if you were loading this Parquet into BigQuery yourself using bq load or the BigQuery API directly, you would be able to tell BigQuery to ignore the intermediate wrapper fields by enabling parquet_enable_list_inference.
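
For illustration, a rough sketch with the BigQuery Python client is shown below (the GCS URI and destination table are placeholders); the equivalent bq load flag is --parquet_enable_list_inference=true:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Enable Parquet list inference so array<string> loads as a REPEATED STRING
    # column instead of a nested RECORD.
    parquet_options = bigquery.format_options.ParquetOptions()
    parquet_options.enable_list_inference = True

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.PARQUET
    job_config.parquet_options = parquet_options

    # GCS URI and destination table below are placeholders.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/output/*.parquet",
        "my_project.my_dataset.my_table",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to complete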

Unfortunately, I don't see how to enable this option when using the spark-bigquery connector!

As a workaround, you can try using ORC as the intermediate format:

    # The destination table and staging bucket are placeholders; adjust as needed.
    (df.write
        .format("bigquery")
        .option("intermediateFormat", "orc")
        .option("temporaryGcsBucket", "some-staging-bucket")
        .save("my_dataset.my_table"))

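With ORC as the intermediate format, the array column should load as a plain REPEATED STRING field rather than the nested RECORD structure shown above.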