
Spark writing Parquet array&lt;string&gt; converts to a different datatype when loading into BigQuery

Spark Dataframe Schema:

    StructType(
        [StructField("a", StringType(), False),
         StructField("b", StringType(), True),
         StructField("c", BinaryType(), False),
         StructField("d", ArrayType(StringType(), False), True),
         StructField("e", TimestampType(), True)])

When I write the data frame to parquet and load it into BigQuery, BigQuery interprets the schema differently. It is a simple pipeline: load from JSON and write to parquet using a Spark dataframe.
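
For reference, here is a minimal PySpark sketch of that pipeline (the bucket paths are placeholders, not from the original job):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   BinaryType, ArrayType, TimestampType)

    spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

    schema = StructType(
        [StructField("a", StringType(), False),
         StructField("b", StringType(), True),
         StructField("c", BinaryType(), False),
         StructField("d", ArrayType(StringType(), False), True),
         StructField("e", TimestampType(), True)])

    # Read the JSON with the explicit schema, then write it out as parquet.
    df = spark.read.schema(schema).json("gs://my-bucket/input/")
    df.write.parquet("gs://my-bucket/output/")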

BigQuery Schema:

    [
        {
            "type": "STRING",
            "name": "a",
            "mode": "REQUIRED"
        },
        {
            "type": "STRING",
            "name": "b",
            "mode": "NULLABLE"
        },
        {
            "type": "BYTES",
            "name": "c",
            "mode": "REQUIRED"
        },
        {
            "fields": [
                {
                    "fields": [
                        {
                            "type": "STRING",
                            "name": "element",
                            "mode": "NULLABLE"
                        }
                    ],
                    "type": "RECORD",
                    "name": "list",
                    "mode": "REPEATED"
                }
            ],
            "type": "RECORD",
            "name": "d",
            "mode": "NULLABLE"
        },
        {
            "type": "TIMESTAMP",
            "name": "e",
            "mode": "NULLABLE"
        }
    ]

Is this something to do with the way Spark writes parquet, or the way BigQuery reads it? Any idea how I can fix this?

This is due to the intermediate file format (parquet by default) that the spark-bigquery connector uses.

The connector first writes the data to parquet files, then loads them into BigQuery using the BigQuery Insert API.

If you check the intermediate parquet schema using parquet-tools, you will find something like this for the field d (ArrayType(StringType) in Spark):

    optional group d (LIST) {
      repeated group list {
        optional binary element (STRING);
      }
    }
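
If parquet-tools isn't at hand, pyarrow can print the same parquet-level schema; this is an alternative I'm suggesting, not part of the original answer (the file name is a placeholder):

    import pyarrow.parquet as pq

    # .schema is the parquet-level schema, so the list/element
    # wrapper groups show up exactly as BigQuery will see them.
    print(pq.ParquetFile("part-00000.parquet").schema)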

Now, if you were loading this parquet into BigQuery yourself, using bq load or the BigQuery Insert API directly, you would be able to tell BQ to ignore the intermediate fields by enabling parquet_enable_list_inference.
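
For example, with the google-cloud-bigquery Python client the same option is exposed through ParquetOptions (the URI and table id below are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    parquet_options = bigquery.format_options.ParquetOptions()
    parquet_options.enable_list_inference = True  # collapse the list/element wrappers

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        parquet_options=parquet_options,
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/output/*.parquet",  # placeholder URI
        "my_dataset.my_table",              # placeholder table id
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to complete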

Unfortunately, I don't see how to enable this option when using the spark-bigquery connector!

As a workaround, you can try to use orc as the intermediate format:

    (df.write
        .format("bigquery")
        .option("intermediateFormat", "orc")
        .option("temporaryGcsBucket", "some-bucket")  # placeholder bucket for the intermediate files
        .save("dataset.table"))                       # placeholder table id
