
Unable to read bigquery table with JSON/RECORD column type into spark dataframe. (java.lang.IllegalStateException: Unexpected type: JSON)

We are trying to read a table from BigQuery into a Spark dataframe.

The structure of the table is simple.

The following PySpark code is used to read the data.

    from google.oauth2 import service_account
    from google.cloud import bigquery
    import json
    import base64 as bs
    from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DoubleType, DecimalType

    schema = "schema_name"
    project_id = "project_id"

    table_name = "simple"
    # table_name = "jsonres"
    schema_table_name = str(project_id) + "." + str(schema) + "." + str(table_name)
    credentials_dict = {"Insert_actual_credentials": "here"}

    # Build a BigQuery client from the service-account credentials.
    credentials = service_account.Credentials.from_service_account_info(credentials_dict)
    client = bigquery.Client(credentials=credentials, project=project_id)

    # Run the query; BigQuery writes the results to a temporary destination table.
    query = "SELECT * FROM `{}`;".format(schema_table_name)
    query_job = client.query(query)
    query_job.result()

    # The connector's "credentials" option expects the service-account JSON as base64.
    s = json.dumps(credentials_dict)
    res = bs.b64encode(s.encode('utf-8'))
    ans = res.decode("utf-8")

    try:
        # Read the query's destination table into a Spark dataframe.
        df = spark.read.format('bigquery') \
            .option("credentials", ans) \
            .option("parentProject", project_id) \
            .option("project", project_id) \
            .option("mode", "DROPMALFORMED") \
            .option('dataset', query_job.destination.dataset_id) \
            .load(query_job.destination.table_id)
        df.printSchema()
        print(df)
        df.show()
    except Exception as exp:
        print(exp)
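As an aside on the base64 step above: the Spark BigQuery connector's `credentials` option takes the service-account JSON encoded as a base64 string. A minimal round-trip check of that encoding (pure Python, using the placeholder dict from the snippet) looks like this:

```python
import base64
import json

# Placeholder credentials dict, as in the snippet above.
creds = {"Insert_actual_credentials": "here"}

# Encode: JSON -> UTF-8 bytes -> base64 -> ASCII string for the connector.
encoded = base64.b64encode(json.dumps(creds).encode("utf-8")).decode("utf-8")

# Decode: the connector reverses this to recover the original JSON.
decoded = json.loads(base64.b64decode(encoded))
print(decoded == creds)  # True
```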

For simple tables, we are able to read the table as a dataframe successfully.

But when we have a JSON column in the BigQuery table, as given below, we get an error.

We are getting the following error:

    An error occurred while calling o1138.load.
    java.lang.IllegalStateException: Unexpected type: JSON
        at com.google.cloud.spark.bigquery.SchemaConverters.getStandardDataType(SchemaConverters.java:355)
        at com.google.cloud.spark.bigquery.SchemaConverters.lambda$getDataType$3(SchemaConverters.java:303)
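One possible workaround, not attempted in this post (a sketch, assuming BigQuery's `TO_JSON_STRING` function and the column names `x` and `y` used elsewhere here), is to serialize the JSON column to a plain STRING inside the query itself, so the destination table the connector reads no longer contains a JSON-typed field:

```python
# Hypothetical sketch: alias the JSON column x to a STRING via TO_JSON_STRING,
# so the Spark connector only sees STRING/NUMERIC fields in the result table.
# Column names x (JSON) and y are assumed from the schemas in this post.
schema_table_name = "project_id.schema_name.simple"
query = "SELECT TO_JSON_STRING(x) AS x, y FROM `{}`;".format(schema_table_name)
print(query)
```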

We also tried providing a schema while reading the data:

structureSchema = StructType([
    StructField('x', StructType([
        StructField('name', StringType(), True)
    ])),
    StructField("y", DecimalType(), True)
])
print(structureSchema)

try:
    df = spark.read.format('bigquery') \
        .option("credentials", ans) \
        .option("parentProject", project_id) \
        .option("project", project_id) \
        .option("mode", "DROPMALFORMED") \
        .option('dataset', query_job.destination.dataset_id) \
        .schema(structureSchema) \
        .load(query_job.destination.table_id)
    df.printSchema()
    print(df)
    df.show()
except Exception as exp:
    print(exp)

We still faced the same error: 'java.lang.IllegalStateException: Unexpected type: JSON'.

How can we read a BigQuery table with a JSON-typed column into a Spark dataframe?

Update 1: There is an open issue on GitHub regarding this.

While reading a BigQuery table that has a JSON-typed field from Apache Spark, an exception is thrown.

Is there any workaround for this?

Try the code below and check if it works for you. Basically, you keep the JSON column as a string, and can then use Spark's JSON functions to extract its content:

import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

structureSchema = StructType([
    StructField('x', StringType()),
    StructField("y", DecimalType())
])

df = (spark.read.format('bigquery')
        .option("credentials", ans)
        .option("parentProject", project_id)
        .option("project", project_id)
        .option("mode", "DROPMALFORMED")
        .option('dataset', query_job.destination.dataset_id)
        .schema(structureSchema)
        .load(query_job.destination.table_id)
     )

# Extract the "name" field from the JSON string held in column x
df = df.withColumn("jsonColumnName", f.get_json_object(f.col("x"), "$.name"))
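`get_json_object` evaluates a JSONPath expression against each string value in the column. As a rough pure-Python illustration of what the `$.name` path extracts from a single value (the sample value is assumed, not from the post):

```python
import json

# One value of the string column x (assumed sample).
sample_x = '{"name": "Alice"}'

# JSONPath "$.name" selects the top-level "name" key.
name = json.loads(sample_x)["name"]
print(name)  # Alice
```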

