Unable to read BigQuery table with JSON/RECORD column type into Spark dataframe (java.lang.IllegalStateException: Unexpected type: JSON)
We are trying to read a table from BigQuery into a Spark dataframe. The following PySpark code is used to read the data.
from google.oauth2 import service_account
from google.cloud import bigquery
import json
import base64 as bs
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DoubleType, DecimalType
schema = "schema_name"
project_id = "project_id"
table_name = "simple"
# table_name = "jsonres"
schema_table_name = str(project_id) + "." + str(schema) + "." + str(table_name)
credentials_dict = {"Insert_actual_credentials": "here"}
credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = bigquery.Client(credentials=credentials, project=project_id)
query = "SELECT * FROM `{}`;".format(schema_table_name)
# print(query)
query_job = client.query(query)
query_job.result()
s = json.dumps(credentials_dict)
res = bs.b64encode(s.encode('utf-8'))
ans = res.decode("utf-8")
try:
    df = spark.read.format('bigquery') \
        .option("credentials", ans) \
        .option("parentProject", project_id) \
        .option("project", project_id) \
        .option("mode", "DROPMALFORMED") \
        .option('dataset', query_job.destination.dataset_id) \
        .load(query_job.destination.table_id)
    df.printSchema()
    print(df)
    df.show()
except Exception as exp:
    print(exp)
For simple tables, we are able to read the table as a dataframe successfully. But when the BigQuery table has a JSON column, as given below, we get the following error.
An error occurred while calling o1138.load.
: java.lang.IllegalStateException: Unexpected type: JSON
    at com.google.cloud.spark.bigquery.SchemaConverters.getStandardDataType(SchemaConverters.java:355)
    at com.google.cloud.spark.bigquery.SchemaConverters.lambda$getDataType$3(SchemaConverters.java:303)
We also tried providing a schema while reading the data.
structureSchema = StructType([
    StructField('x', StructType([
        StructField('name', StringType(), True)
    ])),
    StructField("y", DecimalType(), True)
])
print(structureSchema)
try:
    df = spark.read.format('bigquery') \
        .option("credentials", ans) \
        .option("parentProject", project_id) \
        .option("project", project_id) \
        .option("mode", "DROPMALFORMED") \
        .option('dataset', query_job.destination.dataset_id) \
        .schema(structureSchema) \
        .load(query_job.destination.table_id)
    df.printSchema()
    print(df)
    df.show()
except Exception as exp:
    print(exp)
We still faced the same error, 'java.lang.IllegalStateException: Unexpected type: JSON'. How can we read a BigQuery table with a JSON type column into a Spark dataframe?
Update 1: There is an open issue on GitHub regarding this: reading a BigQuery table that has a JSON type field from Apache Spark throws an exception. Is there any workaround for this?
Try the code below and check if it works for you. Basically, you keep the JSON column as a string, and then you can use a Spark function to extract the JSON content:
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

structureSchema = StructType([
    StructField('x', StringType()),
    StructField("y", DecimalType())
])
df = (spark.read.format('bigquery')
    .option("credentials", ans)
    .option("parentProject", project_id)
    .option("project", project_id)
    .option("mode", "DROPMALFORMED")
    .option('dataset', query_job.destination.dataset_id)
    .schema(structureSchema)
    .load(query_job.destination.table_id)
)
df = df.withColumn("jsonColumnName", f.get_json_object(f.col("x"), "$.name"))
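For reference, the JSONPath `$.name` used above pulls the `name` field out of each JSON string. A minimal plain-Python sketch of the same extraction (no Spark required; the sample rows are hypothetical data for illustration):

```python
import json

# Sample rows as they would arrive when the BigQuery JSON column
# is read as a plain string via the StringType schema above.
rows = [
    {"x": '{"name": "alice"}', "y": 1.5},
    {"x": '{"name": "bob"}', "y": 2.5},
]

# Equivalent of f.get_json_object(f.col("x"), "$.name"):
# parse each JSON string and extract its "name" field.
names = [json.loads(row["x"])["name"] for row in rows]
print(names)  # ['alice', 'bob']
```

In Spark this runs per row on the executors; `get_json_object` returns NULL for rows where the path does not match, much like using `dict.get` here instead of direct indexing.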