Unable to read BigQuery table with JSON/RECORD column type into Spark dataframe (java.lang.IllegalStateException: Unexpected type: JSON)
We are trying to read a table from BigQuery into a Spark dataframe.
The following PySpark code is used to read the data.
from google.oauth2 import service_account
from google.cloud import bigquery
import json
import base64 as bs
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DoubleType, DecimalType
schema = "schema_name"
project_id = "project_id"
table_name = "simple"
# table_name = "jsonres"
schema_table_name = str(project_id) + "." + str(schema) + "." + str(table_name)
credentials_dict = {"Insert_actual_credentials": "here"}
credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = bigquery.Client(credentials=credentials, project=project_id)
query = "SELECT * FROM `{}`;".format(schema_table_name)
# print(query)
query_job = client.query(query)
query_job.result()
s = json.dumps(credentials_dict)
res = bs.b64encode(s.encode('utf-8'))
ans = res.decode("utf-8")
try:
    df = spark.read.format('bigquery') \
        .option("credentials", ans) \
        .option("parentProject", project_id) \
        .option("project", project_id) \
        .option("mode", "DROPMALFORMED") \
        .option('dataset', query_job.destination.dataset_id) \
        .load(query_job.destination.table_id)
    df.printSchema()
    print(df)
    df.show()
except Exception as exp:
    print(exp)
For simple tables, we are able to read the table into a dataframe successfully.
But when the BigQuery table contains a JSON column, we get an error.
We receive the following error:
An error occurred while calling o1138.load: java.lang.IllegalStateException: Unexpected type: JSON at com.google.cloud.spark.bigquery.SchemaConverters.getStandardDataType(SchemaConverters.java:355) at com.google.cloud.spark.bigquery.SchemaConverters.lambda$getDataType$3(SchemaConverters.java:303)
We also tried providing a schema while reading the data.
structureSchema = StructType([
    StructField('x', StructType([
        StructField('name', StringType(), True)
    ])),
    StructField("y", DecimalType(), True)
])
print(structureSchema)
try:
    df = spark.read.format('bigquery') \
        .option("credentials", ans) \
        .option("parentProject", project_id) \
        .option("project", project_id) \
        .option("mode", "DROPMALFORMED") \
        .option('dataset', query_job.destination.dataset_id) \
        .schema(structureSchema) \
        .load(query_job.destination.table_id)
    df.printSchema()
    print(df)
    df.show()
except Exception as exp:
    print(exp)
We still face the same error, "java.lang.IllegalStateException: Unexpected type: JSON".
How can we read a BigQuery table with a JSON-typed column into a Spark dataframe?
Update 1: There is an open issue about this on GitHub.
When reading a BigQuery table that has a JSON-type field from Spark, this exception is thrown.
Is there any workaround?
Try the code below and check whether it works for you. Basically, you keep the JSON column as a string, and then you can use Spark functions such as get_json_object to extract the JSON content.
import pyspark.sql.functions as f
structureSchema = StructType([
StructField('x', StringType()),
StructField("y", DecimalType())
])
df = (spark.read.format('bigquery')
.option("credentials", ans)
.option("parentProject", project_id)
.option("project", project_id)
.option("mode", "DROPMALFORMED")
.option('dataset', query_job.destination.dataset_id)
.schema(structureSchema)
.load(query_job.destination.table_id)
)
df = df.withColumn("jsonColumnName", f.get_json_object(f.col("x"), "$.name"))