How to create a schema for a nested JSON column in PySpark?
I have a parquet file with multiple columns, two of which hold JSON/struct data but are typed as string. Each array can contain any number of array_element entries.
{
"addressline": [
{
"array_element": "F748DK’8U1P9’2ZLKXE"
},
{
"array_element": "’O’P0BQ04M-"
},
{
"array_element": "’fvrvrWEM-"
}
],
"telephone": [
{
"array_element": {
"locationtype": "8.PLT",
"countrycode": null,
"phonenumber": "000000000",
"phonetechtype": "1.PTT",
"countryaccesscode": null,
"phoneremark": null
}
}
]
}
How can I create a schema to handle these columns in PySpark?
Treating the example you provided as a string, I created this dataframe:
from pyspark.sql import functions as F, types as T
df = spark.createDataFrame([('{"addressline":[{"array_element":"F748DK’8U1P9’2ZLKXE"},{"array_element":"’O’P0BQ04M-"},{"array_element":"’fvrvrWEM-"}],"telephone":[{"array_element":{"locationtype":"8.PLT","countrycode":null,"phonenumber":"000000000","phonetechtype":"1.PTT","countryaccesscode":null,"phoneremark":null}}]}',)], ['c1'])
This is the schema to apply to this column:
schema = T.StructType([
    T.StructField('addressline', T.ArrayType(T.StructType([
        T.StructField('array_element', T.StringType())
    ]))),
    T.StructField('telephone', T.ArrayType(T.StructType([
        T.StructField('array_element', T.StructType([
            T.StructField('locationtype', T.StringType()),
            T.StructField('countrycode', T.StringType()),
            T.StructField('phonenumber', T.StringType()),
            T.StructField('phonetechtype', T.StringType()),
            T.StructField('countryaccesscode', T.StringType()),
            T.StructField('phoneremark', T.StringType()),
        ]))
    ])))
])
Result of passing the schema to the from_json function:
df = df.withColumn('c1', F.from_json('c1', schema))
df.show()
# +-------------------------------------------------------------------------------------------------------+
# |c1 |
# +-------------------------------------------------------------------------------------------------------+
# |{[{F748DK’8U1P9’2ZLKXE}, {’O’P0BQ04M-}, {’fvrvrWEM-}], [{{8.PLT, null, 000000000, 1.PTT, null, null}}]}|
# +-------------------------------------------------------------------------------------------------------+
df.printSchema()
# root
# |-- c1: struct (nullable = true)
# | |-- addressline: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- array_element: string (nullable = true)
# | |-- telephone: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- array_element: struct (nullable = true)
# | | | | |-- locationtype: string (nullable = true)
# | | | | |-- countrycode: string (nullable = true)
# | | | | |-- phonenumber: string (nullable = true)
# | | | | |-- phonetechtype: string (nullable = true)
# | | | | |-- countryaccesscode: string (nullable = true)
# | | | | |-- phoneremark: string (nullable = true)