PySpark 'from_json': DataFrame returns null for all JSON columns

Question

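I am using Python (version 3.7.12) and PySpark (version 2.4.0).

I am trying to use from_json on a column with a schema I defined. However, every parsed column in the resulting DataFrame is null. I assume I have specified the schema and column types incorrectly.

The code below shows the JSON string I extracted from the table using get_json_object: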

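from pyspark.sql.functions import col, from_json, get_json_object
from pyspark.sql.types import StringType, StructField, StructType

# Alias the extracted value so the column is named "data".
df = df.select(col('id'), get_json_object(col("pulled_col"), "$.data").alias("data"))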

df.head()

#Row(id = '0123456', data = '[
#{"time" : [], "history" : [], "zip" : "78910", "phnumber" : "5678910123", "name" : "-"},
#{"time" : [], "history" : [], "zip" : "78920", "phnumber" : "5678910123", "name" : "-"},
#{"time" : [], "history" : [], "zip" : "78930", "phnumber" : "5678910123", "name" : "-"},
#{"time" : [], "history" : [], "zip" : "78910", "phnumber" : "5678910123", "name" : "-"}
#]')

df.printSchema()

#root
# |-- id: string (nullable = true)
# |-- data: string (nullable = true)

df.show()

#+-------+----------------------------+
#|     id|                        data|
#+-------+----------------------------+
#|0123456|[{"time" : [], "history"....|
#|0123456|[{"time" : [], "history"....|
#+-------+----------------------------+

# show() returns None, so there is nothing useful to assign here.
df.select(col("id"), get_json_object(col("data"), "$.zip").alias("zip")) \
    .show(truncate=False)

# Shouldn't the output be non-null?

#+-------+----+
#|     id| zip|
#+-------+----+
#|0123456|null|
#|0123456|null|
#+-------+----+

schema = StructType(
    [
        StructField('zip', StringType(), True),
        StructField('phnumber', StringType(), True),
        StructField('name', StringType(), True)
    ]
)

data_json = df.withColumn("data", from_json("data", schema))\
            .select(col('id'), col('data.*'))

# Shouldn't the output be non-null with the new JSON schema?

data_json.show()

#+-------+----+---------+-----+
#|     id| zip| phnumber| name|
#+-------+----+---------+-----+
#|0123456|null|     null| null|
#|0123456|null|     null| null|
#+-------+----+---------+-----+

Answer 1

The data column actually contains a JSON array, so the schema must be an ArrayType:

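from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# The top-level JSON value is an array, so wrap the struct in ArrayType.
schema = ArrayType(
    elementType=StructType(
        [
            StructField('zip', StringType(), True),
            StructField('phnumber', StringType(), True),
            StructField('name', StringType(), True)
        ]
    )
)
data_json = df.withColumn("data", F.from_json("data", schema))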

This results in the following schema:

root
 |-- id: long (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- zip: string (nullable = true)
 |    |    |-- phnumber: string (nullable = true)
 |    |    |-- name: string (nullable = true)
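
A quick way to confirm which top-level type a schema needs is to parse one sample value outside Spark. The sketch below uses only the standard library json module on a shortened copy of the string from the question; top_level_kind is a hypothetical helper name, not a PySpark function.

```python
import json

# Shortened copy of one value from the question's "data" column.
sample = '[{"time": [], "history": [], "zip": "78910", "phnumber": "5678910123", "name": "-"}]'

def top_level_kind(json_text):
    """Return 'array', 'object', or 'scalar' for the top-level JSON value."""
    parsed = json.loads(json_text)
    if isinstance(parsed, list):
        return "array"    # from_json needs ArrayType(...) at the top level
    if isinstance(parsed, dict):
        return "object"   # a StructType at the top level would fit
    return "scalar"

kind = top_level_kind(sample)
print(kind)
```

The same reasoning likely explains the null from get_json_object in the question: the path $.zip addresses a field on a root object, while the root here is an array, so an indexed path such as $[0].zip would be needed to reach the first element.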

Now, if you want each element of the array in a separate row, you can explode it and extract the fields you need:

data_json = df.withColumn("data", F.from_json("data", schema)) \
    .withColumn("data", F.explode("data")) \
    .select(F.col('id'), F.col('data.*'))

Result:

+---+-----+----------+----+
| id|  zip|  phnumber|name|
+---+-----+----------+----+
|  1|78910|5678910123|   -|
|  1|78920|5678910123|   -|
+---+-----+----------+----+
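
To sanity-check the expected rows without a Spark session, the explode step can be mimicked in plain Python: one output row per array element, each carrying the parent id. This only illustrates the row shape and is not how Spark executes the query; explode_rows is a hypothetical helper, and the sample values are taken from the question's data.

```python
import json

def explode_rows(row_id, json_text, fields=("zip", "phnumber", "name")):
    """Yield one (id, zip, phnumber, name) tuple per element of a JSON array."""
    for element in json.loads(json_text):
        # Keys missing from an element become None, much as from_json
        # yields null for fields absent from the JSON.
        yield (row_id,) + tuple(element.get(f) for f in fields)

data = json.dumps([
    {"time": [], "history": [], "zip": "78910", "phnumber": "5678910123", "name": "-"},
    {"time": [], "history": [], "zip": "78920", "phnumber": "5678910123", "name": "-"},
])

rows = list(explode_rows("0123456", data))
for row in rows:
    print(row)
```

Fields that are not listed in the schema, such as time and history, are simply dropped, which matches the three-column result table.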

Solution 1
0 2022-11-24 18:26:17
