Spark - Converting a column containing a JSON string from StringType to ArrayType(StringType())

Question

I have a dataframe df that contains a JSON string, as shown below:

'''[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"},{"@id":"Party_3","@ObjectID":"Policy_3"}]'''

Schema of df:

root
 |-- col1: string (nullable = true)

How can I convert it to an array of strings (ArrayType(StringType()))?

The result should look like this:

['{"@id":"Party_1","@ObjectID":"Policy_1"}',
 '{"@id":"Party_2","@ObjectID":"Policy_2"}',
 '{"@id":"Party_3","@ObjectID":"Policy_3"}']

Result schema:

root
 |-- arr_col: array (nullable = true)
 |    |-- element: string (containsNull = true)

Any help would be greatly appreciated. Thanks!

Answer 1

You can use the from_json function to parse the JSON fields after slightly modifying the value, as shown below:

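from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

data = [
    ('[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"},{"@id":"Party_3","@ObjectID":"Policy_3"}]', 2767),
    ('[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"},{"@id":"Party_3","@ObjectID":"Policy_3"}]', 4235)
]

# Wrap the top-level JSON array in an object ({"arr": [...]}) so that
# schema inference yields a single array field named "arr".
df = spark.createDataFrame(data).toDF(*["value", "count"])\
    .withColumn("value", f.regexp_replace(f.col("value"), "\\[\\{", "{\"arr\": [{"))\
    .withColumn("value", f.regexp_replace(f.col("value"), "\\}\\]", "}]}"))

# Infer the schema from the rewritten JSON strings.
json_schema = spark.read.json(df.rdd.map(lambda row: row.value)).schema

# Parse the column with the inferred schema and expand the struct into columns.
resultDF = df.select(f.from_json("value", schema=json_schema).alias("array_col"))\
    .select("array_col.*")

resultDF.printSchema()
resultDF.show(truncate=False)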
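
To see what the two regexp_replace calls do to each value, here is a minimal plain-Python sketch (no Spark required) that applies the same two patterns with the standard re module. The helper name wrap_array is hypothetical and exists only for this illustration.

```python
import json
import re


def wrap_array(value: str) -> str:
    """Wrap a top-level JSON array string in an object with a single "arr" key."""
    # Same patterns as the regexp_replace calls:
    # "[{" becomes '{"arr": [{' and "}]" becomes "}]}".
    value = re.sub(r"\[\{", '{"arr": [{', value)
    return re.sub(r"\}\]", "}]}", value)


raw = '[{"@id":"Party_1","@ObjectID":"Policy_1"},{"@id":"Party_2","@ObjectID":"Policy_2"}]'
wrapped = wrap_array(raw)

# The wrapped string is a JSON object whose "arr" field holds the original array.
parsed = json.loads(wrapped)
print(wrapped)
print(len(parsed["arr"]))
```

Note that the patterns match every "[{" and "}]" in the string, so this rewrite assumes the objects themselves contain no nested arrays of objects.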

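Alternatively, if you want to keep each nested JSON object as a string, you can pass a custom schema to from_json instead of the inferred one.

Output schema of resultDF: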

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- @ObjectID: string (nullable = true)
 |    |    |-- @id: string (nullable = true)

Output:

+---------------------------------------------------------------+
|arr                                                            |
+---------------------------------------------------------------+
|[{Policy_1, Party_1}, {Policy_2, Party_2}, {Policy_3, Party_3}]|
|[{Policy_1, Party_1}, {Policy_2, Party_2}, {Policy_3, Party_3}]|
+---------------------------------------------------------------+

Solution 1
1 Accepted 2021-06-07 08:54:21
