How to split a list of objects into separate columns in a PySpark dataframe

Question

I have a dataframe column that holds a list of objects (an array of structs), for example

column: [{key1:value1}, {key2:value2}, {key3:value3}]

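I want to split this column into separate columns, with the key names as column names and the values as the column values in the same row.
The final result should look like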

key1:value1, key2:value2, key3:value3

How can this be done in pyspark?

For example, here is sample data to create the dataframe:

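# assumes an active SparkSession named `spark`
from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType
)

my_new_schema = StructType([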
    StructField('id', LongType()),
    StructField('countries', ArrayType(StructType([
        StructField('name', StringType()),
        StructField('capital', StringType())
    ])))
])
l = [(1, [
        {'name': 'Italy', 'capital': 'Rome'},
        {'name': 'Spain', 'capital': 'Madrid'}
    ])
]
    
dz = spark.createDataFrame(l, schema=my_new_schema)
# we have array of structs:
dz.show(truncate=False)
+---+--------------------------------+
|id |countries                       |
+---+--------------------------------+
|1  |[{Italy, Rome}, {Spain, Madrid}]|
+---+--------------------------------+

Expected output:

+---+-----+------+
|id |Italy|Spain |
+---+-----+------+
|1  |Rome |Madrid|
+---+-----+------+

Answer 1

Use inline to expand the countries array into one row per struct, then pivot on the country name column:

import pyspark.sql.functions as F

dz1 = dz.selectExpr(
    "id", 
    "inline(countries)"
).groupBy("id").pivot("name").agg(
    F.first("capital")
)

dz1.show()
#+---+-----+------+
#|id |Italy|Spain |
#+---+-----+------+
#|1  |Rome |Madrid|
#+---+-----+------+
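
To make the two steps concrete, here is a rough plain-Python model (not Spark code) of what inline followed by groupBy/pivot/first computes for the sample row. The helper names inline_rows and pivot_first are hypothetical and exist only for this illustration; the sketch assumes each country name appears at most once per id.

```python
def inline_rows(rows):
    """Model of inline(countries): emit one flat row per struct in the array."""
    for row_id, countries in rows:
        for country in countries:
            yield (row_id, country["name"], country["capital"])


def pivot_first(flat_rows):
    """Model of groupBy('id').pivot('name').agg(first('capital'))."""
    result = {}
    for row_id, name, capital in flat_rows:
        columns = result.setdefault(row_id, {})
        # Keep the first capital seen for each (id, name) pair.
        columns.setdefault(name, capital)
    return result


data = [
    (1, [
        {"name": "Italy", "capital": "Rome"},
        {"name": "Spain", "capital": "Madrid"},
    ])
]

flat = list(inline_rows(data))   # long format: one row per country
wide = pivot_first(flat)         # wide format: one column per country name

print(flat)
print(wide)
```

The intermediate long format is why the pivot works: after inline, name and capital are ordinary columns, so each distinct name value can become its own column. If a name could repeat within one id, only one capital would survive the aggregation, so a different aggregate may be more appropriate in that case.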

Solution 1
1 Accepted 2021-12-11 15:00:46
