Splitting a DataFrame column that contains structs into new columns

Question

I have a DataFrame named df with the following structure:

root
|-- country: string (nullable = true)
|-- competition: string (nullable = true)
|-- competitor: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- name: string (nullable = true)
|    |    |-- time: string (nullable = true)

It looks like this:

|country|competition|competitor      |
|____________________________________|
|USA    |WN         |[{Adam, 9.43}]  |
|China  |FN         |[{John, 9.56}]  |
|China  |FN         |[{Adam, 9.48}]  |
|USA    |MNU        |[{Phil, 10.02}] |
|...    |...        |...             |

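I want to pivot (or do something similar to) the competitor column into new columns based on the values in each struct, so that it looks like this: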

|country|competition|Adam|John|Phil  |
|____________________________________| 
|USA    |WN         |9.43|... |...   |
|China  |FN         |9.48|9.56|...   |
|USA    |MNU        |... |... |10.02 |

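The names are unique, so if a column for a name already exists I don't want to create a new one; I want to fill in the value in the existing column instead. There are many names, so this has to be done dynamically.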

My dataset is fairly large, so I can't use Pandas.

Answer 1

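We can extract the struct from the array and then create new columns from the fields of that struct. The name column can then be pivoted, using time as the values.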

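from pyspark.sql import functions as func
# data_sdf is the DataFrame from the question (called df there).
(
    data_sdf
    # Take the single struct out of the array.
    .withColumn('name_time_struct', func.col('competitor')[0])
    .select('country', 'competition', func.col('name_time_struct.*'))
    .groupBy('country', 'competition')
    # One column per distinct name, holding that competitor's time.
    .pivot('name')
    .agg(func.first('time'))
    .show()
)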

+-------+-----------+----+----+-----+
|country|competition|Adam|John| Phil|
+-------+-----------+----+----+-----+
|    USA|        MNU|null|null|10.02|
|    USA|         WN|9.43|null| null|
|  China|         FN|9.48|9.56| null|
+-------+-----------+----+----+-----+

P.S. This assumes that each array contains exactly one struct.

Answer 1 · 0 · accepted · 2022-07-28 14:58:49