How do I unpack a list-typed column in PySpark

Question

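I have a DataFrame df in PySpark with a column whose type is array of strings. I need to generate a new column containing the head (first element) of the list, and another column containing the tail (the remaining elements) joined into a single string.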

This is my original DataFrame:

pyspark> df.show()
+---+------------+
| id|     lst_col|
+---+------------+
|  1|[a, b, c, d]|
+---+------------+


pyspark> df.printSchema()
root
 |-- id: integer (nullable = false)
 |-- lst_col: array (nullable = true)
 |    |-- element: string (containsNull = true)

I need to produce something like this:

pyspark> df2.show()
+---+--------+---------------+
| id|lst_head|lst_concat_tail|
+---+--------+---------------+
|  1|       a|          b,c,d|
+---+--------+---------------+

Answer 1

For Spark 2.4+, you can use the array functions element_at, slice, and size:

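from pyspark.sql.functions import element_at, expr

# element_at is 1-based, so index 1 is the first element (the head);
# slice(lst_col, 2, size(lst_col)) keeps everything from the second element on
df.select("id",
          element_at("lst_col", 1).alias("lst_head"),
          expr("slice(lst_col, 2, size(lst_col))").alias("lst_concat_tail")
         ).show()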

This gives:

+---+--------+---------------+
| id|lst_head|lst_concat_tail|
+---+--------+---------------+
|  1|       a|      [b, c, d]|
+---+--------+---------------+
