PySpark DataFrame: get the second-lowest value in each row

Question

Does anyone have an idea how to get the second-lowest value in each row of a DataFrame in PySpark?

For example:

Input DataFrame:

Col1  Col2  Col3  Col4 
83    32    14    62   
63    32    74    55   
13    88     6    46

Expected output:

Col1  Col2  Col3  Col4 Res
83    32    14    62   32   
63    32    74    55   55   
13    88     6    46   13

Answer 1

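We can concatenate all the columns of the row with the concat_ws function and then use split to create an array.

Then sort the array with array_sort and extract its second element, [1] (array indexing is zero-based). Note that split yields strings, so array_sort orders them lexicographically unless the array is cast to a numeric type first.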
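To see why the element type matters, here is a minimal plain-Python sketch of the same steps (join, split, sort, take index 1). The helper name `second_lowest_via_split` is hypothetical and only mirrors the Spark functions by analogy; it is not Spark code.

```python
def second_lowest_via_split(row, numeric):
    """Mimic concat_ws -> split -> array_sort -> [1] on one row of strings."""
    joined = ",".join(row)                # analogous to concat_ws(',', ...)
    parts = joined.split(",")             # analogous to split(..., ','): yields strings
    if numeric:
        parts = [int(p) for p in parts]   # analogous to casting to array<int>
    return sorted(parts)[1]               # index 1 is the second element (0-based)

row = ["13", "88", "6", "46"]
as_strings = second_lowest_via_split(row, numeric=False)
as_ints = second_lowest_via_split(row, numeric=True)
print(as_strings, as_ints)  # 46 13
```

Sorted as strings, "6" lands after "46", so the second element is "46"; sorted as integers, it is 13, which is what the question expects.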

Example:

from pyspark.sql.functions import *

df=spark.createDataFrame([('83','32','14','62'),('63','32','74','55'),('13','88','6','46')],['Col1','Col2','Col3','Col4'])

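# split() returns strings; cast to array<int> so the sort is numeric rather than lexicographic
df.selectExpr("array_sort(cast(split(concat_ws(',',Col1,Col2,Col3,Col4),',') as array<int>))[1] Res").show()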

#+---+
#|Res|
#+---+
#|32 |
#|55 |
#|13 |
#+---+

A more dynamic way:

# * expands to all columns; the Res alias names the output column
df.selectExpr("array_sort(cast(split(concat_ws(',',*),',') as array<int>))[1] Res").show()

#+---+
#|Res|
#+---+
#|32 |
#|55 |
#|13 |
#+---+

EDIT:

#adding Res column to the dataframe (array cast to int so the sort is numeric)
df1=df.selectExpr("*","array_sort(cast(split(concat_ws(',',*),',') as array<int>))[1] Res")
df1.show()

#+----+----+----+----+---+
#|Col1|Col2|Col3|Col4|Res|
#+----+----+----+----+---+
#|  83|  32|  14|  62| 32|
#|  63|  32|  74|  55| 55|
#|  13|  88|   6|  46| 13|
#+----+----+----+----+---+

Answer 2

You can use the array function to create an array column and then sort it with array_sort. Finally, use element_at to get the second element. The last two functions are available from Spark 2.4+.

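# Cast to int so the sort is numeric even when the columns are strings.
# element_at is 1-based, so index 2 is the second-lowest value.
df.withColumn("res", element_at(array_sort(array(*[col(c).cast("int") for c in df.columns])), 2))\
  .show()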

#+----+----+----+----+---+
#|Col1|Col2|Col3|Col4|res|
#+----+----+----+----+---+
#|83  |32  |14  |62  |32 |
#|63  |32  |74  |55  |55 |
#|13  |88  |6   |46  |13 |
#+----+----+----+----+---+
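The two answers index differently: the bracket syntax in Answer 1 is zero-based, while element_at is one-based. A small plain-Python sketch of that difference follows; `element_at_1_based` is a hypothetical helper written for illustration, not the Spark function itself.

```python
def element_at_1_based(arr, index):
    """Hypothetical helper: 1-based lookup on a Python list."""
    if index == 0:
        raise ValueError("index 0 is not valid for 1-based lookup")
    # Positive indexes shift down by one; negative indexes count from the end.
    pos = index - 1 if index > 0 else index
    # Out-of-range lookups return None instead of raising.
    return arr[pos] if -len(arr) <= pos < len(arr) else None

sorted_row = sorted([13, 88, 6, 46])
print(sorted_row[1], element_at_1_based(sorted_row, 2))  # 13 13
```

Both expressions pick the same element, so `[1]` in one answer and `2` in the other refer to the same second-lowest value.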

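Another approach is to use the least function. First compute the minimum across all columns, then apply least a second time, using a when expression to keep only the values greater than that minimum. Note that this returns the second-lowest distinct value, so it differs from the array approach when the minimum occurs more than once in a row: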

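# Cast to int so comparisons are numeric; least() skips the nulls that when() yields for values equal to min
cols = [col(c).cast("int") for c in df.columns]
df.withColumn("min", least(*cols))\
  .withColumn("res", least(*[when(c > col("min"), c) for c in cols]))\
  .drop("min")\
  .show()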
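The sort-based and least-based approaches agree on the sample data but can diverge when a row contains ties. The plain-Python sketch below illustrates this; the function names and the extra rows with duplicates are assumptions made for the example and do not come from the question.

```python
def second_by_sort(values):
    """Second element of the sorted row (duplicates count separately)."""
    return sorted(values)[1]

def second_by_least(values):
    """Smallest value strictly greater than the row minimum, or None."""
    lowest = min(values)
    # Analogous to when(col > min, col): values equal to the minimum drop out.
    greater = [v for v in values if v > lowest]
    # Analogous to least() returning null when every input is null.
    return min(greater) if greater else None

rows = [[83, 32, 14, 62], [5, 5, 7, 9], [4, 4, 4, 4]]
results = [(second_by_sort(r), second_by_least(r)) for r in rows]
print(results)  # [(32, 32), (5, 7), (4, None)]
```

Which behaviour is right depends on whether "second lowest" should count a repeated minimum once or twice.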

Answer 1 (score 2, accepted): 2020-03-02 21:25:02

Answer 2 (score 1): 2020-03-03 12:34:20
