Maximum value across certain columns of a Spark Scala DataFrame

Question

I have a DataFrame like this.

+---+---+---+---+
|  M| c2| c3| d1|
+---+---+---+---+
|  1|2_1|4_3|1_2|
|  2|3_4|4_5|1_2|
+---+---+---+---+

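I have to transform this df so that it looks like the one below. Here, c_max = max(c2, c3) after splitting on _; that is, every column (c2 and c3) must be split on _, and then the maximum of all the resulting values is taken.

In the real scenario I have 50 columns, i.e. c2, c3, ..., c50, and need to get the maximum across them.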

+---+---+---+---+------+
|  M| c2| c3| d1|c_Max |
+---+---+---+---+------+
|  1|2_1|4_3|1_2|  4   |
|  2|3_4|4_5|1_2|  5   |
+---+---+---+---+------+

Answer 1

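Here is one way to do it for Spark >= 2.4.0, using expr and the built-in array functions:

import org.apache.spark.sql.functions.{expr, array_max, array}
import spark.implicits._  // for toDF; assumes a SparkSession named `spark` (already imported in spark-shell)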

val df = Seq(
  (1, "2_1", "3_4", "1_2"),
  (2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")

// get max c for each c column 
val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
  expr(s"array_max(cast(split(${c}, '_') as array<int>))")
}

df.withColumn("max_c", array_max(array(c_cols:_*))).show

Output:

+---+---+---+---+-----+
|  M| c2| c3| d1|max_c|
+---+---+---+---+-----+
|  1|2_1|3_4|1_2|    4|
|  2|3_4|4_5|1_2|    5|
+---+---+---+---+-----+

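For older versions, use the following code, which assumes each column holds exactly two values:

import org.apache.spark.sql.functions.{col, greatest, split, when}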

val c_cols = df.columns.filter(_.startsWith("c")).map{ c =>
  val c_ar = split(col(c), "_").cast("array<int>")
  when(c_ar.getItem(0) > c_ar.getItem(1), c_ar.getItem(0)).otherwise(c_ar.getItem(1))
}

df.withColumn("max_c", greatest(c_cols:_*)).show
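
Both snippets in this answer pick the columns with startsWith("c"), which would also match any unrelated column whose name happens to begin with c. A minimal sketch of a stricter filter in plain Scala, assuming the real columns are named c followed by digits (c2 ... c50); the column list and the name city are made up for illustration:

```scala
// Hypothetical column list; in Spark this would come from df.columns.
val allColumns = Seq("M", "c2", "c3", "c50", "city", "d1")

// Keep only names of the form c<digits>, e.g. c2, c3, ..., c50.
// String.matches tests the whole name, so "city" is rejected.
val cColumns = allColumns.filter(_.matches("c\\d+"))

println(cColumns.mkString(", "))
```

The same predicate could replace _.startsWith("c") in the filter calls above.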

Answer 2

Use the greatest function:

import org.apache.spark.sql.functions.{col, greatest, split}

val df = Seq(
  (1, "2_1", "3_4", "1_2"),
  (2, "3_4", "4_5", "1_2")
).toDF("M", "c2", "c3", "d1")

// get all `c` columns and split by `_` to get the values on both sides of the underscore
val c_cols = df.columns.filter(_.startsWith("c"))
                       .flatMap{
                           c => Seq(split(col(c), "_").getItem(0).cast("int"), 
                                    split(col(c), "_").getItem(1).cast("int")
                                )
                        } 

// apply greatest func
val c_max = greatest(c_cols: _*)

// add new column
df.withColumn("c_Max", c_max).show()

This gives:

+---+---+---+---+-----+
|  M| c2| c3| d1|c_Max|
+---+---+---+---+-----+
|  1|2_1|3_4|1_2|    4|
|  2|3_4|4_5|1_2|    5|
+---+---+---+---+-----+

Answer 3

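In Spark >= 2.4.0, you can use the array_max function and end up with code that works even when the columns contain more than 2 values. The idea is to start by concatenating all the columns (the concat column). To do that, I use concat_ws on an array of all the columns to concatenate, which I build with array(cols.map(col) :_*). I then split the resulting string to get one big array of strings containing all the values of all the columns. I cast it to an array of ints and call array_max on it.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, IntegerType}

val cols = (2 to 50).map("c"+_)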

val result = df
    .withColumn("concat", concat_ws("_", array(cols.map(col) :_*)))
    .withColumn("array_of_ints", split('concat, "_").cast(ArrayType(IntegerType)))
    .withColumn("c_max", array_max('array_of_ints))
    .drop("concat", "array_of_ints")
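
To make the steps easier to follow, here is a rough plain-Scala model of what happens to a single row (no Spark involved). The helper name rowMax is made up for illustration, and unlike Spark's cast, toInt throws on a non-numeric piece instead of producing null:

```scala
// Plain-Scala model of one row: the string values of its c columns.
def rowMax(values: Seq[String]): Int =
  values
    .mkString("_")   // like concat_ws("_", ...): "2_1_3_4"
    .split("_")      // like split(..., "_"): Array("2", "1", "3", "4")
    .map(_.toInt)    // like the cast to an array of ints
    .max             // like array_max

println(rowMax(Seq("2_1", "3_4"))) // 4
println(rowMax(Seq("3_4", "4_5"))) // 5
```

Because everything ends up in one flat array, the number of values per column does not matter.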

In Spark < 2.4, you can define array_max yourself like this:

import org.apache.spark.sql.functions.udf

val array_max = udf((s : Seq[Int]) => s.max)

The preceding code needs no modification. Note, however, that UDFs can be slower than the built-in Spark SQL functions.
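
One caveat with this UDF: s.max throws on an empty sequence, and a null array would fail as well. A minimal sketch of a more defensive helper in plain Scala; the name safeMax is made up here, and wrapping it with udf(safeMax _) is an assumption that has not been run against a cluster:

```scala
// Returns None for a null or empty input instead of throwing.
def safeMax(s: Seq[Int]): Option[Int] =
  if (s == null || s.isEmpty) None else Some(s.max)

println(safeMax(Seq(2, 1, 3, 4)))
println(safeMax(Seq.empty))
```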

3 answers

Answer 1
4 2020-01-02 09:47:52

Answer 2
3 2020-01-02 09:18:07

Answer 3
1 (accepted) 2020-01-02 09:57:11
