
Apply UDF to multiple columns in Spark Dataframe

I have a dataframe that looks like the following:

+---+----+------+-----+---+---+-----+---+---+--------------+
| id| age|   rbc|  bgr| dm|cad|appet| pe|ane|classification|
+---+----+------+-----+---+---+-----+---+---+--------------+
|  3|48.0|normal|117.0| no| no| poor|yes|yes|           ckd|
...

I wrote a UDF to convert the categorical values yes, no, poor, normal into binary 0s and 1s:

def stringToBinary(stringValue: String): Int = {
    stringValue match {
        case "yes" => 1
        case "no" => 0
        case "present" => 1
        case "notpresent" => 0
        case "normal" => 1
        case "abnormal" => 0
    }
}

val stringToBinaryUDF = udf(stringToBinary _)
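
Note in passing that this match is not exhaustive: any value outside the six listed cases (for example "poor" or "ckd") throws a scala.MatchError at runtime. A minimal defensive sketch; the helper name stringToBinarySafe and the catch-all default of 0 are assumptions, not part of the original code:

def stringToBinarySafe(stringValue: String): Int = stringValue match {
    case "yes" | "present" | "normal" => 1
    case "no" | "notpresent" | "abnormal" => 0
    case _ => 0 // assumed default to avoid scala.MatchError; adjust to your data
}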

I apply it to the dataframe as follows:

val newCol = stringToBinaryUDF.apply(col("pc")) //creates the new column with formatted value
val refined1 = noZeroDF.withColumn("dm", newCol) //adds the new column to original

How do I pass multiple columns into the UDF so that I don't have to repeat myself for the other categorical columns?

A UDF should not be chosen when built-in Spark functions can do the same job, as a UDF would serialize and deserialize the column data.

Given a dataframe as

+---+----+------+-----+---+---+-----+---+---+--------------+
|id |age |rbc   |bgr  |dm |cad|appet|pe |ane|classification|
+---+----+------+-----+---+---+-----+---+---+--------------+
|3  |48.0|normal|117.0|no |no |poor |yes|yes|ckd           |
+---+----+------+-----+---+---+-----+---+---+--------------+

you can use the when function as:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def applyFunction(column: Column) = when(column === "yes" || column === "present" || column === "normal", lit(1))
  .otherwise(when(column === "no" || column === "notpresent" || column === "abnormal", lit(0)).otherwise(column))

df.withColumn("dm", applyFunction(col("dm")))
  .withColumn("cad", applyFunction(col("cad")))
  .withColumn("rbc", applyFunction(col("rbc")))
  .withColumn("pe", applyFunction(col("pe")))
  .withColumn("ane", applyFunction(col("ane")))
  .show(false)

The result is:

+---+----+---+-----+---+---+-----+---+---+--------------+
|id |age |rbc|bgr  |dm |cad|appet|pe |ane|classification|
+---+----+---+-----+---+---+-----+---+---+--------------+
|3  |48.0|1  |117.0|0  |0  |poor |1  |1  |ckd           |
+---+----+---+-----+---+---+-----+---+---+--------------+
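
Note that because the otherwise branch falls back to the original string column, Spark coerces both branches to a common type, so the transformed columns are still strings ("1"/"0") rather than integers. If actual integers are needed, a cast can be appended; a minimal sketch (the cast is an addition, not part of the original answer):

// Cast after applying the function; values that remain non-numeric
// strings after the mapping would become null under the cast.
df.withColumn("dm", applyFunction(col("dm")).cast("int"))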

Now, the question clearly states that you don't want to repeat the process for all the columns, so you can do the following:

val columnsToMap = df.select("dm", "cad", "rbc", "pe", "ane").columns

var tempdf = df
columnsToMap.foreach { column =>
  tempdf = tempdf.withColumn(column, applyFunction(col(column)))
}

tempdf.show(false)

A UDF can take many parameters, i.e. many columns, but it should return one result, i.e. one column.

To do so, simply add parameters to your stringToBinary function.

If you want it to take two columns, it would look like this:

def stringToBinary(stringValue: String, secondValue: String): Int = {
    // secondValue is accepted but not used; this only illustrates
    // the two-parameter signature.
    stringValue match {
        case "yes" => 1
        case "no" => 0
        case "present" => 1
        case "notpresent" => 0
        case "normal" => 1
        case "abnormal" => 0
    }
}

val stringToBinaryUDF = udf(stringToBinary _)
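
For completeness, a two-parameter UDF is invoked with one Column argument per parameter and still yields a single column; a minimal sketch, assuming the result should land in a new column named pe_bin (the name is hypothetical):

// As written above, only the first argument is inspected; the second
// merely demonstrates passing two columns to one UDF.
val withBinary = df.withColumn("pe_bin", stringToBinaryUDF(col("pe"), col("ane")))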

Hope this helps.

You can also use the foldLeft function. With your UDF named stringToBinaryUDF:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

val categoricalColumns = Seq("dm", "cad", "rbc", "pe", "ane")
val refinedDF = categoricalColumns
    .foldLeft(noZeroDF) { (accumulatorDF: DataFrame, columnName: String) =>
        accumulatorDF
            .withColumn(columnName, stringToBinaryUDF(col(columnName)))
    }

This respects immutability and functional programming.
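
The two answers can also be combined: folding the built-in applyFunction from the first answer over the column list avoids a UDF entirely. A minimal sketch under that assumption (noZeroDF is the original dataframe, as in the question):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// foldLeft threads the dataframe through one withColumn per column,
// reusing the when/otherwise-based applyFunction instead of a UDF.
val binaryColumns = Seq("dm", "cad", "rbc", "pe", "ane")
val result: DataFrame = binaryColumns.foldLeft(noZeroDF) { (df, c) =>
  df.withColumn(c, applyFunction(col(c)))
}
result.show(false)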

