How to use Option in a Spark UDF

Question

I have a data set like this:

+----+------+
|code|status|
+----+------+
|   1| "new"|
|   2|  null|
|   3|  null|
+----+------+

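I want to write a UDF that depends on both columns.

I got it working by following the second approach in this answer, which is to handle null outside the UDF and write myFn to take a Boolean as its second parameter: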

df.withColumn("new_column",
  when(df("status").isNull,
    myFnUdf($"code", lit(false))
  )
  .otherwise(
    myFnUdf($"code", lit(true))
  )
)

To handle null inside the UDF instead, I looked at the approach from this answer that involves "wrapping arguments with Options". I tried code like this:

df.withColumn("new_column", myFnUdf($"code", $"status"))

def myFn(code: Int, status: String) = (code, Option(status)) match {
  case (1, "new") => "1_with_new_status"
  case (2, Some(_)) => "2_with_any_status"
  case (3, None) => "3_no_status"
}

But the rows with null give type mismatch; found :None.type required String. I also tried wrapping an argument with Option when creating the udf, without success. The basic form of this (without Option) looks like this:

val myFnUdf = udf[String, Int, String](myFn(_:Int, _:String))

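I'm new to Scala, so I'm sure I'm missing something simple. Part of my confusion may come from the different syntaxes for creating a udf from a function (e.g. per https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html ), so I'm not sure I'm using the best method. Any help appreciated!

EDIT

Edited to add the missing (1, "new") case, per the comments from @user6910411 and @sgvd.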

Answer 1

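First, some of the code you are using may be missing here. When I convert your example myFn into a UDF with val myFnUdf = udf(myFn _) and run it with df.withColumn("new_column", myFnUdf($"code", $"status")).show, I don't get a type mismatch but a MatchError, as user6910411 also noted. This is because there is no pattern that matches (1, "new").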
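The MatchError can be reproduced in plain Scala, without Spark. The following is a minimal sketch, assuming a cut-down version of the question's function; the name incomplete and the use of Try are illustrative and not part of the original code.

```scala
import scala.util.Try

// Hypothetical cut-down version of the question's myFn: it has no case
// that covers the row (1, "new").
def incomplete(code: Int, status: String): String = (code, Option(status)) match {
  case (2, Some(_)) => "2_with_any_status"
  case (3, None)    => "3_no_status"
}

// (1, Some("new")) matches neither pattern, so the call throws scala.MatchError.
// Try captures the exception so the failure can be inspected.
val result = Try(incomplete(1, "new"))
println(result.isFailure) // prints true
```

The compiler may warn that the match is not exhaustive, but it still compiles; the failure only shows up at runtime, when a row hits the uncovered combination.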

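Apart from that, although it is usually better to use Scala's Options rather than raw null values, you don't have to in this case. The following example works with null directly: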

val my_udf = udf((code: Int, status: String) => status match {
    case null => "no status"
    case _ => "with status"
})

df.withColumn("new_column", my_udf($"code", $"status")).show

Result:

+----+------+-----------+
|code|status| new_column|
+----+------+-----------+
|   1|   new|with status|
|   2|  null|  no status|
|   3|  null|  no status|
+----+------+-----------+

Wrapping with Option still works, though:

val my_udf = udf((code: Int, status: String) => Option(status) match {
    case None => "no status"
    case Some(_) => "with status"
})

This gives the same result.
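Applied to the two-column function from the question, the same idea might look like the sketch below. Every pattern has to match the type (Int, Option[String]), so the status is written as Some("new") rather than "new". The final fallback case and its "unknown" label are assumptions added for illustration; they are not in the original code.

```scala
// Sketch of the question's myFn with patterns that match (Int, Option[String]).
// Option(status) turns a null status into None and any other value into Some(value).
def myFn(code: Int, status: String): String = (code, Option(status)) match {
  case (1, Some("new")) => "1_with_new_status"
  case (2, Some(_))     => "2_with_any_status"
  case (3, None)        => "3_no_status"
  case _                => "unknown" // hypothetical default; avoids a MatchError
}

// The three rows from the question's data set.
println(myFn(1, "new"))
println(myFn(2, null))
println(myFn(3, null))
```

With the sample data, row 2 has a null status, so it falls through to the default case. The function can then be turned into a UDF with udf(myFn _), as above.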

Solution 1
3 Accepted 2016-12-15 10:21:11

如何在Spark UDF中使用Option

問題描述

1 個解決方案

解決方案1 3 已采納 2016-12-15 10:21:11

解決方案1
3 已采納 2016-12-15 10:21:11