简体   繁体   中英

How to use Option with Spark UDF

I have a Dataset like this:

+----+------+
|code|status|
+-----------+
|   1| "new"|
|   2|  null|
|   3|  null|
+----+------+

I would like to write a UDF that depends on both columns.

I got it working following the second approach in this answer which is to handle null outside the UDF, and write myFn to take a Boolean as the second parameter:

df.withColumn("new_column",
  when(pst_regs("status").isNull, 
    myFnUdf($"code", lit(false))
  )
  .otherwise(
    myFnUdf($"code", lit(true))
  )
)

To handle null in the UDF an approach I looked at is per this answer which talks about "wrapping arguments with Options ". I tried code like this:

df.withColumn("new_column", myFnUdf($"code", $"status"))

def myFn(code: Int, status: String) = (code, Option(status)) match {
  case (1, "new") => "1_with_new_status"
  case (2, Some(_)) => "2_with_any_status"
  case (3, None) => "3_no_status"
}

But a row with null gives type mismatch; found :None.type required String type mismatch; found :None.type required String . I also tried wrapping an argument with Option during udf creation without success. The basic form of this (without Option) looks like this:

myFnUdf = udf[String, Int, String](myFn(_:Int, _:String))

I'm new to Scala so I'm sure I'm missing something simple. Part of my confusion may be the different syntaxes for creating udfs from functions (eg per https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html ), so I'm not sure I'm using the best way. Any help appreciated!

EDIT

Edited to add missing (1, "new") case per @user6910411 and @sgvd comments.

Firstly, there may be some code you are using that we are missing here. When I try your example myFn , made into a UDF with val myFnUdf = udf(myFn _) and run it with df.withColumn("new_column", myFnUdf($"code", $"status")).show , I don't get a type mismatch, but instead a MatchError , as also noted by user6910411. This is becausethere is no pattern to match (1, "new") .

Aside from that, although it is usually better to use Scala's Options rather than raw null values, you don't have to in this case. The following example works with null directly:

val my_udf = udf((code: Int, status: String) => status match {
    case null => "no status"
    case _ => "with status"
})

df.withColumn("new_column", my_udf($"code", $"status")).show

Result:

+----+------+-----------+
|code|status| new_column|
+----+------+-----------+
|   1|   new|with status|
|   2|  null|  no status|
|   2|  null|  no status|
+----+------+-----------+

Wrapping with options does still work though:

val my_udf = udf((code: Int, status: String) => Option(status) match {
    case None => "no status"
    case Some(_) => "with status"
})

This gives the same result.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM