Spark UDF is not being called

Question

Given the following example:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._

val testUdf: UserDefinedFunction = udf((a: String, b: String, c: Int) => { 
  val out = s"test1: $a $b $c"
  println(out)
  out
})

val testUdf2: UserDefinedFunction = udf((a: String, b: String, c: String) => { 
  val out = s"test2: $a $b $c"
  println(out)
  out
})

Seq(("hello", "world", null))
.toDF("a", "b", "c")
.withColumn("c", $"c" cast "Int")
.withColumn("test1", testUdf($"a", $"b", $"c"))
.withColumn("test2", testUdf2($"a", $"b", $"c"))
.show

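testUdf does not seem to be called. There is no error and no warning; it just returns null.

Is there a way to detect these silent failures? Also, what is happening here?

Spark 2.4.4, Scala 2.11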

Answer 1

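The Scala type "Int" does not allow null values. When a UDF parameter has a primitive type and the input value is null, Spark skips the call and returns null instead. The type of the parameter "c" can be changed to "Integer".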
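To illustrate the point about "Int" versus "Integer", here is a minimal plain-Scala sketch with no Spark involved. The object name NullableIntDemo and the helper describe are hypothetical and exist only for this illustration.

```scala
object NullableIntDemo {
  // java.lang.Integer is a reference type, so it can hold null.
  val boxed: java.lang.Integer = null

  // Scala's Int is a primitive value type. `val c: Int = null` does not
  // compile, and forcing null into an Int yields the default value 0.
  val unboxed: Int = null.asInstanceOf[Int]

  // Hypothetical helper mirroring the body of the UDF in the question,
  // but taking the boxed type so a null can reach the function body.
  def describe(c: java.lang.Integer): String = s"test1: hello world $c"

  def main(args: Array[String]): Unit = {
    println(boxed == null)   // the boxed value really is null
    println(unboxed)         // the primitive cannot represent null
    println(describe(boxed)) // the interpolated string renders null as text
  }
}
```

Because a primitive parameter has no way to represent a missing value, declaring it as Integer (or Option[Int]) is what lets the null reach the function body.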

Answer 2

I don't know exactly what causes this, but I think it is most likely due to an implicit conversion.

Code 1

    val spark = SparkSession.builder()
      .master("local")
      .appName("test")
      .getOrCreate()
    import spark.implicits._
    val testUdf: UserDefinedFunction = udf((a: String, b: String, c: Int) => {
      val out = s"test1: $a $b $c"
      println(out)
      out
    })
    
    Seq(("hello", "world", null))
      .toDF("a", "b", "c")
      .withColumn("test1", testUdf($"a", $"b", $"c"))
      .show

Code 2

    val spark = SparkSession.builder()
      .master("local")
      .appName("test")
      .getOrCreate()
    import spark.implicits._
    val testUdf: UserDefinedFunction = udf((a: String, b: String, c: String) => {
      val out = s"test1: $a $b $c"
      println(out)
      out
    })

    Seq(("hello", "world", null))
      .toDF("a", "b", "c")
      .withColumn("test1", testUdf($"a", $"b", $"c"))
      .show

Logical plan for code 1

Logical plan for code 2
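One way to inspect the plans and spot the silent null handling is explain(true). The sketch below is an assumption-laden illustration, not the answerer's exact code: the object name UdfPlanDemo and the method analyzedPlan are hypothetical, and Option.empty[Int] is used instead of a bare null so that the column gets a nullable integer type. On Spark 2.4 the analyzed plan is expected to wrap the UDF call in a null check, roughly of the form `if (isnull(c)) null else UDF(a, b, c)`, which would explain why the function body never runs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfPlanDemo {
  // Builds the DataFrame and returns the analyzed logical plan as text.
  def analyzedPlan(spark: SparkSession): String = {
    import spark.implicits._

    // Same shape as the UDF in the question: the third parameter is a primitive Int.
    val testUdf = udf((a: String, b: String, c: Int) => s"test1: $a $b $c")

    // Assumption: Option.empty[Int] stands in for the null value so the
    // column "c" is a nullable integer column.
    val df = Seq(("hello", "world", Option.empty[Int]))
      .toDF("a", "b", "c")
      .withColumn("test1", testUdf($"a", $"b", $"c"))

    df.explain(true) // prints the parsed, analyzed, optimized and physical plans
    df.queryExecution.analyzed.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("test")
      .getOrCreate()
    try println(analyzedPlan(spark))
    finally spark.stop()
  }
}
```

Searching the plan text for an isnull guard around the UDF is a rough way to detect that Spark may short-circuit the call for null inputs.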

Answer 3

You should get a scala.MatchError: scala.Null error when you try to cast the null value. Besides, your UDF definition doesn't work for me: I got a java.lang.UnsupportedOperationException: Schema for type AnyRef is not supported when I tried to register it.

How about this:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._

def testUdf(a: String, b: String, c: Integer): String = { 
  val out = s"test1: $a $b $c"
  println(out)
  out
}

def testUdf2(a: String, b: String, c: String): String = { 
  val out = s"test2: $a $b $c"
  println(out)
  out
}

val yourTestUDF = udf(testUdf _)
val yourTestUDF2 = udf(testUdf2 _)

// spark.udf.register("yourTestUDF", yourTestUDF) // just in case you need it in SQL

spark.createDataFrame(Seq(("hello", "world", null.asInstanceOf[Integer])))
.toDF("a", "b", "c")
.withColumn("test1", yourTestUDF($"a", $"b", $"c"))
.withColumn("test2", yourTestUDF2($"a", $"b", $"c"))
.show(false)

Output:

test1: hello world null
test2: hello world null
+-----+-----+----+-----------------------+-----------------------+
|a    |b    |c   |test1                  |test2                  |
+-----+-----+----+-----------------------+-----------------------+
|hello|world|null|test1: hello world null|test2: hello world null|
+-----+-----+----+-----------------------+-----------------------+
