
Why does this List[String] to DataFrame cause a NullPointerException in Spark Scala?

The following code snippet is causing a NullPointerException. I am not sure whether the exception happens on some rows or always, as the DataFrame is huge and I am not able to pinpoint the offending row.

def removeUnwantedLetters(str: String): String = {
    str.split("\\W+").filter(word => (word.matches("[a-z]+") && (word.length > 1))).mkString(" ")
}

val myudf = spark.udf.register("learningUDF", (f1: String, f2: String) => {
    if(f1 != null && f2 != null) {
        val preproList = List(removeUnwantedLetters(f2.toLowerCase));

        if (preproList.size > 0) {
            val key_items = preproList.toDF("Description") // this line turns out to be the one causing the NPE (see below)
        }
    }

    (1, 1)
})



mydataframe.withColumn("pv", myudf($"f1", $"f2")).show

The entire code is huge, so apologies for not pasting all of it here; I have tried my best to minimize the failing code. The following is the exception I get from the actual code:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 274.0 failed 4 times, most recent failure: Lost task 0.3 in stage 274.0 (TID 23387, 10.62.145.186, executor 2): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string, string, string, string, string, string, string, string, string, string, string, string) => struct<_1:int,_2:double>)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ScalaUDF$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_26$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:254)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at $anonfun$1.apply(<console>:100)
    at $anonfun$1.apply(<console>:82)
    ... 22 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2048)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2067)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
  ... 66 elided
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string, string, string, string, string, string, string, string, string, string, string, string) => struct<_1:int,_2:double>)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ScalaUDF$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_26$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:393)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:254)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  ... 3 more
Caused by: java.lang.NullPointerException
  at $anonfun$1.apply(<console>:100)
  at $anonfun$1.apply(<console>:82)
  ... 22 more

By trial and error I have found that the line `val key_items = preproList.toDF("Description")` is causing the NPE, because if I simply change it to `val key_items = preproList`, it works fine.

Can anyone please let me know when `val key_items = preproList.toDF("Description")` would give a `NullPointerException`?

Update

It seems like we cannot create a DataFrame inside a UDF, because I tried replacing `val key_items = preproList.toDF("Description")` with `val key_items = List(1,2,3,4).toDF("VL")`, and to my surprise it failed with the same exception.

Is it not possible to create a temporary DataFrame inside a UDF?

Update 2

I am trying to create a temporary DataFrame so that I can use the JohnSnowLabs Norvig spell-correction model through its pipeline, as follows:

val nlpPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("Description").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  norvigspell.setInputCols("tokens").setOutputCol("Description_corrected"),
  new Finisher().setInputCols("Description_corrected")
))

val dbDF = preproList.toDF("Description")

val spellcorrectedDF = dbDF.transform(df => nlpPipeline.fit(df).transform(df))

The short answer is: no, you can't create a DataFrame (or Dataset) inside a UDF. UDFs operate on individual row values and so are required to return simple values that can be stored in a new column; think of them as calculated columns. If you could create a DataFrame inside a UDF, it would only have one row, and you would end up creating many of them, one per row of the parent DataFrame.
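
As for why this surfaces as a NullPointerException: most likely because toDF relies on the SparkSession (through its implicits), which is only valid on the driver; a UDF runs on an executor, where that reference resolves to null. A minimal sketch of the distinction, using hypothetical names (cleanUdf, badUdf) and reusing removeUnwantedLetters from above:

import org.apache.spark.sql.functions.udf

// Fine: the UDF maps each row's value to a simple result (a calculated column).
val cleanUdf = udf { (s: String) =>
  if (s != null) removeUnwantedLetters(s.toLowerCase) else ""
}

// Not fine: toDF touches the driver-side SparkSession, so calling it from
// inside a UDF (executor side) fails at runtime, as observed above.
// val badUdf = udf { (s: String) => List(s).toDF("Description").count() }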

Now, from your code it is difficult to tell what you want to do. As far as I can see, you are attempting some sort of character clean-up, storing the result in a key_items value (as a DataFrame) that is never used, only to return a constant (1, 1) pair regardless of the previous computation. The fact that your UDF takes two parameters and only uses one is puzzling to me too.

I will guess that you want to compute the description based on the value of one given column (you are only using one), so something like the following will get you close:

def removeUnwantedLetters(str: String): String = {
    str.split("\\W+").filter(word => (word.matches("[a-z]+") && (word.length > 1))).mkString(" ")
}

val myudf = spark.udf.register("learningUDF", (f1: String) => {
    if (f1 != null) {
        removeUnwantedLetters(f1.toLowerCase)
    } else ""
})

// This seems to be the DataFrame you are looking for
val descriptionDF = mydataframe
  .withColumn("Description", myudf($"f2"))
  .select("Description")

With the above, Spark can create the Description column by invoking your UDF over all the values of the DataFrame. Then, by using `.select("Description")`, you create a new DataFrame that only has the Description column.
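
From there, the spell-correction pipeline from your second update can be fit and applied once to the whole DataFrame on the driver, instead of per row inside a UDF. A rough sketch, reusing the nlpPipeline from your question and the descriptionDF defined above (untested; the exact output columns produced by the Finisher depend on your spark-nlp version and settings):

// Fit the pipeline once over the cleaned descriptions and apply it to the same DataFrame.
val spellCorrectedDF = nlpPipeline.fit(descriptionDF).transform(descriptionDF)
spellCorrectedDF.show(false)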


 