
Spark throws java.lang.NullPointerException when filtering an RDD with a Java phonetic matching library on null values

I have an RDD that I created from a DataFrame using map:

import org.apache.spark.sql.Row

case class Record(id_1: Int, fnam_1: String, lnam_1: String, id_2: Long, fnam_2: String, lnam_2: String)
val rdd = df.map {
  case Row(id_1: Int, fnam_1: String, lnam_1: String, id_2: Long, fnam_2: String, lnam_2: String) =>
    Record(id_1, fnam_1, lnam_1, id_2, fnam_2, lnam_2)
}

Then I filter this RDD using a Java phonetic matching library, as shown below:

import edu.ualr.oyster.utilities.DoubleMetaphone

def matchFirstName(rec: Record) = {
  val s1 = Option(rec.fnam_1).getOrElse("")
  val s2 = Option(rec.fnam_2).getOrElse("")
  if (s1.isEmpty || s2.isEmpty)
    false
  else
    new DoubleMetaphone().compareDoubleMetaphone(s1, s2)
}

val rdd_filtered = rdd.filter(matchFirstName(_))

When I run this, I get a NullPointerException:

17/04/06 19:06:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 160, my.work.cluster.com): java.lang.NullPointerException
    at edu.ualr.oyster.utilities.DoubleMetaphone.compareDoubleMetaphone(DoubleMetaphone.java:1020)
    at funpackage.EntityResolution$.phoneticMatching(EntityResolution.scala:106)
    at esurance.EntityResolution$.esurance$EntityResolution$$matchNames$1(EntityResolution.scala:118)
    at esurance.EntityResolution$$anonfun$8.apply(EntityResolution.scala:137)
    at esurance.EntityResolution$$anonfun$8.apply(EntityResolution.scala:137)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I tried the phonetic matching on a pair of strings in the same project and it worked without problems. I have also used the same library in Spark SQL, wrapped in a user-defined function, with no problems. I suspect the failure is caused by some of my values being missing (null), but I tried to handle that with the Option wrapping above. Any idea why this is failing?

I did not dig into the edu.ualr.oyster library to confirm that it was causing the exception, but that seems to be the case: after I switched to the org.apache.commons.codec.language library (same double metaphone function), the program ran on Spark with no problems.
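
Since the stack trace points inside the library call rather than at the null checks, one way to narrow this down is to wrap the call so that a failure is captured together with its input instead of killing the task. The following is only a minimal sketch under stated assumptions: `SafeMatching`, `safeMatch` and `flakyMatcher` are hypothetical names introduced for illustration, and `flakyMatcher` is a stand-in that merely simulates a matcher that throws on some inputs. In a real job the matcher argument would be the actual library call.

```scala
import scala.util.{Failure, Success, Try}

object SafeMatching {
  // Hypothetical helper, not part of any library. It treats null or blank
  // names as "no match" and captures any exception thrown by the matcher
  // so the offending pair can be inspected instead of failing the task.
  def safeMatch(a: String, b: String)(matcher: (String, String) => Boolean): Either[Throwable, Boolean] = {
    val s1 = Option(a).map(_.trim).getOrElse("")
    val s2 = Option(b).map(_.trim).getOrElse("")
    if (s1.isEmpty || s2.isEmpty) Right(false)
    else Try(matcher(s1, s2)) match {
      case Success(result) => Right(result)
      case Failure(e)      => Left(e) // keep the exception for logging
    }
  }

  // Stand-in matcher used only in this example (an assumption, not the
  // real library): it throws on names containing a digit to imitate a bug
  // inside a third-party matcher, and otherwise compares case-insensitively.
  val flakyMatcher: (String, String) => Boolean = (x, y) =>
    if (x.exists(_.isDigit)) throw new NullPointerException("simulated library failure")
    else x.equalsIgnoreCase(y)
}
```

With a helper like this, the filter could become `rdd.filter(r => SafeMatching.safeMatch(r.fnam_1, r.fnam_2)(matcher).fold(_ => false, identity))`, where `matcher` is whichever double metaphone comparison is in use. Logging the `Left` cases on a sample of the data would show which name pairs make the library throw.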
