Spark throws java.lang.NullPointerException when mapping an RDD with a Java phonetic matching library on null values

Question

I have an RDD that I created from a DataFrame using map:

case class Record(id_1: Int, fnam_1: String, lnam_1: String, id_2: Long, fnam_2: String, lnam_2: String)
val rdd = df.map {
  case Row(id_1: Int, fnam_1: String, lnam_1: String, id_2: Long, fnam_2: String, lnam_2: String) =>
    Record(id_1, fnam_1, lnam_1, id_2, fnam_2, lnam_2)
}
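A side note on the pattern match above: in Scala, a type pattern such as fnam_1: String does not match null, so a row with a missing name falls through the single case and raises scala.MatchError instead of producing a Record. The sketch below shows that behaviour with a plain Seq[Any] standing in for Row. That stand-in is an assumption made so the example runs without Spark, and NullPatternDemo and describe are hypothetical names.

```scala
// Stand-in for a Spark Row: a plain Seq[Any] holding one record's fields.
// (Hypothetical data; no Spark dependency is needed to show the behaviour.)
object NullPatternDemo {
  def describe(fields: Seq[Any]): String = fields match {
    // A type pattern such as `name: String` never matches null.
    case Seq(id: Int, name: String) => s"matched: $id, $name"
    // Without this fallback, a null name would raise scala.MatchError.
    case Seq(id: Int, null)         => s"null name for id $id"
    case other                      => s"unexpected shape: $other"
  }
}
```

If null names are expected, an explicit null case (or reading the fields through Option) keeps such rows from failing the conversion.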

Then I filter this RDD using a Java phonetic matching library, as shown below:

import edu.ualr.oyster.utilities.DoubleMetaphone

def matchFirstName(rec: Record) = {
  val s1 = Option(rec.fnam_1).getOrElse("")
  val s2 = Option(rec.fnam_2).getOrElse("")
  if (s1.isEmpty || s2.isEmpty)
    false
  else
    new DoubleMetaphone().compareDoubleMetaphone(s1, s2)
}

val rdd_filtered = rdd.filter(matchFirstName(_))

When I run this, I get a NullPointerException:

17/04/06 19:06:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 160, my.work.cluster.com): java.lang.NullPointerException
    at edu.ualr.oyster.utilities.DoubleMetaphone.compareDoubleMetaphone(DoubleMetaphone.java:1020)
    at funpackage.EntityResolution$.phoneticMatching(EntityResolution.scala:106)
    at esurance.EntityResolution$.esurance$EntityResolution$$matchNames$1(EntityResolution.scala:118)
    at esurance.EntityResolution$$anonfun$8.apply(EntityResolution.scala:137)
    at esurance.EntityResolution$$anonfun$8.apply(EntityResolution.scala:137)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I've tried the phonetic matching on a pair of strings in the project and it worked without problems. I've also used the same library in Spark SQL, wrapped in a user-defined function, with no problems. I suspect the failure is caused by some of my values being missing (null), but I tried to handle that with the Option calls above. Any idea why this is failing?
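One way to test the null suspicion is to run the comparison over a sample of name pairs and keep only the pairs that make it throw. The sketch below is a minimal version of that idea. CompareDiagnostics and failingPairs are hypothetical names, and compare is a placeholder for the real library call, so nothing here depends on the oyster library.

```scala
import scala.util.{Failure, Try}

object CompareDiagnostics {
  // Runs `compare` on every pair and returns the pairs that made it throw,
  // each together with the exception it raised. `compare` is a placeholder
  // for the real call, e.g. a function wrapping compareDoubleMetaphone.
  def failingPairs(pairs: Seq[(String, String)])(
      compare: (String, String) => Boolean
  ): Seq[((String, String), Throwable)] =
    pairs
      .map(p => p -> Try(compare(p._1, p._2)))
      .collect { case (p, Failure(e)) => (p, e) }
}
```

Applied to a small sample collected to the driver, this shows whether the failing inputs are really null or empty, or ordinary strings that the library itself cannot handle.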

Answer 1

I did not dig into the edu.ualr.oyster library to confirm that it causes the exception, but that appears to be the case. I switched to the org.apache.commons.codec.language library, which provides the same Double Metaphone function, and the program now runs on Spark without problems.
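A rough sketch of how the filter predicate could be written so that the phonetic library is easy to swap: the comparator is passed in as a function, and null or empty names never reach it. NameMatching and Names are hypothetical names used for illustration. The Commons Codec call in the comment assumes commons-codec is on the classpath and is not exercised by the sketch itself.

```scala
object NameMatching {
  // Hypothetical record with only the two fields the predicate needs.
  final case class Names(fnam_1: String, fnam_2: String)

  // Null-safe predicate: null or blank names never reach `compare`.
  // With Commons Codec available, `compare` could be (assumption):
  //   (a: String, b: String) =>
  //     new org.apache.commons.codec.language.DoubleMetaphone()
  //       .isDoubleMetaphoneEqual(a, b)
  // Creating the encoder inside the function avoids capturing it in the
  // closure that Spark has to serialize.
  def matchFirstName(compare: (String, String) => Boolean)(rec: Names): Boolean = {
    val s1 = Option(rec.fnam_1).map(_.trim).getOrElse("")
    val s2 = Option(rec.fnam_2).map(_.trim).getOrElse("")
    s1.nonEmpty && s2.nonEmpty && compare(s1, s2)
  }
}
```

The predicate can then be used as rdd.filter(NameMatching.matchFirstName(compare)) once the RDD's element type is adapted to the record class in use.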

Answer 1: accepted, 2017-04-07 23:38:22
