
Serialize Lucene StandardAnalyzer for Apache Spark RDD map transformation

I have a Spark RDD of Strings (named RDD1) generated from an HDFS file, and I also have a list of Strings used as a dictionary. I want to apply a map function to RDD1 so that, for each line of text, I search a Lucene index built from the dictionary and return the top three matches for that line. I'm using Lucene's TopScoreDocCollector to achieve this. The single-machine version works fine, but once I run it on a cluster it reports:

ThrowableSerializationWrapper: Task exception could not be deserialized java.lang.ClassNotFoundException: org.apache.lucene.queryparser.classic.ParseException

My program logic is to first create a broadcast variable from the dictionary (a list of Strings). Then, in the map function, I build a Lucene index from that broadcast variable. I believe the error happens when I call:

StandardAnalyzer analyzer = new StandardAnalyzer();

I don't believe this is caused by forgetting to add the Lucene jars; I'm submitting the job with the following command.

spark-submit --class jinxuanw.clairvoyant.App --jars lucene-analyzers-common-5.3.1.jar,lucene-core-5.3.1.jar,lucene-queryparser-5.3.1.jar jobtitlematch-1.0.jar 

Unfortunately, StandardAnalyzer is not Serializable, so Spark cannot ship such objects from the driver to the executors. Nevertheless, you can instantiate these objects on the executors themselves, which sidesteps the serialization issue entirely.
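
A common way to do this is to build the analyzer and the in-memory index inside mapPartitions, so every Lucene object is constructed on the executor and nothing Lucene-related ever crosses the driver/executor boundary. The sketch below assumes the Spark 1.x Java API (where FlatMapFunction returns an Iterable) and Lucene 5.3; the class, method, and field names (TitleMatcher, matchTitles, "title") are illustrative and not taken from the original job.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.broadcast.Broadcast;

public class TitleMatcher {

    /** For each input line, append the top three dictionary matches (tab-separated). */
    public static JavaRDD<String> matchTitles(JavaRDD<String> lines,
                                              final Broadcast<List<String>> dictionary) {
        return lines.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
            @Override
            public Iterable<String> call(Iterator<String> partition) throws Exception {
                // Everything below is created on the executor, once per partition,
                // so the non-serializable analyzer never has to be shipped by Spark.
                StandardAnalyzer analyzer = new StandardAnalyzer();
                RAMDirectory dir = new RAMDirectory();
                IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
                for (String entry : dictionary.value()) {
                    Document doc = new Document();
                    doc.add(new TextField("title", entry, Field.Store.YES));
                    writer.addDocument(doc);
                }
                writer.close();

                IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
                QueryParser parser = new QueryParser("title", analyzer);

                List<String> out = new ArrayList<String>();
                while (partition.hasNext()) {
                    String line = partition.next();
                    // Collect the three highest-scoring dictionary entries for this line.
                    TopScoreDocCollector collector = TopScoreDocCollector.create(3);
                    searcher.search(parser.parse(QueryParser.escape(line)), collector);
                    StringBuilder sb = new StringBuilder(line);
                    for (ScoreDoc hit : collector.topDocs().scoreDocs) {
                        sb.append('\t').append(searcher.doc(hit.doc).get("title"));
                    }
                    out.add(sb.toString());
                }
                return out;
            }
        });
    }
}

Using mapPartitions rather than map means the dictionary index is built once per partition instead of once per line; if the dictionary is large, building it once per executor JVM (for example, behind a lazily initialized static holder) would be cheaper still.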
