
Serialize Lucene StandardAnalyzer for Apache Spark RDD map transformation

I have a Spark RDD of Strings (named RDD1) generated from an HDFS file, and I also have a list of Strings used as a dictionary. I want to apply a map function to RDD1 so that, for each line of text, I search a Lucene index built from the dictionary and return the top three matches for that line. I'm using Lucene's TopScoreDocCollector to achieve this. The single-machine version works fine, but once I run it on a cluster it reports:

ThrowableSerializationWrapper: Task exception could not be deserialized java.lang.ClassNotFoundException: org.apache.lucene.queryparser.classic.ParseException

My program logic is to first create a broadcast variable from the dictionary (a list of Strings). Then, in the map function, I build a Lucene index from that broadcast variable. I believe the error happens when I call:

StandardAnalyzer analyzer = new StandardAnalyzer();

I don't believe this is caused by forgetting to add the Lucene jars; I'm submitting the job with the following command.

spark-submit --class jinxuanw.clairvoyant.App --jars lucene-analyzers-common-5.3.1.jar,lucene-core-5.3.1.jar,lucene-queryparser-5.3.1.jar jobtitlematch-1.0.jar 

Unfortunately, StandardAnalyzer is not Serializable, so Spark cannot ship such objects from the driver to the executors. Nevertheless, you can instantiate these objects on the executors themselves, which sidesteps the serialization issue entirely.
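
A common way to do this is to build the analyzer and the in-memory index inside mapPartitions, so every Lucene object is constructed on the executor and nothing Lucene-related ever crosses the driver/executor boundary. The sketch below assumes the Spark 1.x Java API (where FlatMapFunction returns an Iterable) and Lucene 5.3; the class, method, and field names (TitleMatcher, matchTitles, "title") are illustrative and not taken from the original job.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.broadcast.Broadcast;

public class TitleMatcher {

    /** For each input line, append the top three dictionary matches (tab-separated). */
    public static JavaRDD<String> matchTitles(JavaRDD<String> lines,
                                              final Broadcast<List<String>> dictionary) {
        return lines.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
            @Override
            public Iterable<String> call(Iterator<String> partition) throws Exception {
                // Everything below is created on the executor, once per partition,
                // so the non-serializable analyzer never has to be shipped by Spark.
                StandardAnalyzer analyzer = new StandardAnalyzer();
                RAMDirectory dir = new RAMDirectory();
                IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
                for (String entry : dictionary.value()) {
                    Document doc = new Document();
                    doc.add(new TextField("title", entry, Field.Store.YES));
                    writer.addDocument(doc);
                }
                writer.close();

                IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
                QueryParser parser = new QueryParser("title", analyzer);

                List<String> out = new ArrayList<String>();
                while (partition.hasNext()) {
                    String line = partition.next();
                    // Collect the three highest-scoring dictionary entries for this line.
                    TopScoreDocCollector collector = TopScoreDocCollector.create(3);
                    searcher.search(parser.parse(QueryParser.escape(line)), collector);
                    StringBuilder sb = new StringBuilder(line);
                    for (ScoreDoc hit : collector.topDocs().scoreDocs) {
                        sb.append('\t').append(searcher.doc(hit.doc).get("title"));
                    }
                    out.add(sb.toString());
                }
                return out;
            }
        });
    }
}

Using mapPartitions rather than map means the dictionary index is built once per partition instead of once per line; if the dictionary is large, building it once per executor JVM (for example, behind a lazily initialized static holder) would be cheaper still.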
