Serialize Lucene StandardAnalyzer for Apache Spark RDD map transformation
I have a String SparkRDD (named RDD1) generated from an HDFS file, and I also have a list of Strings to use as a dictionary. I want to apply a map function to RDD1 so that, for each line of string, I search a Lucene index built from the dictionary and return the top three matches for that line. I'm using Lucene's TopScoreDocCollector to achieve this. The single-machine version works fine, but once I run it on the cluster it reports:
ThrowableSerializationWrapper: Task exception could not be deserialized java.lang.ClassNotFoundException: org.apache.lucene.queryparser.classic.ParseException
My program logic first creates a broadcast variable from the dictionary (a String list). Then, in the map function, I build a Lucene index from that broadcast variable. I believe the error happens when I call:
StandardAnalyzer analyzer = new StandardAnalyzer();
I believe this is not caused by forgetting to add the Lucene jars. I'm submitting the job with:
spark-submit --class jinxuanw.clairvoyant.App --jars lucene-analyzers-common-5.3.1.jar,lucene-core-5.3.1.jar,lucene-queryparser-5.3.1.jar jobtitlematch-1.0.jar
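For context, a minimal sketch of the driver-side setup described above (the JavaSparkContext name sc, the dictionary entries, and the HDFS path are illustrative, not from the original program):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// Driver side: the dictionary is a plain List<String>, which serializes fine,
// so broadcasting it is not the problem.
List<String> dictionary = Arrays.asList("data engineer", "software engineer", "data scientist");
Broadcast<List<String>> dictBroadcast = sc.broadcast(dictionary);
// RDD1: one line of text per record, read from HDFS.
JavaRDD<String> rdd1 = sc.textFile("hdfs:///path/to/input");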
Unfortunately, StandardAnalyzer is not Serializable, so such objects cannot be moved from the driver to the executors. Nevertheless, it is possible to instantiate these objects inside the executors, bypassing the serialization issue altogether.
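A minimal sketch of that approach, reusing the dictBroadcast and rdd1 names from the sketch above and assuming the Spark 1.x Java API (where FlatMapFunction returns an Iterable; on Spark 2.x return results.iterator() instead) with Lucene 5.3.1. Using mapPartitions instead of map means every Lucene object is created on the executor, once per partition, so nothing non-serializable ever crosses the driver/executor boundary and the index is not rebuilt for every line:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> top3 = rdd1.mapPartitions(lines -> {
    // Created here, on the executor, so the analyzer is never serialized.
    StandardAnalyzer analyzer = new StandardAnalyzer();
    // In-memory index, local to this partition.
    RAMDirectory dir = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
        for (String entry : dictBroadcast.value()) {   // only the List<String> is shipped
            Document doc = new Document();
            doc.add(new TextField("entry", entry, Field.Store.YES));
            writer.addDocument(doc);
        }
    }
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    QueryParser parser = new QueryParser("entry", analyzer);

    List<String> results = new ArrayList<>();
    while (lines.hasNext()) {
        String line = lines.next();
        TopScoreDocCollector collector = TopScoreDocCollector.create(3);  // top three hits
        searcher.search(parser.parse(QueryParser.escape(line)), collector);
        StringBuilder out = new StringBuilder(line);
        for (ScoreDoc hit : collector.topDocs().scoreDocs) {
            out.append('\t').append(searcher.doc(hit.doc).get("entry"));
        }
        results.add(out.toString());
    }
    return results;  // Spark 1.x: Iterable<String>; Spark 2.x: results.iterator()
});

Because the analyzer, index, and searcher now live entirely inside the mapPartitions closure body (rather than being captured from the driver), Spark only has to serialize the broadcast handle and the lambda itself, which resolves the original failure.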