
Serialize Lucene StandardAnalyzer for Apache Spark RDD map transformation

I have a Spark RDD of Strings (named RDD1) generated from an HDFS file, and a list of Strings that serves as a dictionary. I want to apply a map function to RDD1 so that, for each line of the RDD, I search a Lucene index built from the dictionary and return the top three matches for that line. I'm using Lucene's TopScoreDocCollector to achieve this. I have no problems with the single-machine version, but once I run it on the cluster it reports:

ThrowableSerializationWrapper: Task exception could not be deserialized java.lang.ClassNotFoundException: org.apache.lucene.queryparser.classic.ParseException

My program logic first creates a broadcast variable from the dictionary (a String list). Then, in the map function, I build a Lucene index from that broadcast variable. I believe the error happens when I call:

StandardAnalyzer analyzer = new StandardAnalyzer();
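
For reference, the map stage looks roughly like this (a hypothetical reconstruction, not the original code; sc, dictionary, and lines are placeholder names, and the Lucene calls follow the 5.3 API matching the jars below):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

Broadcast<List<String>> dict = sc.broadcast(dictionary);

JavaRDD<List<String>> top3 = lines.map(line -> {
    // Build an in-memory Lucene index from the broadcast dictionary.
    StandardAnalyzer analyzer = new StandardAnalyzer();
    RAMDirectory dir = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
        for (String entry : dict.value()) {
            Document doc = new Document();
            doc.add(new TextField("text", entry, Field.Store.YES));
            writer.addDocument(doc);
        }
    }
    // Search the index and keep the top three hits for this line.
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    TopScoreDocCollector collector = TopScoreDocCollector.create(3);
    searcher.search(new QueryParser("text", analyzer).parse(QueryParser.escape(line)), collector);
    List<String> matches = new ArrayList<>();
    for (ScoreDoc sd : collector.topDocs().scoreDocs) {
        matches.add(searcher.doc(sd.doc).get("text"));
    }
    return matches;
});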

I believe this is not caused by forgetting to add the Lucene jars. I'm using the following command to run the job:

spark-submit --class jinxuanw.clairvoyant.App --jars lucene-analyzers-common-5.3.1.jar,lucene-core-5.3.1.jar,lucene-queryparser-5.3.1.jar jobtitlematch-1.0.jar 

Unfortunately, StandardAnalyzer is not Serializable, so such objects cannot be shipped from the driver to the executors. Nevertheless, it is possible to instantiate such objects in the executors themselves, bypassing the serialization issue.
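
A minimal sketch of that approach, using the same imports and placeholder names as the reconstruction above: with mapPartitions, the StandardAnalyzer and the index are created inside the closure, so they are instantiated on each executor and never serialized; as a side benefit, the dictionary is indexed once per partition instead of once per line. This assumes Spark 2.x, where the mapPartitions function returns an Iterator; under Spark 1.x, return the List itself.

JavaRDD<List<String>> top3 = lines.mapPartitions(iter -> {
    // Created on the executor, so the non-serializable analyzer never leaves it.
    StandardAnalyzer analyzer = new StandardAnalyzer();
    RAMDirectory dir = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
        for (String entry : dict.value()) {
            Document doc = new Document();
            doc.add(new TextField("text", entry, Field.Store.YES));
            writer.addDocument(doc);
        }
    }
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));

    // Reuse the per-partition index for every line in this partition.
    List<List<String>> results = new ArrayList<>();
    while (iter.hasNext()) {
        TopScoreDocCollector collector = TopScoreDocCollector.create(3);
        searcher.search(new QueryParser("text", analyzer).parse(QueryParser.escape(iter.next())), collector);
        List<String> matches = new ArrayList<>();
        for (ScoreDoc sd : collector.topDocs().scoreDocs) {
            matches.add(searcher.doc(sd.doc).get("text"));
        }
        results.add(matches);
    }
    return results.iterator();
});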
