Stanford LexicalizedParser throws NPE when used in Spark
I am trying to use Stanford's LexicalizedParser inside a Spark RDD map function.
The algorithm is roughly as follows:
val parser = LexicalizedParser.loadModel("englishPCFG.ser.gz")
val parserBroadcast = sparkContext.broadcast(parser) // using Kryo serializer here
someSparkRdd.map { case sentence: List[HasWord] =>
  parserBroadcast.value.parse(sentence) // NPE is thrown here, see below
}
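The reason I want to instantiate the parser only once (outside the map) and then broadcast it is that the map iterates over nearly a million sentences; creating a parser per sentence puts too much load on the Java garbage collector and slows the whole job down noticeably.
When the map statement executes, the following NullPointerException is thrown: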
java.lang.NullPointerException
at edu.stanford.nlp.parser.lexparser.BaseLexicon.isKnown(BaseLexicon.java:152)
at edu.stanford.nlp.parser.lexparser.BaseLexicon.ruleIteratorByWord(BaseLexicon.java:208)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.initializeChart(ExhaustivePCFGParser.java:1343)
at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:457)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parseInternal(LexicalizedParserQuery.java:258)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.parse(LexicalizedParserQuery.java:536)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:301)
at my.class.NounPhraseExtractionWithStanford$$anonfun$extractNounPhrases$3.apply(NounPhraseExtractionWithStanford.scala:28)
at my.class.NounPhraseExtractionWithStanford$$anonfun$extractNounPhrases$3.apply(NounPhraseExtractionWithStanford.scala:27)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at my.class.NounPhraseExtractionWithStanford$.extractNounPhrases(NounPhraseExtractionWithStanford.scala:27)
at my.class.HBaseDocumentProducerWithStanford$$anonfun$produceDocumentTokens$3.apply(HBaseDocumentProducerWithStanford.scala:104)
at my.class.HBaseDocumentProducerWithStanford$$anonfun$produceDocumentTokens$3.apply(HBaseDocumentProducerWithStanford.scala:104)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$15.apply(PairRDDFunctions.scala:674)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$15.apply(PairRDDFunctions.scala:674)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:172)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
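Looking at the source code, this appears to happen because edu.stanford.nlp.parser.lexparser.BaseLexicon has many transient class variables, so the SerDe performed during the broadcast (using the Kryo serializer) leaves BaseLexicon only half-initialized.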
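The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual BaseLexicon code: `ToyLexicon`, `TransientDemo` and `roundTrip` are hypothetical names, and plain Java serialization stands in for Kryo (whose default field-based serializer, as far as I know, also skips transient fields). It only illustrates how a transient field comes back as null after a serialize/deserialize round trip.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a lexicon whose lookup table is marked transient.
class ToyLexicon(words: Set[String]) extends Serializable {
  // Populated in the constructor, but skipped by serialization.
  @transient private val known: java.util.Set[String] = {
    val s = new java.util.HashSet[String]()
    words.foreach(w => s.add(w))
    s
  }

  // Dereferences `known`, so it fails if the field was not restored.
  def isKnown(word: String): Boolean = known.contains(word)
}

object TransientDemo {
  // Serialize to a byte array and read the object back, roughly what
  // shipping a value to another JVM does.
  def roundTrip[A <: java.io.Serializable](value: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val original = new ToyLexicon(Set("parse"))
    println(original.isKnown("parse")) // works on the original instance

    val copy = roundTrip(original)
    // The constructor is not re-run on deserialization, so `known` is null
    // in the copy and the lookup fails with a NullPointerException.
    println(scala.util.Try(copy.isKnown("parse")))
  }
}
```

The deserialized copy looks like a valid object but fails on first use, which matches the shape of the stack trace above: the NPE surfaces deep inside the lookup, not at deserialization time.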
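I realize the LexParser developers did not have Spark in mind when designing it, but I would still greatly appreciate any hints on how to use it in my scenario (i.e. with Spark).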
One possible workaround, though I am not 100% sure it will work:
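// Function object with no fields: nothing parser-related is serialized with it.
// WhateverParseReturns is a placeholder for the return type of parse.
class ParseSentence extends (List[HasWord] => WhateverParseReturns) with Serializable {
  def apply(sentence: List[HasWord]) = ParseSentence.parser.parse(sentence)
}
// Companion object: initialized on first use in each JVM, never shipped over the wire.
object ParseSentence {
  val parser = LexicalizedParser.loadModel("englishPCFG.ser.gz")
}
someSparkRdd.map(new ParseSentence)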
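This way parser does not need to be serialized/deserialized, because it is not captured as a field of the function object. Each worker JVM loads the model itself the first time the ParseSentence object is accessed.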
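To see why the function object stays cheap to ship, here is a dependency-free sketch of the same pattern. `HeavyResource`, `ResourceHolder`, `Process` and `HolderDemo` are hypothetical names, and Java serialization is used as a stand-in for whatever serializer Spark is configured with. The resource is deliberately not Serializable, so the round trip would fail if it were captured as a field.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an expensive resource such as a parser.
// Deliberately NOT Serializable.
class HeavyResource {
  def process(s: String): Int = s.length
}

// Singleton holder: the resource is created lazily, once per JVM,
// the first time `resource` is accessed.
object ResourceHolder {
  lazy val resource: HeavyResource = new HeavyResource
}

// The function object has no fields. It reaches the resource through the
// singleton, so serializing it never touches HeavyResource.
class Process extends (String => Int) with Serializable {
  def apply(s: String): Int = ResourceHolder.resource.process(s)
}

object HolderDemo {
  // Serialize to a byte array and read the object back.
  def roundTrip[A <: java.io.Serializable](value: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    // Would throw NotSerializableException if HeavyResource were a field of Process.
    val shipped = roundTrip(new Process)
    println(shipped("spark")) // length of the input string
  }
}
```

The trade-off is that every JVM pays the initialization cost once on its own, which for a parser means loading the model file on each executor, so the model has to be reachable from there.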