
CoreNLP significantly slowing down Spark job

I'm attempting to write a Spark job that does classification by splitting each document into sentences and then lemmatizing each word in those sentences for logistic regression. However, I'm finding that Stanford's annotation class is causing a serious bottleneck in my Spark job (it's taking 20 minutes to process only 500k documents).

Here is the code I'm currently using for sentence parsing and classification.

Sentence parsing:

import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.util.CoreMap
import scala.collection.JavaConversions._ // implicit java.util.List -> Scala collection conversions

def prepSentences(text: String): List[CoreMap] = {
    val mod = text.replace("Sr.", "Sr") // deals with an edge case in sentence splitting
    val doc = new Annotation(mod)
    pipeHolder.get.annotate(doc)        // pipeHolder supplies the shared StanfordCoreNLP pipeline
    val sentences = doc.get(classOf[SentencesAnnotation]).toList
    sentences
}

I then take each CoreMap and process the lemmas as follows:

def coreMapToLemmas(map: CoreMap): Seq[String] = {
  map.get(classOf[TokensAnnotation]).par.foldLeft(Seq[String]())((a, b) => {
    val lemma = b.get(classOf[LemmaAnnotation])
    // keep the lemma unless it is a stop word or punctuation
    if (!(stopWords.contains(lemma.toLowerCase) || puncWords.contains(b.originalText())))
      a :+ lemma.toLowerCase
    else
      a
  })
}

Perhaps there's a class that only involves some of the processing?

Try using CoreNLP's Shift Reduce parser implementation.

A basic example (typed without a compiler):

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val props = new Properties()
props.put("annotators", "tokenize, ssplit, pos, parse, lemma, sentiment")
// use the Shift-Reduce parser with beam search
// http://nlp.stanford.edu/software/srparser.shtml
props.put("parse.model", "edu/stanford/nlp/models/srparser/englishSR.beam.ser.gz")
val corenlp = new StanfordCoreNLP(props)

val text = "text to annotate"
val annotation = new Annotation(text)
corenlp.annotate(annotation)

I work on a production system which uses CoreNLP in a Spark processing pipeline. Using the Shift Reduce parser with beam search improved the parsing speed of my pipeline by a factor of 16 and reduced the amount of working memory required for parsing. The Shift Reduce parser is linear in runtime complexity, which is better than the standard lexicalized PCFG parser.
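
For context, here's a rough sketch of how such a pipeline can be wired into a Spark job (the helper name lemmatizeDocs and the RDD[String] input are illustrative, not from the original post): build one StanfordCoreNLP instance per partition inside mapPartitions, so the heavyweight, non-serializable pipeline is constructed once per task rather than once per document.

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.spark.rdd.RDD
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TokensAnnotation, LemmaAnnotation}

// Illustrative helper: one pipeline per partition, reused for every document in it.
def lemmatizeDocs(docs: RDD[String]): RDD[Seq[String]] =
  docs.mapPartitions { iter =>
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma, parse")
    props.put("parse.model", "edu/stanford/nlp/models/srparser/englishSR.beam.ser.gz")
    val pipeline = new StanfordCoreNLP(props) // built once per partition, not per document
    iter.map { text =>
      val doc = new Annotation(text)
      pipeline.annotate(doc)
      doc.get(classOf[SentencesAnnotation]).asScala
        .flatMap(_.get(classOf[TokensAnnotation]).asScala)
        .map(_.get(classOf[LemmaAnnotation]).toLowerCase)
        .toSeq
    }
  }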

To use the shift reduce parser, you'll need the shift reduce models jar on your classpath (it's a separate download from the CoreNLP website).
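
As an illustration (not from the original answer), with sbt the managed CoreNLP dependency can be declared as below, while the separately downloaded shift reduce models jar can be dropped into the project's lib/ directory so sbt treats it as an unmanaged jar, or shipped to the cluster with spark-submit --jars. The version number is only an example.

// build.sbt (sketch; version is illustrative)
libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0"
libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models"
// The shift reduce models are a separate download from the CoreNLP site;
// put that jar in lib/ (unmanaged classpath) or pass it via spark-submit --jars.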
