
Stanford CoreNLP Options in Scala

Hello, I am trying to update the newlineIsSentenceBreak option in Stanford CoreNLP.

I am running Spark in Scala with the following versions:

Software Version
Spark 2.3.0
Scala 2.11.8
Java 8 (1.8.0_73)
spark-corenlp 0.3.1
stanford-corenlp 3.9.1

I have found what I believe is the definition of where the newlineIsSentenceBreak option is set, but when I try to implement it I keep getting error messages.

Here is a working code snippet:

import edu.stanford.nlp.process.WordToSentenceProcessor

WordToSentenceProcessor.NewlineIsSentenceBreak.values
WordToSentenceProcessor.NewlineIsSentenceBreak.valueOf("ALWAYS")
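I believe these calls work because NewlineIsSentenceBreak is a plain Java enum nested in WordToSentenceProcessor, so only the standard Java enum methods (values, valueOf) are available on it. The same valueOf pattern applies to any Java enum called from Scala, e.g. with a JDK enum:

```scala
import java.util.concurrent.TimeUnit

// valueOf maps a constant name to the matching enum value
// (and throws IllegalArgumentException if the name does not match)
val unit: TimeUnit = TimeUnit.valueOf("SECONDS")
println(unit) // SECONDS
```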

But when I try to set the option I get an error. Specifically, I am trying to run something like:

WordToSentenceProcessor.NewlineIsSentenceBreak.stringToNewlineIsSentenceBreak("ALWAYS")

but I get this error:

error: value stringToNewlineIsSentenceBreak is not a member of object edu.stanford.nlp.process.WordToSentenceProcessor.NewlineIsSentenceBreak

Any help is appreciated!

Thank you Stack Overflow for being my rubber duck! https://en.wikipedia.org/wiki/Rubber_duck_debugging

To set the parameters in Scala (without using the Spark wrapper functions), you can set them on the Properties object that configures the pipeline, like this:

val props: Properties = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,ner")
props.put("ssplit.newlineIsSentenceBreak", "always")
props.put("ner.applyFineGrained", "false")

Then create the StanfordCoreNLP pipeline from those properties:

val pipeline: StanfordCoreNLP = new StanfordCoreNLP(props)

Because the Spark wrapper functions use the Simple CoreNLP API internally, I don't think their options can be modified. Please post an answer if you know how to do that!
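As an aside on the original compile error: skimming the CoreNLP source, stringToNewlineIsSentenceBreak appears to be a static helper on WordToSentenceProcessor itself rather than on the nested enum, so (untested, and not needed given the property-based approach above) a call like this might work instead:

```scala
import edu.stanford.nlp.process.WordToSentenceProcessor

// Hypothetical fix for the original error: call the helper on the outer class,
// not on the nested enum. The enum's own valueOf("ALWAYS") is the safe alternative.
val break = WordToSentenceProcessor.stringToNewlineIsSentenceBreak("ALWAYS")
```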

Here is a full example:

import java.util.Properties

import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TextAnnotation, TokensAnnotation}
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.util.CoreMap

import scala.collection.JavaConverters._

val props: Properties = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,ner")
props.put("ssplit.newlineIsSentenceBreak", "always")
props.put("ner.applyFineGrained", "false")
val pipeline: StanfordCoreNLP = new StanfordCoreNLP(props)
val text = "Quick brown fox jumps over the lazy dog. This is Harshal, he lives in Chicago.  I added \nthis sentence"

// create an empty Annotation with just the raw text
val document: Annotation = new Annotation(text)

// run all the annotators on this text
pipeline.annotate(document)

val sentences: List[CoreMap] = document.get(classOf[SentencesAnnotation]).asScala.toList

(for {
    sentence: CoreMap <- sentences
    token: CoreLabel <- sentence.get(classOf[TokensAnnotation]).asScala.toList
    lemma: String = token.lemma() // token.word() would give the surface form instead
    ner: String = token.ner()
} yield (sentence, lemma, ner)) foreach (t => println("sentence: " + t._1 + " | lemma: " + t._2 + " | ner: " + t._3))
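As a partial answer to my own follow-up about the Spark wrapper functions: one way around them is to bypass spark-corenlp entirely and wrap a pipeline configured with your own Properties in a Spark UDF. This is an untested sketch (ConfiguredNLP, splitSentences, and the column names are mine):

```scala
import java.util.Properties

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import org.apache.spark.sql.functions.udf

import scala.collection.JavaConverters._

// Hold the pipeline in an object with a transient lazy val so each executor
// builds its own instance instead of trying to serialize it with the closure.
object ConfiguredNLP {
  @transient lazy val pipeline: StanfordCoreNLP = {
    val props = new Properties()
    props.put("annotators", "tokenize,ssplit")
    props.put("ssplit.newlineIsSentenceBreak", "always")
    new StanfordCoreNLP(props)
  }
}

// UDF that splits a text column into sentences using the configured pipeline.
val splitSentences = udf { (text: String) =>
  val document = new Annotation(text)
  ConfiguredNLP.pipeline.annotate(document)
  document.get(classOf[SentencesAnnotation]).asScala.map(_.toString).toSeq
}

// Usage: df.withColumn("sentences", splitSentences($"text"))
```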
