
Stanford CoreNLP does not do sentence splitting for Chinese

My environment:

  • CoreNLP 3.5.1
  • stanford-chinese-corenlp-2015-01-30-models
  • default property file for Chinese: StanfordCoreNLP-chinese.properties
    • annotators = segment, ssplit

My test text is "這是第一個句子。這是第二個句子。", and I retrieve the sentences with:

import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TextAnnotation}
import scala.collection.JavaConverters._

val sentences = annotation.get(classOf[SentencesAnnotation]).asScala
var count = 0
for (sent <- sentences) {
  count += 1
  println(s"sentence$count = " + sent.get(classOf[TextAnnotation]))
}
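
For completeness, the annotation above is produced roughly like this (a minimal sketch, assuming the Chinese models jar with StanfordCoreNLP-chinese.properties is on the classpath):

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// Load the bundled Chinese defaults, then build the pipeline and annotate the text.
val props = new Properties()
props.load(getClass.getClassLoader.getResourceAsStream("StanfordCoreNLP-chinese.properties"))
val pipeline = new StanfordCoreNLP(props)

val annotation = new Annotation("這是第一個句子。這是第二個句子。")
pipeline.annotate(annotation)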

It always prints the whole test text as one sentence, not the expected two:

sentence1 = 這是第一個句子。這是第二個句子。

The expected output is:

expected sentence1 = 這是第一個句子。
expected sentence2 = 這是第二個句子。

I get the same result even if I add more properties, such as:

ssplit.eolonly = false
ssplit.isOneSentence = false
ssplit.newlineIsSentenceBreak = always
ssplit.boundaryTokenRegex = [.]|[!?]+|[。]|[!?]+
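
For reference, the same settings expressed programmatically on the props object (a sketch; the keys and values simply mirror the list above):

// Set before constructing the StanfordCoreNLP pipeline.
props.setProperty("ssplit.eolonly", "false")
props.setProperty("ssplit.isOneSentence", "false")
props.setProperty("ssplit.newlineIsSentenceBreak", "always")
props.setProperty("ssplit.boundaryTokenRegex", "[.]|[!?]+|[。]|[!?]+")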

The CoreNLP log output is:

Registering annotator segment with class edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
Adding annotator segment
Loading Segmentation Model [edu/stanford/nlp/models/segmenter/chinese/ctb.gz]...Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... Loading Chinese dictionaries from 1 files:
  edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz

loading dictionaries from edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz...Done. Unique words in ChineseDictionary is: 423200
done [56.9 sec].
done. Time elapsed: 57041 ms
Adding annotator ssplit
Adding Segmentation annotation...output: [null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null]
INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
這是第一個句子。這是第二個句子。
--->
[這是, 第一, 個, 句子, 。, 這是, 第二, 個, 句子, 。]
done. Time elapsed: 419 ms

I once saw someone get the following log line (with CoreNLP 3.5.0); oddly, I do not see it in my own logs:

Adding annotator ssplit edu.stanford.nlp.pipeline.AnnotatorImplementations:ssplit.boundaryTokenRegex=[.]|[!?]+|[。]|[!?]+

What is the problem? Is there a workaround? If it cannot be resolved, I can split the text myself, but then I do not know how to feed my splits back into the CoreNLP pipeline.

OK, I pulled off a workaround.

I define the ssplit annotator myself.

For convenience I hard-code the parameters here, though the right way would be to parse them from the props.

import java.util.Properties
import edu.stanford.nlp.pipeline.WordsToSentencesAnnotator

// Hard-code the boundary regex (including the full-width 。！？ punctuation)
// instead of reading it from the props.
class MyWordsToSentencesAnnotator extends WordsToSentencesAnnotator(
  true,
  "[.]|[!?]+|[。]|[!?]+",
  null,
  null,
  "never") {
  // Signature CoreNLP expects for a custom annotator; the props are ignored here.
  def this(name: String, props: Properties) { this() }
}

and register the class in the property file:

customAnnotatorClass.myssplit = ...
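
A minimal sketch of what that registration might look like in the property file, assuming the class is compiled into a hypothetical package com.example.nlp:

annotators = segment, myssplit
customAnnotatorClass.myssplit = com.example.nlp.MyWordsToSentencesAnnotator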

So apparently, I suspect, the default CoreNLP pipeline configuration or code has a bug?

I had the same problem until I replaced the Chinese punctuation with its Unicode escapes and set the property as follows:

// \u3002 is 。, \uFF01 is !, \uFF1F is ? (full-width punctuation as Unicode escapes)
props.setProperty("ssplit.boundaryTokenRegex", "[.\u3002]|[!?\uFF01\uFF1F]+");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
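
With that boundary regex in place, the loop from the question should then print the two sentences separately:

sentence1 = 這是第一個句子。
sentence2 = 這是第二個句子。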
