
How to keep punctuation in Stanford dependency parser

I am using Stanford CoreNLP (01.2016 version) and I would like to keep the punctuation in the dependency relations. I have found some ways of doing that when running it from the command line, but I didn't find anything about the Java code that extracts the dependency relations.

Here is my current code. It works, but no punctuation is included:

Annotation document = new Annotation(text);
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
props.setProperty("ssplit.newlineIsSentenceBreak", "always");
props.setProperty("ssplit.eolonly", "true");
props.setProperty("pos.model", modelPath1);
props.setProperty("parse.model", modelPath);
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);

// Second, standalone parser that re-parses every sentence
LexicalizedParser lp = LexicalizedParser.loadModel(modelPath + lexparserNameEn,
        "-maxLength", "200", "-retainTmpSubcategories");
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    List<CoreLabel> words = sentence.get(CoreAnnotations.TokensAnnotation.class);
    Tree parse = lp.apply(words);
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection<TypedDependency> td = gs.typedDependencies();
    parsedText += td.toString() + "\n";
}

Any kind of dependency relation is OK for me (basic, typed, collapsed, etc.); I just want to include the punctuation marks.

Thanks in advance.

You are doing quite a bit of extra work here: you run the parser once through CoreNLP and then again by calling lp.apply(words).

The easiest way to get a dependency tree/graph with punctuation marks is to use the CoreNLP option parse.keepPunct, as follows.

Annotation document = new Annotation(text);
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
props.setProperty("ssplit.newlineIsSentenceBreak", "always");
props.setProperty("ssplit.eolonly", "true");
props.setProperty("pos.model", modelPath1);
props.setProperty("parse.model", modelPath);
props.setProperty("parse.keepPunct", "true");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

pipeline.annotate(document);

// Sentences produced by the pipeline run above
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
   // Pick whichever representation you want
   SemanticGraph basicDeps = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
   SemanticGraph collapsed = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
   SemanticGraph ccProcessed = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
}

The sentence annotation object stores the dependency trees/graphs as a SemanticGraph. If you want the TypedDependency objects, use the method typedDependencies(). For example:

Collection<TypedDependency> dependencies = basicDeps.typedDependencies();
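
Since parse.keepPunct is just one more key in the Properties object, the configuration can be isolated from the rest of the pipeline code. The following is a minimal sketch of that idea, not part of CoreNLP itself: the class name DependencyPipelineProps and the method build are hypothetical, and the pos.model and parse.model entries from the question are left out on the assumption that the default models are on the classpath. It uses only java.util.Properties, so it runs without the CoreNLP jars.

```java
import java.util.Properties;

// Hypothetical helper (not a CoreNLP class): it only assembles the settings.
class DependencyPipelineProps {

    // Builds the pipeline properties; keepPunct sets the parse.keepPunct option.
    static Properties build(boolean keepPunct) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        props.setProperty("ssplit.newlineIsSentenceBreak", "always");
        props.setProperty("ssplit.eolonly", "true");
        // Assumption: default models are used, so no pos.model / parse.model here.
        props.setProperty("parse.keepPunct", Boolean.toString(keepPunct));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build(true);
        // With CoreNLP on the classpath, props would be passed to
        // new StanfordCoreNLP(props), as in the answer above.
        System.out.println("parse.keepPunct = " + props.getProperty("parse.keepPunct"));
    }
}
```

Keeping the flag behind a boolean parameter makes it easy to compare the dependency output with and without punctuation from the same calling code.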
