

Coreference resolution using Stanford CoreNLP

I am new to the Stanford CoreNLP toolkit and am trying to use it for a project that resolves coreferences in news texts. In order to use the Stanford CoreNLP coreference system, we would usually create a pipeline which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, and parsing. For example:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = "As competition heats up in Spain's crowded bank market, Banco Exterior de Espana is seeking to shed its image of a state-owned bank and move into new activities.";

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

Then we can easily get the sentence annotations with:

List<CoreMap> sentences = document.get(SentencesAnnotation.class);
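
Since the pipeline above includes the dcoref annotator, the resolved coreference chains can afterwards be read back from the document annotation. A minimal sketch (not part of the original question; CorefChain and CorefCoreAnnotations live in edu.stanford.nlp.dcoref in the releases that ship the dcoref annotator):

Map<Integer, CorefChain> corefChains =
        document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
for (CorefChain chain : corefChains.values()) {
    // each chain groups the mentions that were resolved to the same entity
    System.out.println(chain);
}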

However, I am using other tools for preprocessing and just need a stand-alone coreference resolution system. It is pretty easy to create the token and parse tree annotations and set them on the annotation:

// create new annotation
Annotation annotation = new Annotation();

// create token annotations for each sentence from the input file
List<CoreLabel> tokens = new ArrayList<>();
for(int tokenCount = 0; tokenCount < parsedSentence.size(); tokenCount++) {

        ArrayList<String> parsedLine = parsedSentence.get(tokenCount);
        String word = parsedLine.get(1);
        String lemma = parsedLine.get(2);
        String posTag = parsedLine.get(3);
        String namedEntity = parsedLine.get(4); 
        String partOfParseTree = parsedLine.get(6);

        CoreLabel token = new CoreLabel();
        token.setWord(word);
        token.setLemma(lemma);
        token.setTag(posTag);
        token.setNER(namedEntity);
        tokens.add(token);
    }

// set tokens annotations to annotation
annotation.set(TokensAnnotation.class, tokens);

// set parse tree annotations to annotation
Tree stanfordParseTree = Tree.valueOf(inputParseTree);
annotation.set(TreeAnnotation.class, stanfordParseTree);
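
For reference, these snippets assume the usual CoreNLP imports for the classes and annotation keys being used; roughly the following (package paths as in the CoreNLP 3.x releases that ship dcoref):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.SentenceIndexAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;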

However, creating the sentence annotations is pretty tricky, because to my knowledge there is no documentation that explains it in full detail. I am able to create the data structure for the sentence annotations and set it on the annotation:

List<CoreMap> sentences = new ArrayList<CoreMap>();
annotation.set(SentencesAnnotation.class, sentences);

I am sure it cannot be that difficult, but there is no documentation on how to create sentence annotations from the token annotations, i.e. how to fill the ArrayList with actual sentence annotations.

Any ideas?

Btw, if I use the token and parse tree annotations provided by my preprocessing tools, take only the sentence annotations from the StanfordCoreNLP pipeline, and then apply the StanfordCoreNLP stand-alone coreference resolution system, I get the correct results. So the only part missing for a complete stand-alone coreference resolution system is the ability to create the sentence annotations from the token annotations.

There is an Annotation constructor with a List<CoreMap> sentences argument which sets up the document if you have a list of already tokenized sentences.

For each sentence, you create a CoreMap object as follows. (Note that I also added a sentence and token index to each sentence and token object, respectively.)

int sentenceIdx = 1;
List<CoreMap> sentences = new ArrayList<CoreMap>();
// each parsedSentence is a list of token lines, as in the snippet in the question
for (List<ArrayList<String>> parsedSentence : parsedSentences) {
    CoreMap sentence = new CoreLabel();
    List<CoreLabel> tokens = new ArrayList<>();
    for(int tokenCount = 0; tokenCount < parsedSentence.size(); tokenCount++) {

        ArrayList<String> parsedLine = parsedSentence.get(tokenCount);
        String word = parsedLine.get(1);
        String lemma = parsedLine.get(2);
        String posTag = parsedLine.get(3);
        String namedEntity = parsedLine.get(4); 
        String partOfParseTree = parsedLine.get(6);

        CoreLabel token = new CoreLabel();
        token.setWord(word);
        token.setLemma(lemma);
        token.setTag(posTag);
        token.setNER(namedEntity);
        token.setIndex(tokenCount + 1);
        tokens.add(token);
    }

    // set tokens annotations and id of sentence 
    sentence.set(TokensAnnotation.class, tokens);
    sentence.set(SentenceIndexAnnotation.class, sentenceIdx++);

    // set the parse tree annotation for this sentence (inputParseTree should be the parse of the current sentence)
    Tree stanfordParseTree = Tree.valueOf(inputParseTree);
    sentence.set(TreeAnnotation.class, stanfordParseTree);

    // add sentence to list of sentences
    sentences.add(sentence);
}

Then you can create an Annotation instance with the sentences list:

Annotation annotation = new Annotation(sentences);
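
To run only the coreference step on such a hand-built annotation, one option is a pipeline that contains just the dcoref annotator with its requirement check disabled. This is a sketch rather than part of the original answer; depending on the CoreNLP version, dcoref may also expect further document-level annotations (for example a flat TokensAnnotation over all tokens, as set in the question):

// build a pipeline containing only dcoref; the second constructor argument
// disables the check that the required upstream annotators were run
Properties corefProps = new Properties();
corefProps.setProperty("annotators", "dcoref");
StanfordCoreNLP corefPipeline = new StanfordCoreNLP(corefProps, false);
corefPipeline.annotate(annotation);

// read back the resolved coreference chains
Map<Integer, CorefChain> chains =
        annotation.get(CorefCoreAnnotations.CorefChainAnnotation.class);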
