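Stanford Core NLP - understanding coreference resolution
I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. For example, below is a sentence and the corresponding CorefChainAnnotation: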
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}
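I'm not sure I understand what these numbers mean. Looking at the source code didn't help either.
Thanks
I've been working with the coreference dependency graph, and I started from the other answer to this question. After a while, though, I realized that the algorithm in that answer is not quite correct: the output it produces is not even close to what my modified version gives.
For anyone else who lands on this post, here is the algorithm I ended up with. It also filters out self-references, because every representative mention also mentions itself and many mentions refer only to themselves.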
Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);
for (Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
    CorefChain c = entry.getValue();
    // skip singleton chains: they are only self-references, which aren't that useful
    if (c.getCorefMentions().size() <= 1)
        continue;
    CorefMention cm = c.getRepresentativeMention();
    String clust = "";
    List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum-1).get(TokensAnnotation.class);
    for (int i = cm.startIndex-1; i < cm.endIndex-1; i++)
        clust += tks.get(i).get(TextAnnotation.class) + " ";
    clust = clust.trim();
    System.out.println("representative mention: \"" + clust + "\" is mentioned by:");
    for (CorefMention m : c.getCorefMentions()) {
        String clust2 = "";
        tks = document.get(SentencesAnnotation.class).get(m.sentNum-1).get(TokensAnnotation.class);
        for (int i = m.startIndex-1; i < m.endIndex-1; i++)
            clust2 += tks.get(i).get(TextAnnotation.class) + " ";
        clust2 = clust2.trim();
        // don't need the self mention
        if (clust.equals(clust2))
            continue;
        System.out.println("\t" + clust2);
    }
}
The final output for your example sentence is as follows:
representative mention: "a basic unit of matter" is mentioned by:
    The atom
    it
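Usually "The atom" ends up being the representative mention, but surprisingly in this case it does not. Another example, with slightly more accurate output, is the following sentence:
The Revolutionary War occurred during the 18th century, and it was the first war in the United States.
It produces the following output: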
representative mention: "The Revolutionary War" is mentioned by:
    it
    the first war in the United States
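The index arithmetic and the filtering in the loop above can be tried without CoreNLP on the classpath. The following is a minimal sketch, not the library's API: `Mention` is a hypothetical stand-in for `CorefMention`, and the token indices of the three mentions are assumptions picked by hand so that they reproduce the output shown for the atom sentence. It treats `startIndex` as 1-based and `endIndex` as exclusive, which is what the `startIndex-1` and `endIndex-1` loop bounds imply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CorefChainDemo {
    // Hypothetical stand-in for CorefMention (not the CoreNLP class).
    static final class Mention {
        final int sentNum;     // 1-based sentence number
        final int startIndex;  // 1-based index of the first token
        final int endIndex;    // 1-based index one past the last token
        Mention(int sentNum, int startIndex, int endIndex) {
            this.sentNum = sentNum;
            this.startIndex = startIndex;
            this.endIndex = endIndex;
        }
    }

    // Joins the tokens covered by a mention, mirroring the
    // startIndex-1 .. endIndex-1 loop in the answer.
    static String spanText(List<List<String>> sentences, Mention m) {
        List<String> tokens = sentences.get(m.sentNum - 1);
        return String.join(" ", tokens.subList(m.startIndex - 1, m.endIndex - 1));
    }

    // Returns the mentions whose text differs from the representative's.
    // Chains with a single mention yield an empty list.
    static List<String> otherMentions(List<List<String>> sentences,
                                      Mention representative,
                                      List<Mention> chain) {
        List<String> result = new ArrayList<>();
        if (chain.size() <= 1) {
            return result;
        }
        String repText = spanText(sentences, representative);
        for (Mention m : chain) {
            String text = spanText(sentences, m);
            if (!text.equals(repText)) {
                result.add(text);
            }
        }
        return result;
    }

    // Hand-built data for the atom sentence; the indices are assumptions.
    static List<String> demo() {
        List<List<String>> sentences = Arrays.asList(Arrays.asList(
            "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it"));
        Mention theAtom = new Mention(1, 1, 3);
        Mention basicUnit = new Mention(1, 4, 9);
        Mention it = new Mention(1, 10, 11);
        return otherMentions(sentences, basicUnit, Arrays.asList(theAtom, basicUnit, it));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [The atom, it]
    }
}
```

Comparing mention text, as the answer does, also drops any other mention whose wording happens to equal the representative's; comparing positions instead would keep those.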
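The first number is a cluster ID (a cluster groups the tokens that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The pairs of numbers are the output of CorefChain#toString():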
public String toString() {
    return position.toString();
}
where position is a set of position pairs for the mentions of the entity (to get them, use CorefChain.getCorefMentions()). Here is a complete code example (in Groovy) that shows how to get from positions to tokens:
class Example {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // keep the raw text in a variable so it can be printed and sliced below
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);
        pipeline.annotate(document);
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
        println aText
        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
            CorefChain c = entry.getValue();
            println "ClusterId: " + entry.getKey();
            CorefMention cm = c.getRepresentativeMention();
            // note: startIndex/endIndex are passed to subSequence as character offsets here
            println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);
            List<CorefMention> cms = c.getCorefMentions();
            println "Mentions: ";
            cms.each { it ->
                print aText.subSequence(it.startIndex, it.endIndex) + "|";
            }
        }
    }
}
Output (I don't understand where the 's' comes from):
The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
ClusterId: 1
Representative Mention: he
Mentions: he|atom |s|
ClusterId: 6
Representative Mention: basic unit
Mentions: basic unit |
ClusterId: 8
Representative Mention: unit
Mentions: unit |
ClusterId: 10
Representative Mention: it
Mentions: it |
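A likely explanation for the stray 's', offered as a sketch rather than a confirmed diagnosis: the other answer in this thread reads `startIndex` and `endIndex` as 1-based token positions, whereas `subSequence` treats them as character offsets into the raw text. The stdlib-only example below compares the two readings. The three spans are assumptions inferred from the printed output for cluster 1, and the regex tokenizer is only a rough stand-in for the CoreNLP tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenVsCharOffsets {
    // Rough stand-in for a real tokenizer (an assumption):
    // runs of word characters, or single punctuation marks.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[^\\w\\s]").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Reads the indices as character offsets, the way subSequence does.
    static String byCharacters(String text, int start, int end) {
        return text.substring(start, end);
    }

    // Reads the indices as 1-based token positions with an exclusive end.
    static String byTokens(List<String> tokens, int start, int end) {
        return String.join(" ", tokens.subList(start - 1, end - 1));
    }

    public static void main(String[] args) {
        String text = "The atom is a basic unit of matter, it consists of a dense "
            + "central nucleus surrounded by a cloud of negatively charged electrons.";
        List<String> tokens = tokenize(text);
        // Assumed (start, end) pairs for the three mentions of cluster 1.
        int[][] spans = { {1, 3}, {4, 9}, {10, 11} };
        for (int[] s : spans) {
            System.out.println("\"" + byCharacters(text, s[0], s[1]) + "\" vs \""
                + byTokens(tokens, s[0], s[1]) + "\"");
        }
        // prints:
        // "he" vs "The atom"
        // "atom " vs "a basic unit of matter"
        // "s" vs "it"
    }
}
```

Under these assumptions the 's' is simply character 10 of the sentence (the second letter of "is"), while token 10 is "it".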
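These are the latest results from the annotator.
The markers are as follows:
[Sentence number,'id'] Cluster_no Text_Associated
Text belonging to the same cluster refers to the same context.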
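Taken together, the answers suggest reading the printed annotation from the question as a map from cluster ID to a list of number pairs. The sketch below parses that printed form with a regular expression. It assumes the exact string layout shown in the question, and it follows the marker description above in taking the first number of each pair as the sentence number and the second as the 'id'; `CorefMapParser` is a hypothetical helper, not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CorefMapParser {
    // Parses text such as "{1=[1 1, 1 2], 5=[1 3]}" into
    // clusterId -> list of {sentenceNumber, id} pairs.
    static Map<Integer, List<int[]>> parse(String printed) {
        Map<Integer, List<int[]>> clusters = new LinkedHashMap<>();
        Matcher m = Pattern.compile("(\\d+)=\\[([^\\]]*)\\]").matcher(printed);
        while (m.find()) {
            List<int[]> pairs = new ArrayList<>();
            for (String pair : m.group(2).split(",")) {
                String trimmed = pair.trim();
                if (trimmed.isEmpty()) {
                    continue; // tolerate an empty "[]" list
                }
                String[] parts = trimmed.split("\\s+");
                pairs.add(new int[] {
                    Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) });
            }
            clusters.put(Integer.parseInt(m.group(1)), pairs);
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<Integer, List<int[]>> clusters =
            parse("{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}");
        for (Map.Entry<Integer, List<int[]>> e : clusters.entrySet()) {
            System.out.println("cluster " + e.getKey() + ": "
                + e.getValue().size() + " pair(s)");
        }
        // prints:
        // cluster 1: 2 pair(s)
        // cluster 5: 1 pair(s)
        // cluster 7: 1 pair(s)
        // cluster 9: 1 pair(s)
    }
}
```

Read this way, only cluster 1 links more than one entry in the question's example; the other clusters hold a single entry each.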