
Tree node mapping to GrammaticalStructure dependency

I'm using the Stanford Core NLP framework 3.4.1 to construct syntactic parse trees of Wikipedia sentences. After that I would like to extract from each parse tree all of the tree fragments of a certain length (i.e. at most 5 nodes), but I am having a lot of trouble figuring out how to do that without creating a new GrammaticalStructure for each sub-tree.

This is what I am using to construct the parse tree; most of the code is from TreePrint.printTreeInternal() for the conll2007 format, which I modified to suit my output needs:

    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(documentText));

    for (List<HasWord> sentence : dp) {
        StringBuilder plaintexSyntacticTree = new StringBuilder();
        String sentenceString = Sentence.listToString(sentence);

        PTBTokenizer<Word> tkzr = PTBTokenizer.newPTBTokenizer(new StringReader(sentenceString));
        List<Word> toks = tkzr.tokenize();
        // skip sentences smaller than 5 words
        if (toks.size() < 5)
            continue;
        log.info("\nTokens are: "+PTBTokenizer.labelList2Text(toks));
        // NB: loading the model is expensive; in real code it should be loaded
        // once, outside the sentence loop
        LexicalizedParser lp = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
                "-maxLength", "80");
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        Tree parse = lp.apply(toks);
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection<TypedDependency> tdl = gs.allTypedDependencies();
        Tree it = parse.deepCopy(parse.treeFactory(), CoreLabel.factory());
        it.indexLeaves();

        List<CoreLabel> tagged = it.taggedLabeledYield();
        // getSortedDeps
        List<Dependency<Label, Label, Object>> sortedDeps = new ArrayList<Dependency<Label, Label, Object>>();
        for (TypedDependency dep : tdl) {
            NamedDependency nd = new NamedDependency(dep.gov().label(), dep.dep().label(), dep.reln().toString());
            sortedDeps.add(nd);
        }
        Collections.sort(sortedDeps, Dependencies.dependencyIndexComparator());

        for (int i = 0; i < sortedDeps.size(); i++) {
          Dependency<Label, Label, Object> d = sortedDeps.get(i);

          CoreMap dep = (CoreMap) d.dependent();
          CoreMap gov = (CoreMap) d.governor();

          Integer depi = dep.get(CoreAnnotations.IndexAnnotation.class);
          Integer govi = gov.get(CoreAnnotations.IndexAnnotation.class);

          CoreLabel w = tagged.get(depi-1);

          // Used for both coarse and fine POS tag fields
          String tag = PTBTokenizer.ptbToken2Text(w.tag());

          String word = PTBTokenizer.ptbToken2Text(w.word());

          if (plaintexSyntacticTree.length() > 0)
              plaintexSyntacticTree.append(' ');
          plaintexSyntacticTree.append(word+'/'+tag+'/'+govi);
        }
        log.info("\nTree is: "+plaintexSyntacticTree);
    }

In the output I need to get something in this format: word/Part-Of-Speech-tag/parentID, which is compatible with the output of the Google Syntactic N-Grams.
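To make the target format concrete, here is a small self-contained illustration. The sentence, tags, and head indices below are hand-annotated toy data (not parser output); parentID is the 1-based index of the head word, with 0 marking the root:

```java
// Illustrates the word/POS-tag/parentID format used by the Google
// Syntactic N-Grams, with hand-annotated toy data.
public class SyntacticNgramFormat {

    // Join parallel arrays of words, POS tags and 1-based head indices
    // (0 = root) into a single space-separated line.
    public static String format(String[] words, String[] tags, int[] parents) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(words[i]).append('/').append(tags[i]).append('/').append(parents[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "John saw Mary": "saw" is the root; "John" and "Mary" depend on it.
        String[] words = {"John", "saw", "Mary"};
        String[] tags = {"NNP", "VBD", "NNP"};
        int[] parents = {2, 0, 2};
        System.out.println(format(words, tags, parents));
        // prints: John/NNP/2 saw/VBD/0 Mary/NNP/2
    }
}
```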

I can't figure out how I could get the POS tag and parentID from the original syntactic parse tree (stored in the GrammaticalStructure as a dependency list, as far as I understand) for only a subset of nodes from the original tree.

I have also seen some mentions of the HeadFinder, but as far as I understand it is only useful for constructing the GrammaticalStructure, whereas I am trying to use the existing one. I have also seen a somewhat similar issue about converting a GrammaticalStructure to a Tree, but that is still an open issue, and it does not tackle sub-trees or creating a custom output. Instead of creating a tree from the GrammaticalStructure, I was thinking that I could just use the node reference from the tree to get the information I need, but I am basically missing an equivalent of getNodeByIndex() which can look up a node by index in the GrammaticalStructure.

UPDATE: I have managed to get all of the required information by using the SemanticGraph, as suggested in the answer. Here is a basic snippet of code that does that:

    String documentText = value.toString();
    Properties props = new Properties();
    props.put("annotators", "tokenize,ssplit,pos,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation(documentText);
    pipeline.annotate(annotation);
    List<CoreMap> sentences =  annotation.get(CoreAnnotations.SentencesAnnotation.class);

    if (sentences != null && sentences.size() > 0) {
        CoreMap sentence = sentences.get(0);
        SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
        log.info("SemanticGraph: "+sg.toDotFormat());
        for (SemanticGraphEdge edge : sg.edgeIterable()) {
            int headIndex = edge.getGovernor().index();
            int depIndex = edge.getDependent().index();
            log.info("[" + headIndex + "]" + edge.getSource().word() + "/" + depIndex
                    + "/" + edge.getSource().get(CoreAnnotations.PartOfSpeechAnnotation.class));
        }
    }

The Google syntactic n-grams use dependency trees rather than constituency trees. So, indeed, the only way to get that representation is by converting the tree to a dependency tree. The parent ID you would get from the constituency parse refers to an intermediate node, rather than to another word in the sentence.

My recommendation would be to run the dependency parser annotator (annotators = tokenize,ssplit,pos,depparse), and from the resulting SemanticGraph extract all clusters of 5 neighboring nodes.
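The cluster extraction itself can be sketched independently of the CoreNLP API. Assuming the dependency tree has been reduced to a 1-based parent-index array (e.g. built from SemanticGraph edges as in the update above; the class and method names here are hypothetical), the fragments of at most k nodes are exactly the connected node sets of size ≤ k in the undirected view of the tree, which can be enumerated by growing each set one neighbor at a time:

```java
import java.util.*;

// Sketch: enumerate all connected node sets of size <= k in a dependency
// tree given as a 1-based parent array (0 = root). This stands in for
// walking a CoreNLP SemanticGraph directly.
public class TreeFragments {

    public static Set<Set<Integer>> fragments(int[] parent, int k) {
        int n = parent.length;                       // parent[i] is the head of token i+1
        List<Set<Integer>> adj = new ArrayList<>();  // undirected adjacency, 1-based
        for (int i = 0; i <= n; i++) adj.add(new HashSet<>());
        for (int i = 1; i <= n; i++) {
            int p = parent[i - 1];
            if (p > 0) { adj.get(i).add(p); adj.get(p).add(i); }
        }
        Set<Set<Integer>> result = new HashSet<>();
        for (int start = 1; start <= n; start++)
            grow(new TreeSet<>(Set.of(start)), adj, k, result);
        return result;
    }

    // Record the current connected set, then try extending it by every
    // neighbor of every member; duplicates are pruned via the result set.
    private static void grow(TreeSet<Integer> cur, List<Set<Integer>> adj,
                             int k, Set<Set<Integer>> out) {
        if (!out.add(new TreeSet<>(cur)) || cur.size() == k) return;
        for (int v : new ArrayList<>(cur))
            for (int nb : adj.get(v))
                if (!cur.contains(nb)) {
                    cur.add(nb);
                    grow(cur, adj, k, out);
                    cur.remove(nb);
                }
    }

    public static void main(String[] args) {
        // "John saw Mary": saw(2) is the root; John(1) and Mary(3) attach to it.
        Set<Set<Integer>> frags = fragments(new int[]{2, 0, 2}, 2);
        // frags holds the singletons {1},{2},{3} and the pairs {1,2},{2,3};
        // {1,3} is excluded because those nodes are not adjacent.
        System.out.println(frags.size()); // prints: 5
    }
}
```

Each resulting index set can then be rendered in the word/POS/parentID format by looking up the corresponding tokens, treating any fragment node whose head falls outside the fragment as a local root.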
