
How to convert an NLTK tree (Stanford) into Newick format in Python?

I have this Stanford parse tree and I want to convert it into Newick format.

    (ROOT
     (S
        (NP (DT A) (NN friend))
        (VP
         (VBZ comes)
         (NP
           (NP (JJ early))
           (, ,)
           (NP
             (NP (NNS others))
             (SBAR
                (WHADVP (WRB when))
                (S (NP (PRP they)) (VP (VBP have) (NP (NN time))))))))))

There might be ways to do this with string processing alone, but I would parse the tree and print it in Newick format recursively. A somewhat minimal implementation:

import re

class Tree(object):
    def __init__(self, label):
        self.label = label
        self.children = []

    @staticmethod
    def _tokenize(string):
        # Split into "(", ")" and labels/words; reversed so that pop() reads left to right.
        return list(reversed(re.findall(r'\(|\)|[^ \n\t()]+', string)))

    @classmethod
    def from_string(cls, string):
        tokens = cls._tokenize(string)
        return cls._tree(tokens)

    @classmethod
    def _tree(cls, tokens):
        t = tokens.pop()
        if t == '(':
            tree = cls(tokens.pop())
            for subtree in cls._trees(tokens):
                tree.children.append(subtree)
            return tree
        else:
            return cls(t)

    @classmethod
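    def _trees(cls, tokens):
        # Yield sibling subtrees until the parent's closing ")" is consumed.
        # Use return rather than raise StopIteration: since PEP 479 (Python 3.7+),
        # a StopIteration raised inside a generator becomes a RuntimeError.
        while True:
            if not tokens:
                return
            if tokens[-1] == ')':
                tokens.pop()
                return
            yield cls._tree(tokens)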

    def to_newick(self):
        if len(self.children) == 1:
            # Unary node such as (DT A): pass the single child through unchanged.
            return self.children[0].to_newick()
        elif self.children:
            return '(' + ','.join(child.to_newick() for child in self.children) + ')'
        else:
            # Leaf: a word of the sentence.
            return self.label
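
To make the parsing step more concrete, here is a small standalone sketch of what the tokenizer produces for a fragment of the tree. The free function `tokenize` is a hypothetical stand-in that mirrors `_tokenize` above; it is only meant to show why the token list is reversed.

```python
import re

def tokenize(string):
    # Match "(" or ")" or any run of characters that is neither
    # whitespace nor a parenthesis. The list is reversed so that
    # list.pop() hands back tokens in reading order.
    return list(reversed(re.findall(r'\(|\)|[^ \n\t()]+', string)))

tokens = tokenize("(NP (DT A) (NN friend))")
print(tokens[::-1])
# ['(', 'NP', '(', 'DT', 'A', ')', '(', 'NN', 'friend', ')', ')']
print(tokens.pop())
# (
```

Because the reversed list behaves like a stack, the recursive parser can consume one token at a time with `pop()` without tracking an index.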

Note that information is lost in the conversion, since only the terminal nodes (the words) are kept. Usage:

>>> s = """(ROOT (..."""
>>> Tree.from_string(s).to_newick()
...
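
If the phrase labels matter, Newick can carry them as well: an internal node's name is written after its closing parenthesis. The sketch below is one possible variant, not part of the implementation above. The helper names `parse`, `quote`, and `to_newick` are hypothetical, and the code assumes well-formed, balanced input. It also wraps labels such as `,` in single quotes, since an unquoted comma would clash with Newick punctuation.

```python
import re

def parse(string):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r'\(|\)|[^\s()]+', string)[::-1]

    def node():
        tok = tokens.pop()
        if tok != '(':
            return (tok, [])           # a bare token is a leaf
        label = tokens.pop()
        children = []
        while tokens[-1] != ')':       # assumes balanced parentheses
            children.append(node())
        tokens.pop()                   # discard the closing ')'
        return (label, children)

    return node()

def quote(label):
    # Labels containing Newick punctuation or whitespace are wrapped
    # in single quotes, with any inner single quote doubled.
    if re.search(r"[\s()\[\]':;,]", label):
        return "'" + label.replace("'", "''") + "'"
    return label

def to_newick(tree):
    label, children = tree
    if not children:
        return quote(label)
    # Internal node labels follow the closing parenthesis.
    return '(' + ','.join(to_newick(c) for c in children) + ')' + quote(label)

s = "(S (NP (DT A) (NN friend)) (VP (VBZ comes)))"
print(to_newick(parse(s)) + ';')
# (((A)DT,(friend)NN)NP,((comes)VBZ)VP)S;
```

This keeps every node, including unary ones such as `(A)DT`. Some Newick readers may reject single-child internal nodes, so it is worth checking what the target tool accepts.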

