
How to convert an NLTK tree (Stanford) into Newick format in Python?

I have this Stanford parse tree and I want to convert it to Newick format.

    (ROOT
     (S
        (NP (DT A) (NN friend))
        (VP
         (VBZ comes)
         (NP
           (NP (JJ early))
           (, ,)
           (NP
             (NP (NNS others))
             (SBAR
                (WHADVP (WRB when))
                (S (NP (PRP they)) (VP (VBP have) (NP (NN time))))))))))

There may be ways to do this with string processing alone, but I would parse the tree and then print it recursively in Newick format. A minimal implementation:

import re

class Tree(object):
    def __init__(self, label):
        self.label = label
        self.children = []

    @staticmethod
    def _tokenize(string):
        return list(reversed(re.findall(r'\(|\)|[^ \n\t()]+', string)))

    @classmethod
    def from_string(cls, string):
        tokens = cls._tokenize(string)
        return cls._tree(tokens)

    @classmethod
    def _tree(cls, tokens):
        t = tokens.pop()
        if t == '(':
            tree = cls(tokens.pop())
            for subtree in cls._trees(tokens):
                tree.children.append(subtree)
            return tree
        else:
            return cls(t)

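    @classmethod
    def _trees(cls, tokens):
        # Yield subtrees until the matching ')' has been consumed.
        # Use `return` to end the generator: raising StopIteration inside
        # a generator is a RuntimeError since Python 3.7 (PEP 479).
        while True:
            if not tokens:
                return
            if tokens[-1] == ')':
                tokens.pop()
                return
            yield cls._tree(tokens)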

    def to_newick(self):
        if len(self.children) == 1:
            # Unary node (e.g. a POS tag over a word): skip it, emit the child.
            return self.children[0].to_newick()
        elif self.children:
            return '(' + ','.join(child.to_newick() for child in self.children) + ')'
        else:
            # Leaf: the word itself.
            return self.label

Note that information is necessarily lost in the conversion, because only the terminal nodes (the words) are kept. Usage:

>>> s = """(ROOT (..."""
>>> Tree.from_string(s).to_newick()
...
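
The caveat about lost information, plus two details of Newick itself, can be illustrated with a standalone sketch. This is not the `Tree` class above: `parse_bracketed`, `quote`, and `to_newick` with its `keep_labels` flag are hypothetical names introduced here for illustration. The sketch assumes the program reading the output accepts single-quoted labels (needed for the `,` token in the example tree), accepts labels on internal nodes, and expects the usual terminating `;`.

```python
import re

def parse_bracketed(text):
    """Parse a bracketed parse tree into nested (label, children) tuples."""
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0

    def node():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != '(':
            return (tok, [])              # a bare token is a leaf (a word)
        label = tokens[pos]               # first token after '(' is the label
        pos += 1
        children = []
        while tokens[pos] != ')':
            children.append(node())
        pos += 1                          # consume the closing ')'
        return (label, children)

    return node()

def quote(label):
    """Single-quote labels that contain Newick metacharacters (assumption:
    the downstream reader supports quoted labels)."""
    if re.search(r"[\s()\[\]':;,]", label):
        return "'" + label.replace("'", "''") + "'"
    return label

def to_newick(tree, keep_labels=False):
    """Return the Newick body for `tree`, without the trailing ';'."""
    label, children = tree
    if not children:
        return quote(label)
    inner = ','.join(to_newick(child, keep_labels) for child in children)
    if keep_labels:
        # Keep the syntactic category as an internal node label.
        return '(' + inner + ')' + quote(label)
    # Leaves-only mode: collapse unary nodes, as in the class above.
    return inner if len(children) == 1 else '(' + inner + ')'

s = "(ROOT (S (NP (DT A) (NN friend)) (VP (VBZ comes) (, ,))))"
print(to_newick(parse_bracketed(s)) + ';')
# ((A,friend),(comes,','));
print(to_newick(parse_bracketed("(NP (DT A) (NN friend))"), keep_labels=True) + ';')
# ((A)DT,(friend)NN)NP;
```

With `keep_labels=True` the syntactic categories survive as internal node labels, at the cost of single-child nodes that some Newick readers may reject; the leaves-only mode matches the behaviour of the class above apart from the quoting and the final `;`.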

