
How to convert Newick tree format to a tree-like hierarchical object?

I want to convert a Newick file to a hierarchical object (similar to what was posted in this post) in Python.

My input is a Newick file like this:

(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.9

The original post parses the string character by character. To also store the branch lengths, I modified the JavaScript code (from here) as follows:

var newick = '// (A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.9',
    stack = [],
    child,
    root = [],
    node = root;
var na = "";

newick.split('').reverse().forEach(function(n) {
  switch(n) {
    case ')':
      // ')' => begin child node
      if (na != "") {
        node.push(child = { name: na });
        na = "";
      }
      stack.push(node);
      child.children = [];
      node = child.children;
      break;
    case '(':
      // '(' => end of child node
      if (na != "") {
        node.push(child = { name: na });
        na = "";
      }
      node = stack.pop();
      // console.log(node);
      break;
    case ',':
      // ',' => separator (ignored)
      if (na != "") {
        node.push(child = { name: na });
        na = "";
      }
      break;
    default:
      // assume all other characters are node names
      // node.push(child = { name: n });
      na += n;
      break;
  }
});
console.log(node);

Now, I want to translate this code to Python.

Here's my attempt (I know it's incorrect):

class Node:

  def __init__(self):
    self.Name = ""
    self.Value = 0
    self.Children = []

newick = "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5,G:0.8)F:0.9"
stack = []
# root = []
# node = []

for i in list(reversed(newick)):
  if i == ')':
    if na != "":
      node = Node()
      node.Name = na
      child.append(node)
      na = ""
    stack.append(node)
    # insert logic
    child = node.Children
    # child.append(child)

  elif i == '(':
    if (na != ""):
      child = Node()
      child.Name = na
      node.append(child)
      na = ""
    node = stack.pop()
  elif i == ',':
    if (na != ""):
      node = Node()
      node.Name = na
      node.append(child)
      na = ""
  else:
    na += n

Since I am totally new to JavaScript, I am having trouble 'translating' the code to Python. In particular, I didn't understand the following lines:

child.children = [];
node = child.children;

How can I correctly write this in Python, so that it also extracts the lengths?

The following code might not be an exact translation of the JavaScript code, but it works as expected. There were some issues, such as "n" not being defined. I also added parsing of the node label into a name and a value, and a parent field.

You should consider using an existing parser such as https://biopython.org/wiki/Phylo, as it already gives you the infrastructure and algorithms to work with trees.

class Node:
    # Added parsing of the "na" variable to name and value.
    # Added a parent field
    def __init__(self, name_val):
        name, val_str = name_val[::-1].split(":")
        self.name = name
        self.value = float(val_str)
        self.children = []
        self.parent = None

    # Method to get the depth of the node (for printing)
    def get_depth(self):
        current_node = self
        depth = 0
        while current_node.parent:
            current_node = current_node.parent
            depth += 1
        return depth

    # String representation
    def __str__(self):
        return "{}:{}".format(self.name, self.value)

newick = "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5,G:0.8)F:0.9"

root = None
# na was not defined before.
na = ""
stack = []
for i in list(reversed(newick)):
    if i == ')':
        if na != "":
            node = Node(na)
            na = ""
            if len(stack):
                stack[-1].children.append(node)
                node.parent = stack[-1]
            else:
                root = node
            stack.append(node)

    elif i == '(':
        if (na != ""):
            node = Node(na)
            na = ""
            stack[-1].children.append(node)
            node.parent = stack[-1]
        stack.pop()
    elif i == ',':
        if (na != ""):
            node = Node(na)
            na = ""
            stack[-1].children.append(node)
            node.parent = stack[-1]
    else:
        # n was not defined before, changed to i.
        na += i

# Just to print the parsed tree.
print_stack = [root]
while len(print_stack):
    node = print_stack.pop()
    print(" " * node.get_depth(), node)
    print_stack.extend(node.children)

The output of the print loop at the end is the following:

 F:0.9
  A:0.1
  B:0.2
  E:0.5
   C:0.3
   D:0.4
  G:0.8
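
As a rough illustration of the Biopython suggestion above, the sketch below reads a Newick string with Bio.Phylo and copies the result into plain nested dictionaries. The helper names load_newick and clade_to_dict are made up for this example, and it assumes Biopython is installed (it is a third-party package, so the import is done lazily inside the function).

```python
from io import StringIO


def load_newick(text):
    """Parse a Newick string with Biopython (third-party, imported lazily)."""
    from Bio import Phylo  # assumption: installed via `pip install biopython`
    return Phylo.read(StringIO(text), "newick")


def clade_to_dict(clade):
    """Recursively copy a clade into plain dicts.

    Only relies on the clade attributes `name`, `branch_length` and `clades`.
    """
    return {
        "name": clade.name,
        "length": clade.branch_length,
        "children": [clade_to_dict(c) for c in clade.clades],
    }
```

With Biopython available, clade_to_dict(load_newick("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.9;").root) would be expected to give a nested dictionary rooted at the node named F.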

Some comments on the JavaScript version:

  • It has some code repetition (if (na != '') ...), which is easy to avoid.
  • It uses node as the variable name for an array. Readability improves when you use a plural word for arrays (or lists in Python).
  • It does not output what you want: it outputs nodes with names like "9.0:F", without separating the length from the name.

Because of the last point, the code first needs to be corrected before translating it to Python. It should support splitting the name/length attributes, allowing either of them to be optional. Additionally, it could assign id values to each created node and add a parentid property to refer to a node's parent.

I personally prefer coding with recursion instead of using a stack variable. Also, with a regular expression API you can easily tokenise the input to facilitate the parsing:

JavaScript version of Newick format parser
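
To make the "either of them optional" requirement concrete, here is a minimal sketch of a label splitter. The function name split_label is hypothetical, and it assumes the label has already been put back into reading order (for example "E:0.5", not "5.0:E").

```python
def split_label(label):
    """Split a Newick label such as 'E:0.5' into (name, length).

    Either part may be missing; a missing part is returned as None.
    """
    name, sep, length = label.partition(":")
    name = name.strip() or None
    length = float(length) if sep and length.strip() else None
    return name, length


for label in ("E:0.5", "A", ":0.3", ""):
    print(repr(label), "->", split_label(label))
```

Using str.partition avoids the ValueError that unpacking the result of split(":") raises when the colon is absent.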

function parse(newick) {
    let nextid = 0;
    const regex = /([^:;,()\s]*)(?:\s*:\s*([\d.]+)\s*)?([,);])|(\S)/g;
    newick += ";";

    return (function recurse(parentid = -1) {
        const children = [];
        let name, length, delim, ch, all, node, id = nextid++;

        [all, name, length, delim, ch] = regex.exec(newick);
        if (ch == "(") {
            while ("(,".includes(ch)) {
                [node, ch] = recurse(id);
                children.push(node);
            }
            [all, name, length, delim, ch] = regex.exec(newick);
        }
        return [{id, name, length: +length, parentid, children}, delim];
    })()[0];
}

// Example use:
console.log(parse("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5,G:0.8)F:0.9"));

Python version of Newick format parser

import re

def parse(newick):
    tokens = re.findall(r"([^:;,()\s]*)(?:\s*:\s*([\d.]+)\s*)?([,);])|(\S)", newick+";")

    def recurse(nextid = 0, parentid = -1): # one node
        thisid = nextid
        children = []

        name, length, delim, ch = tokens.pop(0)
        if ch == "(":
            while ch in "(,":
                node, ch, nextid = recurse(nextid+1, thisid)
                children.append(node)
            name, length, delim, ch = tokens.pop(0)
        return {"id": thisid, "name": name, "length": float(length) if length else None, 
                "parentid": parentid, "children": children}, delim, nextid

    return recurse()[0]

# Example use:
print(parse("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5,G:0.8)F:0.9"))
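
To see what the tokenising step feeds into the recursion, the sketch below runs the same regular expression on a shorter input and prints the four captured groups of each token. The wrapper name tokenize and the shortened input string are only for illustration.

```python
import re

# Same pattern as in parse(): either "name[:length]" followed by a delimiter,
# or a single other non-space character (in practice the opening parenthesis).
TOKEN = re.compile(r"([^:;,()\s]*)(?:\s*:\s*([\d.]+)\s*)?([,);])|(\S)")


def tokenize(newick):
    """Return a list of (name, length, delim, other) tuples."""
    return TOKEN.findall(newick + ";")


for name, length, delim, other in tokenize("(A:0.1,(C:0.3)E:0.5)F:0.9"):
    print(repr(name), repr(length), repr(delim), repr(other))
```

Each "(" arrives as a token whose only non-empty field is the last one, which is what the check ch == "(" in recurse relies on.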

About the assignment node = child.children in your JavaScript code: this moves the "pointer" (i.e. node) one level deeper in the tree that is being created, so that in the next iteration of the algorithm any new nodes are appended at that level. With node = stack.pop(), that pointer moves back up one level in the tree.
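
The pointer behaviour described above can be reproduced in Python with plain lists and dicts. This is only a hand-driven sketch of three steps of the loop, not a full parser, and the labels are written in reading order for clarity.

```python
root = []      # top-level list of nodes
node = root    # the list currently being filled (the "pointer")
stack = []     # lists to return to once a level is finished

# ')' seen: create the parent node, remember the current level, descend.
child = {"name": "F:0.9"}
node.append(child)
stack.append(node)
child["children"] = []
node = child["children"]  # same list object as child["children"], not a copy

# Node names seen at this level: they land one level deeper.
node.append({"name": "B:0.2"})
node.append({"name": "A:0.1"})

# '(' seen: this level is finished, move back up.
node = stack.pop()

print(root)
```

Because node and child["children"] refer to the same list, appending through node is what attaches the new nodes to their parent.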

Here is a pyparsing parser for this input string. It uses pyparsing's nestedExpr parser builder with a defined content argument, so that the results are parsed as key-value pairs rather than as simple strings (the default).

import pyparsing as pp
# suppress punctuation literals from parsed output
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)

ident = pp.Word(pp.alphas)
value = pp.pyparsing_common.real

element = pp.Group(ident + ':' + value)
parser = pp.OneOrMore(pp.nestedExpr(content=pp.delimitedList(element) + pp.Optional(','))
                      | pp.delimitedList(element))

tests = """
    (A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.9
"""
parsed_results = parser.parseString(tests)
import pprint
pprint.pprint(parsed_results.asList(), width=20)

Gives:

[[['A', 0.1],
  ['B', 0.2],
  [['C', 0.3],
   ['D', 0.4]],
  ['E', 0.5]],
 ['F', 0.9]]

Note that the pyparsing expression for parsing reals also converts the values to Python floats at parse time.
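
The nested list above places each group of children immediately before the [name, value] pair of their parent. As one possible follow-up step, the sketch below folds that list into nested dictionaries. The function name pairs_to_tree is hypothetical, and the input is copied by hand from the output shown above, so the example does not need pyparsing to run.

```python
def pairs_to_tree(items):
    """Turn [children_group, [name, value], ...] lists into nested dicts."""
    nodes = []
    pending = []  # children waiting for the pair that names their parent
    for item in items:
        if item and isinstance(item[0], str):
            # A [name, value] pair: attach any pending children to it.
            name, value = item
            nodes.append({"name": name, "length": value, "children": pending})
            pending = []
        else:
            # A nested group: the children of the next pair.
            pending = pairs_to_tree(item)
    return nodes


parsed = [[["A", 0.1], ["B", 0.2], [["C", 0.3], ["D", 0.4]], ["E", 0.5]],
          ["F", 0.9]]
tree = pairs_to_tree(parsed)[0]
print(tree["name"], [c["name"] for c in tree["children"]])
```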
