
How to construct a tree from a string?

Input string: '(SBARQ (WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow)))'

This is what the tree looks like:

[Image: what the tree looks like]

The aim is to build a TreeNode for each label, whose value is the full string that the node covers.

class TreeNode:
    def __init__(self):
        self.children = []
        self.label = ""
        self.text = ""

This is what the expected tree looks like:

SBARQ: "Where is the cow"
WHADVP: "Where"
WRB: "Where"
VBZ: "is"
NP: "the cow"
DT: "the"
NN: "cow"

The main thing is to identify the children. I'll try to explain my code:

class TreeNode:
    def __init__(self, string):
        
        self.im_leaf = string[0] != '('
        self.children = []

        if not self.im_leaf:
            first_space_index = string.index(" ")
            self.label = string[1:first_space_index]
            
            string = string[first_space_index + 1 : -1] #rest of the tree
            for node in self.split(string):
                self.children.append(TreeNode(node))
        else:
            self.label = string

        self.text = self.calculate_text()

    def calculate_text(self):
        if self.im_leaf:
            return self.label + ' '
        
        text = ''
        for node in self.children:
            text += node.calculate_text()

        return text


    def split(self, string):
        
        splitted = []
        pair_parenthesis = 0 
        index = 0

        for i in range(len(string)):
            char = string[i]

            if char == '(':
                pair_parenthesis += 1
            elif char == ')':
                pair_parenthesis -= 1
                if pair_parenthesis == 0:

                    new_node = string[index:i+1]
                    if new_node[0] == ' ':
                        new_node = new_node[1:]
                    splitted.append(new_node)
                    index = i + 1

        if len(splitted) == 0:  
            splitted = [string]
        
        return splitted

First, __init__ checks whether the given string describes a leaf, which is determined by its first character: if that is not an open parenthesis ('('), the string describes a leaf, e.g. "cow" or "Where", as opposed to non-leaf strings such as "(NP (DT the) (NN cow))".

If string describes a leaf, then label = string; otherwise label is the substring between the first parenthesis (which is always in the first position) and the first white space. In the latter case the different branches also need to be identified, which is done by the split method. The children are then built recursively.
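
The leaf test and label extraction can be tried in isolation. The helper below is only a sketch: split_head is a hypothetical name that does not appear in the answer's class, and it assumes the node string has no surrounding whitespace and exactly one space after the label.

```python
def split_head(s):
    """Return (label, rest) for one node string; rest is None for a leaf."""
    if s[0] != '(':
        # No opening parenthesis: the whole string is a leaf word.
        return s, None
    space = s.index(' ')
    # Label sits between '(' and the first space; rest drops the final ')'.
    return s[1:space], s[space + 1:-1]

print(split_head("(NP (DT the) (NN cow))"))  # ('NP', '(DT the) (NN cow)')
print(split_head("cow"))                     # ('cow', None)
```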

The split method identifies the branches by counting open and closed parentheses. Note that the first branch in "(WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow))" is (WHADVP (WRB Where)), which ends exactly where the number of open parentheses equals the number of closed ones for the first time (this difference is what is stored in pair_parenthesis). The same happens with the second and third branches. If a branch has only one child, for example "(WRB Where)", the method is called with the string "Where", and in that case it returns the one-element list ["Where"].
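
The counting idea can also be written as a standalone function. This is a sketch rather than the method from the class above: split_branches and depth are names chosen here for illustration, and balanced parentheses are assumed.

```python
def split_branches(s):
    """Split a string of sibling subtrees into its top-level branches."""
    branches = []
    depth = 0   # open parentheses minus closed ones seen so far
    start = 0   # index where the current branch begins
    for i, ch in enumerate(s):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth == 0:
                # Back at the top level: one complete branch ends here.
                branches.append(s[start:i + 1].strip())
                start = i + 1
    # No parentheses at all means the string is a single leaf word.
    return branches or [s]

print(split_branches("(WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow))"))
# ['(WHADVP (WRB Where))', '(VBZ is)', '(NP (DT the) (NN cow))']
print(split_branches("Where"))
# ['Where']
```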

Finally, text is assigned by the calculate_text method, which recursively walks the tree looking for leaves. If the node is a leaf, its label (followed by a space) is added to text.
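
Because each leaf contributes its label plus a space, the resulting text ends with a trailing space. If that is unwanted, one possible variant joins the leaf words instead. The Node class and covered_text function below are hypothetical stand-ins used only to keep the sketch self-contained; they are not part of the code above.

```python
class Node:
    """Minimal stand-in for a tree node: a label and a list of children."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def covered_text(node):
    # A leaf covers just its own word.
    if not node.children:
        return node.label
    # An inner node covers its children's text, joined by single spaces.
    return " ".join(covered_text(child) for child in node.children)

np = Node("NP", [Node("DT", [Node("the")]), Node("NN", [Node("cow")])])
print(covered_text(np))  # the cow
```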

Now some tests:

test_tree = TreeNode('(SBARQ (WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow)))')

print(test_tree.label)
#SBARQ

print(test_tree.text)
#Where is the cow 

print(test_tree.children[0].text) #text from the "WHADVP" node
#Where

Please let me know if something is not clear.

You could use recursion to create the tree and the aggregated texts simultaneously.

I would:

  • Allow the constructor to be called with a label argument;
  • Define the function that creates a tree from a string as a static method on your class;
  • Use a regular expression to tokenise the input;
  • Add some assert statements that display readable exception messages when the input format is not as expected;
  • Define __iter__ on your class, so that you can easily iterate over all the nodes that are present in the tree.
import re

class TreeNode:
    def __init__(self, label=""):
        self.children = []
        self.label = label
        self.text = ""

    @staticmethod
    def create_from_string(text):
        tokens = re.finditer(r"[()]|[^\s()]+", text)
        match = next(tokens)
        token = match.group()
        assert token == "(", "Expected '(' at {}, but got '{}'".format(match.start(), token)

        def recur():
            node = None
            while True:
                match = next(tokens)
                i = match.start()
                token = match.group()
                if token == ")":
                    assert node, "Expected label at {}, but got ')'".format(i)
                    assert node.text, "Expected text at {}, but got ')'".format(i)
                    return node
                if token == "(":
                    assert node, "Expected label at {}, but got '('".format(i)
                    child = recur()
                    node.children.append(child)
                    token = child.text
                if node:
                    node.text = "{} {}".format(node.text, token).lstrip() 
                else:
                    node = TreeNode(token)

        return recur()

    def __iter__(self):
        def nodes():
            yield self
            for child in self.children:
                yield from child
        return nodes()
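
To see what the tokenising step hands to the recursive function, the same regular expression can be run on its own. This is just an illustration: TOKEN is a name introduced here, and re.findall is used instead of re.finditer because only the token strings are shown, not their positions.

```python
import re

# Either a single parenthesis, or a run of characters that are
# neither whitespace nor parentheses (a label or a word).
TOKEN = re.compile(r"[()]|[^\s()]+")

tokens = TOKEN.findall("(NP (DT the) (NN cow))")
print(tokens)
# ['(', 'NP', '(', 'DT', 'the', ')', '(', 'NN', 'cow', ')', ')']
```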

Here is how the above can be used for your concrete example:

s = '(SBARQ (WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow)))'
tree = TreeNode.create_from_string(s)
for node in tree:
    print("{}: {}".format(node.label, node.text))

This latter code will output:

SBARQ: Where is the cow
WHADVP: Where
WRB: Where
VBZ: is
NP: the cow
DT: the
NN: cow
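
One caveat about the assert-based checks: assert statements are skipped when Python runs with the -O flag. As written, create_from_string also appears not to detect input that stops before the final ')' (the next(tokens) call would raise a bare StopIteration) or tokens that follow the final ')'. If that matters, a separate pre-check could be run first. The sketch below is an assumption-laden illustration: check_balanced is a hypothetical helper, not part of the class above, and it only verifies that parentheses are balanced.

```python
def check_balanced(text):
    """Raise ValueError if the parentheses in text are not balanced."""
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared with no matching '(' before it.
                raise ValueError("Unexpected ')' at {}".format(i))
    if depth != 0:
        raise ValueError("Missing {} closing ')'".format(depth))

check_balanced("(NP (DT the) (NN cow))")  # passes silently
try:
    check_balanced("(NP (DT the)")
except ValueError as err:
    print(err)  # Missing 1 closing ')'
```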
