How to construct a tree from a string?
Input string:
'(SBARQ (WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow)))'
This is what the tree looks like: [figure: parse tree of the input string]
The aim is to build a TreeNode for every label, with the label as the node and the full string it covers as its text.
class TreeNode:
    def __init__(self):
        self.children = []
        self.label = ""
        self.text = ""
This is what the resulting tree is expected to look like:
SBARQ: "Where is the cow"
    WHADVP: "Where"
        WRB: "Where"
    VBZ: "is"
    NP: "the cow"
        DT: "the"
        NN: "cow"
The main thing is to identify the children. I'll try to explain my code:
class TreeNode:
    def __init__(self, string):
        # A string that does not start with '(' is a leaf such as "cow".
        self.im_leaf = string[0] != '('
        self.children = []
        if not self.im_leaf:
            first_space_index = string.index(" ")
            self.label = string[1:first_space_index]
            string = string[first_space_index + 1 : -1]  # rest of the tree
            for node in self.split(string):
                self.children.append(TreeNode(node))
        else:
            self.label = string
        self.text = self.calculate_text()

    def calculate_text(self):
        # A leaf contributes its own word; an inner node joins its children's
        # text with single spaces, so no trailing space is left behind.
        if self.im_leaf:
            return self.label
        return ' '.join(node.calculate_text() for node in self.children)

    def split(self, string):
        splitted = []
        pair_parenthesis = 0  # open parentheses minus closed ones so far
        index = 0             # start of the current branch
        for i in range(len(string)):
            char = string[i]
            if char == '(':
                pair_parenthesis += 1
            elif char == ')':
                pair_parenthesis -= 1
                if pair_parenthesis == 0:
                    # The parentheses are balanced again: a branch just ended.
                    new_node = string[index:i+1]
                    if new_node[0] == ' ':
                        new_node = new_node[1:]
                    splitted.append(new_node)
                    index = i + 1
        if len(splitted) == 0:
            # No parentheses at all: the string is a single leaf.
            splitted = [string]
        return splitted
First, in __init__, I determine whether the given string describes a leaf by looking at its first character: if it is not an opening parenthesis ('('), the string describes a leaf, e.g. "cow" or "Where", as opposed to non-leaf strings such as "(NP (DT the) (NN cow))".
If string describes a leaf, then label = string; otherwise label is the substring between the opening parenthesis (which is always in the first position) and the first whitespace. In the latter case the different branches also have to be identified, which is done by the split method. The children are then built recursively.
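To make the label rule concrete, here is a minimal standalone sketch. The helper name split_label is hypothetical and not part of the answer's class; it only mirrors what __init__ does with the first space.

```python
def split_label(node_string):
    """Split "(LABEL rest...)" into (label, rest); a leaf yields (leaf, "")."""
    if not node_string.startswith("("):
        # No opening parenthesis: the whole string is a leaf such as "cow".
        return node_string, ""
    # Drop the outer parentheses, then cut at the first space.
    label, _, rest = node_string[1:-1].partition(" ")
    return label, rest


print(split_label("(NP (DT the) (NN cow))"))  # ('NP', '(DT the) (NN cow)')
print(split_label("cow"))                     # ('cow', '')
```

The second element is the part that still has to be divided into branches.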
The split method identifies the branches by counting opening and closing parentheses. Note that the first branch in "(WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow))" is (WHADVP (WRB Where)): it ends exactly where the number of opening parentheses first equals the number of closing ones (their difference is what is stored in pair_parenthesis). The same happens with the second and third branches. If a branch has a single leaf as its only child, for example "(WRB Where)", the method is called with the string "Where", which contains no parentheses; in that case it returns the list ["Where"].
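The counting idea can be tried in isolation. The function below is a sketch written for illustration; split_branches is a hypothetical name, and it trims surrounding whitespace with strip() rather than checking only the first character.

```python
def split_branches(string):
    """Split sibling branches by tracking parenthesis depth."""
    branches = []
    depth = 0   # open parentheses minus closed ones seen so far
    start = 0   # index where the current branch begins
    for i, char in enumerate(string):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                # Depth is back to zero, so a complete branch just ended.
                branches.append(string[start:i + 1].strip())
                start = i + 1
    # No parentheses at all: the string is a single leaf word.
    return branches or [string]


print(split_branches("(WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow))"))
# ['(WHADVP (WRB Where))', '(VBZ is)', '(NP (DT the) (NN cow))']
print(split_branches("Where"))
# ['Where']
```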
Finally, text is assigned by the method calculate_text, which recursively walks the tree looking for leaves. If the node is a leaf, its label is added to text.
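The same leaf-collecting recursion can be sketched without the class. Here the tree is assumed to be given as nested (label, children) tuples with plain strings as leaves; that representation and the name covered_text are only for illustration.

```python
def covered_text(node):
    """Return the words covered by a node, joined by single spaces."""
    if isinstance(node, str):
        return node  # a leaf is just its own word
    _label, children = node
    return " ".join(covered_text(child) for child in children)


np_node = ("NP", [("DT", ["the"]), ("NN", ["cow"])])
print(covered_text(np_node))  # the cow
```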
Now some tests:
test_tree = TreeNode('(SBARQ (WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow)))')
print(test_tree.label)
#SBARQ
print(test_tree.text)
#Where is the cow
print(test_tree.children[0].text) #text from the "WHADVP" node
#Where
Please let me know if something is not clear.
You could use recursion to create the tree and the aggregated texts simultaneously.
I would:
- add assert statements that display readable exception messages when the input format is not as expected;
- define __iter__ on your class, so that you can easily iterate over all the nodes in the tree.
import re
class TreeNode:
    def __init__(self, label=""):
        self.children = []
        self.label = label
        self.text = ""

    @staticmethod
    def create_from_string(text):
        # Tokens are single parentheses or runs of non-space, non-parenthesis characters.
        tokens = re.finditer(r"[()]|[^\s()]+", text)
        match = next(tokens)
        token = match.group()
        assert token == "(", "Expected '(' at {}, but got '{}'".format(match.start(), token)

        def recur():
            node = None
            while True:
                match = next(tokens)
                i = match.start()
                token = match.group()
                if token == ")":
                    assert node, "Expected label at {}, but got ')'".format(i)
                    assert node.text, "Expected text at {}, but got ')'".format(i)
                    return node
                if token == "(":
                    assert node, "Expected label at {}, but got '('".format(i)
                    child = recur()
                    node.children.append(child)
                    token = child.text  # the child's text is appended to the parent's
                if node:
                    node.text = "{} {}".format(node.text, token).lstrip()
                else:
                    node = TreeNode(token)  # the first token after '(' is the label

        return recur()

    def __iter__(self):
        def nodes():
            yield self
            for child in self.children:
                yield from child
        return nodes()
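To see what the regular expression in create_from_string produces, the pattern can be run on its own. The wrapper name tokenize is hypothetical; the pattern is the one used in the code above.

```python
import re

TOKEN_PATTERN = r"[()]|[^\s()]+"


def tokenize(text):
    """Return parentheses and words as separate tokens, skipping whitespace."""
    return re.findall(TOKEN_PATTERN, text)


print(tokenize("(NP (DT the) (NN cow))"))
# ['(', 'NP', '(', 'DT', 'the', ')', '(', 'NN', 'cow', ')', ')']
```

Because every parenthesis is its own token, the recursive function only has to react to three cases: "(", ")" and anything else.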
Here is how the TreeNode class above can be used for your concrete example:
s = '(SBARQ (WHADVP (WRB Where)) (VBZ is) (NP (DT the) (NN cow)))'
tree = TreeNode.create_from_string(s)
for node in tree:
    print("{}: {}".format(node.label, node.text))
This latter code will output:
SBARQ: Where is the cow
WHADVP: Where
WRB: Where
VBZ: is
NP: the cow
DT: the
NN: cow
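If the indented layout from the question is wanted, a depth-aware traversal is one possible way to get it. The sketch below uses a hypothetical stand-in class Node with the same label, text and children attributes, so that it runs on its own; format_tree is likewise an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Node:  # hypothetical stand-in with the same attributes as TreeNode
    label: str
    text: str
    children: list = field(default_factory=list)


def format_tree(node, depth=0, indent="    "):
    """Yield one 'label: "text"' line per node, indented by its depth."""
    yield '{}{}: "{}"'.format(indent * depth, node.label, node.text)
    for child in node.children:
        yield from format_tree(child, depth + 1, indent)


tree = Node("NP", "the cow", [Node("DT", "the"), Node("NN", "cow")])
print("\n".join(format_tree(tree)))
# NP: "the cow"
#     DT: "the"
#     NN: "cow"
```

Since format_tree only reads label, text and children, it should work equally on a tree built by create_from_string.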