
NLTK discourse tree to edge list

I have the following string:

dt = ' ( NS-elaboration ( EDU 1 )  ( NS-elaboration ( EDU 2 )  ( NS-elaboration ( EDU 3 )  ( EDU 4 )  )  )  ) '

I can convert it to an NLTK tree as follows:

from nltk import Tree
t = Tree.fromstring(dt)

This tree is illustrated in this link.

What I want is the edge list of this tree. Something similar to the following:

NS-elaboration0    EDU1
NS-elaboration0    NS-elaboration1
NS-elaboration1    EDU2
NS-elaboration1    NS-elaboration2
NS-elaboration2    EDU3
NS-elaboration2    EDU4

where the number after NS-elaboration is the depth of that node in the tree (0 for the root), and the number after EDU is the leaf it contains.

I tried to find a builtin for this, but in the end I just built the following algorithm:

Code:

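from collections import deque

from nltk import Tree

def get_edges(tree, i):
    # The parent is labelled with its own depth i.
    from_str = f"{tree.label()}{i}"
    # Preterminal children such as (EDU 1) are labelled with their leaf value.
    children = [f"{child.label()}{child.leaves()[0]}" for child in tree if isinstance(child, Tree) and child.height() == 2]
    # Deeper subtrees are labelled with the next depth.
    children.extend([f"{child.label()}{i+1}" for child in tree if isinstance(child, Tree) and child.height() > 2])
    return [(from_str, child) for child in children]

def tree_to_edges(tree):
    rv = []
    # Breadth-first traversal. Each queue entry carries the node's own depth,
    # so the suffix stays correct even when a level has several subtrees.
    to_check = deque([(tree, 0)])
    while to_check:
        tree_to_check, depth = to_check.popleft()
        rv.extend(get_edges(tree_to_check, depth))
        to_check.extend((child, depth + 1) for child in tree_to_check if isinstance(child, Tree) and child.height() > 2)
    return rv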

Usage:

>>> dt = ' ( NS-elaboration ( EDU 1 )  ( NS-elaboration ( EDU 2 )  ( NS-elaboration ( EDU 3 )  ( EDU 4 )  )  )  ) '
>>> t = Tree.fromstring(dt)
>>> tree_to_edges(t)
[('NS-elaboration0', 'EDU1'),
 ('NS-elaboration0', 'NS-elaboration1'),
 ('NS-elaboration1', 'EDU2'),
 ('NS-elaboration1', 'NS-elaboration2'),
 ('NS-elaboration2', 'EDU3'),
 ('NS-elaboration2', 'EDU4')]
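
If you want to sanity-check the labelling scheme without NLTK installed, here is a rough standard-library sketch of the same idea. The helper names (parse_brackets, is_preterminal, edges) are made up for this example and are not part of NLTK. It assumes the input is a single well-formed bracketed tree like the one in the question, and it walks the tree depth-first, so for trees that branch more than this one the edges may come out in a different order than the breadth-first version above.

```python
import re


def parse_brackets(s):
    """Parse '( label child ... )' into [label, child, ...]; leaves stay strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1                          # skip "("
        node = [tokens[pos]]              # first token after "(" is the label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())
            else:
                node.append(tokens[pos])  # leaf token
                pos += 1
        pos += 1                          # skip ")"
        return node

    return parse_node()


def is_preterminal(node):
    """True for nodes like ['EDU', '1']: at least one child, all of them leaves."""
    return len(node) > 1 and all(isinstance(child, str) for child in node[1:])


def edges(node, depth=0):
    """Depth-first edge list: subtrees get a depth suffix, preterminals their first leaf."""
    parent = f"{node[0]}{depth}"
    result = []
    for child in node[1:]:
        if isinstance(child, str):
            continue                      # bare leaf under an internal node: no edge
        if is_preterminal(child):
            result.append((parent, f"{child[0]}{child[1]}"))
        else:
            result.append((parent, f"{child[0]}{depth + 1}"))
            result.extend(edges(child, depth + 1))
    return result
```

Calling edges(parse_brackets(dt)) on the string from the question should give the same six pairs as tree_to_edges(t). As with the NLTK version, two subtrees with the same label at the same depth end up with the same name, so the scheme only identifies nodes uniquely for chain-like trees such as this one.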
