
Parse a Penn syntax tree to extract its grammar rules

I have a Penn syntax tree and I would like to recursively get all the rules that this tree contains.

(ROOT 
(S 
   (NP (NN Carnac) (DT the) (NN Magnificent)) 
   (VP (VBD gave) (NP ((DT a) (NN talk))))
)
)

My goal is to get grammar rules like:

ROOT --> S
S --> NP VP
NP --> NN
...

As I said, I need to do this recursively and without the NLTK package, any other modules, or regular expressions. Here's what I have so far. The parameter tree is a Penn tree split on each space.

def extract_rules(tree):
    tree = tree[1:-1]
    print("\n\n")

    if len(tree) == 0:
        return

    root_node = tree[0]
    print("Current Root: "+root_node)

    remaining_tree = tree[1:]
    right_side = []

    temp_tree = list(remaining_tree)
    print("remaining_tree: ", remaining_tree)
    symbol = remaining_tree.pop(0)

    print("Symbol: "+symbol)

    if symbol not in ["(", ")"]:
        print("CASE: No Brackets")
        print("Rule: "+root_node+" --> "+str(symbol))

        right_side.append(symbol)

    elif symbol == "(":
        print("CASE: Opening Bracket")
        print("Temp Tree: ", temp_tree)
        cursubtree_end = bracket_depth(temp_tree)
        print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
        cursubtree_start = temp_tree.index(symbol)

        cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
        print("Subtree: ", cursubtree)

        rnode = extract_rules(cursubtree)
        if rnode:
            right_side.append(rnode)
            print("Rule: "+root_node+" --> "+str(rnode))

    print(right_side)
    return root_node


def bracket_depth(tree):
    counter = 0
    position = 0
    subtree = []

    for i, char in enumerate(tree):
        if char == "(":
            counter = counter + 1
        if char == ")":
            counter = counter - 1

        if counter == 0 and i != 0:
            counter = i
            position = i
            break

    subtree = tree[0:position+1]

    return position

Currently it works for the first subtree of S, but none of the other subtrees are parsed recursively. I would be glad for any help.

My inclination would be to keep it as simple as possible and not try to reinvent the parsing modules that you're currently not allowed to use. Something like:

string = '''
    (ROOT
        (S
            (NP (NN Carnac) (DT the) (NN Magnificent))
            (VP (VBD gave) (NP (DT a) (NN talk)))
        )
    )
'''

def is_symbol_char(character):
    '''
    Predicate to test if a character is valid
    for use in a symbol, extend as needed.
    '''

    return character.isalpha() or character in '-=$!?.'

def tokenize(characters):
    '''
    Process characters into a nested structure.  The original string
    '(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
    '''

    tokens = []

    while characters:
        character = characters.pop(0)

        if character.isspace():
            pass  # nothing to do, ignore it

        elif character == '(':  # signals start of recursive analysis (push)
            characters, result = tokenize(characters)
            tokens.append(result)

        elif character == ')':  # signals end of recursive analysis (pop)
            break

        elif is_symbol_char(character):
            # if it looks like a symbol, collect all
            # subsequent symbol characters
            symbol = ''

            while is_symbol_char(character):
                symbol += character
                character = characters.pop(0)

            # push unused non-symbol character back onto characters
            characters.insert(0, character)

            tokens.append(symbol)

    # Return whatever tokens we collected and any characters left over
    return characters, tokens

def extract_rules(tokens):
    ''' Recursively walk tokenized data extracting rules. '''

    head, *tail = tokens

    print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])

    for token in tail:  # recurse
        if isinstance(token, list):
            extract_rules(token)

characters, tokens = tokenize(list(string))

# After a successful tokenization, all the characters should be consumed
assert not characters, "Didn't consume all the input!"

print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')

extract_rules(tokens[0])

OUTPUT

Tokens:

['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]

Rules:

ROOT --> S
S --> NP VP
NP --> NN DT NN
NN --> Carnac
DT --> the
NN --> Magnificent
VP --> VBD NP
VBD --> gave
NP --> DT NN
DT --> a
NN --> talk
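
If the rules need to be reused rather than only printed, one possible variation is to return them as data. The sketch below assumes the nested-list structure shown under "Tokens:" above; the name collect_rules is hypothetical and not part of the original code.

```python
def collect_rules(node, rules=None):
    """Return (head, children) pairs in depth-first order (hypothetical helper)."""
    if rules is None:
        rules = []

    head, *tail = node

    # The right-hand side is the label of each sublist, or the word itself.
    rhs = tuple(child[0] if isinstance(child, list) else child for child in tail)
    rules.append((head, rhs))

    for child in tail:  # recurse into subtrees only
        if isinstance(child, list):
            collect_rules(child, rules)

    return rules


# Same structure as the tokenizer output shown above.
tokens = ['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']],
                   ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]

rules = collect_rules(tokens)

for head, rhs in rules:
    print(head, '-->', ' '.join(rhs))
```

Keeping each right-hand side as a tuple makes the rules hashable, so they can be counted or de-duplicated later if needed.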

NOTE

I changed your original tree because this clause:

(NP ((DT a) (NN talk)))

seemed incorrect: it produced an empty node in a syntax tree grapher available on the web. So I simplified it to:

(NP (DT a) (NN talk))

Adjust as needed.
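
If the extra parentheses have to stay in the input, one option is to normalize the nested lists before extracting rules. The following is only a sketch under that assumption: splice_unlabeled is a hypothetical helper that merges any group without a label (a list whose first element is itself a list, or an empty list) into its parent.

```python
def splice_unlabeled(node):
    """Return a copy of node with label-less groups merged into their parent."""
    if not isinstance(node, list):
        return node  # a plain symbol or word, nothing to do

    cleaned = []
    for child in node:
        child = splice_unlabeled(child)
        # A group with no label starts with another list (or is empty).
        if isinstance(child, list) and (not child or isinstance(child[0], list)):
            cleaned.extend(child)
        else:
            cleaned.append(child)
    return cleaned


# Nested lists as they would look for the original tree, extra parentheses included.
original = ['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']],
                     ['VP', ['VBD', 'gave'], ['NP', [['DT', 'a'], ['NN', 'talk']]]]]]

normalized = splice_unlabeled(original)
print(normalized)
```

After this step the structure matches the simplified tree, so the rule extraction can run on it unchanged.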

This can be done in a much simpler manner. Since we know the tree is a well-formed nested bracket structure, we can use a parser for nested expressions to parse the text.

There's a package called pyparsing (you can install it with pip install pyparsing if you don't already have it).

from pyparsing import nestedExpr

astring = '''(ROOT 
(S 
   (NP (NN Carnac) (DT the) (NN Magnificent)) 
   (VP (VBD gave) (NP ((DT a) (NN talk))))
)
)'''

expr = nestedExpr('(', ')')
result = expr.parseString(astring).asList()[0]
print(result)

This gives

['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', [['DT', 'a'], ['NN', 'talk']]]]]]

So we've successfully translated our string into a hierarchy of lists. Now we need to write a little code to walk the list and extract the rules.

def get_rules(result, rules):
    for l in result[1:]:
        if isinstance(l, list) and not isinstance(l[0], list):
            rules.add((result[0], l[0]))  
            get_rules(l, rules)

        elif isinstance(l[0], list):
            rules.add((result[0], tuple([x[0] for x in l])))
        else:
            rules.add((result[0], l))

    return rules

As I mentioned, we already know the structure of our grammar, so we only have to take care of a limited number of conditions here.

Call this function like so:

rules = get_rules(result, set())  # result was obtained above

for rule in rules:
    print(rule)

Output:

('ROOT', 'S')
('VP', 'NP')
('DT', 'the')
('NP', 'NN')
('NP', ('DT', 'NN'))
('NP', 'DT')
('S', 'VP')
('VBD', 'gave')
('NN', 'Carnac')
('NN', 'Magnificent')
('S', 'NP')
('VP', 'VBD')

Order this as you need; a set has no guaranteed iteration order, so the lines may print in a different sequence.
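
As one possible way to order the result, the pairs can be sorted by left-hand side and then by right-hand side. This is a sketch, not the only sensible ordering: the helper names as_tuple and sort_key are made up for illustration, and the set literal simply repeats the output listed above. The key function wraps plain strings in a tuple because Python 3 cannot compare a str with a tuple directly.

```python
# The rule pairs from the output above, copied in as a literal set.
rules = {
    ('ROOT', 'S'), ('VP', 'NP'), ('DT', 'the'), ('NP', 'NN'),
    ('NP', ('DT', 'NN')), ('NP', 'DT'), ('S', 'VP'), ('VBD', 'gave'),
    ('NN', 'Carnac'), ('NN', 'Magnificent'), ('S', 'NP'), ('VP', 'VBD'),
}


def as_tuple(rhs):
    """Wrap a single symbol in a tuple so every right-hand side has one type."""
    return rhs if isinstance(rhs, tuple) else (rhs,)


def sort_key(rule):
    lhs, rhs = rule
    return (lhs, as_tuple(rhs))


ordered = sorted(rules, key=sort_key)

for lhs, rhs in ordered:
    print(lhs, '-->', ' '.join(as_tuple(rhs)))
```

Sorting alphabetically loses the order in which rules appear in the tree; if that order matters, collecting the rules in a list instead of a set would preserve it.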
