Parsing a Penn syntax tree to extract its grammar rules

Question

I have a Penn syntax tree and I would like to recursively extract all the rules that it contains.

(ROOT 
(S 
   (NP (NN Carnac) (DT the) (NN Magnificent)) 
   (VP (VBD gave) (NP ((DT a) (NN talk))))
)
)

My goal is to get grammar rules like:

ROOT --> S
S --> NP VP
NP --> NN
...

As I said, I need to do this recursively and without NLTK, any other modules, or regular expressions. Here's what I have so far. The parameter tree is a Penn tree split on each space.

def extract_rules(tree):
    tree = tree[1:-1]
    print("\n\n")

    if len(tree) == 0:
        return

    root_node = tree[0]
    print("Current Root: "+root_node)

    remaining_tree = tree[1:]
    right_side = []

    temp_tree = list(remaining_tree)
    print("remaining_tree: ", remaining_tree)
    symbol = remaining_tree.pop(0)

    print("Symbol: "+symbol)

    if symbol not in ["(", ")"]:
        print("CASE: No Brackets")
        print("Rule: "+root_node+" --> "+str(symbol))

        right_side.append(symbol)

    elif symbol == "(":
        print("CASE: Opening Bracket")
        print("Temp Tree: ", temp_tree)
        cursubtree_end = bracket_depth(temp_tree)
        print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
        cursubtree_start = temp_tree.index(symbol)

        cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
        print("Subtree: ", cursubtree)

        rnode = extract_rules(cursubtree)
        if rnode:
            right_side.append(rnode)
            print("Rule: "+root_node+" --> "+str(rnode))

    print(right_side)
    return root_node


def bracket_depth(tree):
    counter = 0
    position = 0
    subtree = []

    for i, char in enumerate(tree):
        if char == "(":
            counter = counter + 1
        if char == ")":
            counter = counter - 1

        if counter == 0 and i != 0:
            counter = i
            position = i
            break

    subtree = tree[0:position+1]

    return position

Currently it works for the first subtree of S, but none of the other subtrees are parsed recursively. I would be glad for any help.

Answer 1

My inclination would be to keep it as simple as possible and not try to reinvent the parsing modules that you're currently not allowed to use. Something like:

string = '''
    (ROOT
        (S
            (NP (NN Carnac) (DT the) (NN Magnificent))
            (VP (VBD gave) (NP (DT a) (NN talk)))
        )
    )
'''

def is_symbol_char(character):
    '''
    Predicate to test if a character is valid
    for use in a symbol, extend as needed.
    '''

    return character.isalpha() or character in '-=$!?.'

def tokenize(characters):
    '''
    Process characters into a nested structure.  The original string
    '(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
    '''

    tokens = []

    while characters:
        character = characters.pop(0)

        if character.isspace():
            pass  # nothing to do, ignore it

        elif character == '(':  # signals start of recursive analysis (push)
            characters, result = tokenize(characters)
            tokens.append(result)

        elif character == ')':  # signals end of recursive analysis (pop)
            break

        elif is_symbol_char(character):
            # if it looks like a symbol, collect all
            # subsequent symbol characters
            symbol = ''

            while is_symbol_char(character):
                symbol += character
                character = characters.pop(0)

            # push unused non-symbol character back onto characters
            characters.insert(0, character)

            tokens.append(symbol)

    # Return whatever tokens we collected and any characters left over
    return characters, tokens

def extract_rules(tokens):
    ''' Recursively walk tokenized data extracting rules. '''

    head, *tail = tokens

    print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])

    for token in tail:  # recurse
        if isinstance(token, list):
            extract_rules(token)

characters, tokens = tokenize(list(string))

# After a successful tokenization, all the characters should be consumed
assert not characters, "Didn't consume all the input!"

print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')

extract_rules(tokens[0])

OUTPUT

Tokens:

['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]

Rules:

ROOT --> S
S --> NP VP
NP --> NN DT NN
NN --> Carnac
DT --> the
NN --> Magnificent
VP --> VBD NP
VBD --> gave
NP --> DT NN
DT --> a
NN --> talk
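
If the rules are needed as data rather than printed text, the same walk can collect them into a list. The following is only a sketch: the name collect_rules and the (head, children) tuple layout are illustrative choices, not part of the answer above, and the nested list is copied from the Tokens output.

```python
def collect_rules(node, rules=None):
    """Return a list of (head, children) pairs in pre-order."""
    if rules is None:
        rules = []

    head, *tail = node

    # A child is either a nested list (use its label) or a terminal string.
    rhs = tuple(child[0] if isinstance(child, list) else child for child in tail)
    rules.append((head, rhs))

    for child in tail:  # recurse into non-terminal children only
        if isinstance(child, list):
            collect_rules(child, rules)

    return rules


tokens = ['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']],
                   ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]

rules = collect_rules(tokens)

for head, rhs in rules:
    print(head, '-->', *rhs)
```

Keeping the right-hand side as a tuple makes each rule hashable, so duplicates can be dropped with set(rules) or counted with a dictionary if that is useful.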

NOTE

I changed your original tree because this clause:

(NP ((DT a) (NN talk)))

seemed incorrect: it produced an empty node in a syntax tree grapher available on the web. I simplified it to:

(NP (DT a) (NN talk))

Adjust as needed.
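
If the doubled parentheses have to be kept in the input, one possible adjustment is to splice any unlabeled node into its parent after tokenizing. This is a sketch under that assumption; splice_unlabeled is a hypothetical helper, and the nested list below is what the tokenizer above would produce for the original clause.

```python
def splice_unlabeled(node):
    """Return a copy of node in which any child list that has no label
    (its first element is itself a list) is replaced by its own children."""
    if not isinstance(node, list):
        return node  # terminal or label string, nothing to do

    result = []
    for child in node:
        child = splice_unlabeled(child)
        if isinstance(child, list) and child and isinstance(child[0], list):
            result.extend(child)  # hoist the grandchildren into this node
        else:
            result.append(child)
    return result


# Tokenized form of the original clause (NP ((DT a) (NN talk)))
original = ['NP', [['DT', 'a'], ['NN', 'talk']]]
cleaned = splice_unlabeled(original)
print(cleaned)  # ['NP', ['DT', 'a'], ['NN', 'talk']]
```

The sketch does not handle an unlabeled node at the very top of the tree; that case would need its own decision about what the root label should be.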

Answer 2

This can be done in a much simpler manner. Given that we know the structure of our grammar is CNF LR, we can use a recursive regular expression parser to parse the text.

There's a package called pyparsing (you can install it with pip install pyparsing if you don't already have it).

from pyparsing import nestedExpr

astring = '''(ROOT 
(S 
   (NP (NN Carnac) (DT the) (NN Magnificent)) 
   (VP (VBD gave) (NP ((DT a) (NN talk))))
)
)'''

expr = nestedExpr('(', ')')
result = expr.parseString(astring).asList()[0]
print(result)

This gives:

['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', [['DT', 'a'], ['NN', 'talk']]]]]]

So we've successfully translated our string into a hierarchy of lists. Now we need to write a little code to walk the list and extract the rules.

def get_rules(result, rules):
    for l in result[1:]:
        if isinstance(l, list) and not isinstance(l[0], list):
            rules.add((result[0], l[0]))  
            get_rules(l, rules)

        elif isinstance(l[0], list):
            rules.add((result[0], tuple([x[0] for x in l])))
        else:
            rules.add((result[0], l))

    return rules

As I mentioned, we already know the structure of our grammar, so we only have to take care of a limited number of conditions here.

Call this function like so:

rules = get_rules(result, set())  # result was obtained earlier

for i in rules:
    print(i)

Output:

('ROOT', 'S')
('VP', 'NP')
('DT', 'the')
('NP', 'NN')
('NP', ('DT', 'NN'))
('NP', 'DT')
('S', 'VP')
('VBD', 'gave')
('NN', 'Carnac')
('NN', 'Magnificent')
('S', 'NP')
('VP', 'VBD')

Order these as you need.
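
Because the rules live in a set, their order is arbitrary, and in Python 3 a plain sorted(rules) can raise a TypeError here since some right-hand sides are strings and others are tuples. A minimal sketch of one way to order them follows; rule_key is a hypothetical helper, and the set is a hand-copied subset of the output above.

```python
rules = {
    ('ROOT', 'S'), ('S', 'NP'), ('S', 'VP'),
    ('NP', 'NN'), ('NP', 'DT'), ('NP', ('DT', 'NN')),
    ('VP', 'VBD'), ('VP', 'NP'),
}


def rule_key(rule):
    """Sort key: left-hand side first, then the right-hand side as a tuple."""
    head, rhs = rule
    # Normalise the right-hand side so str and tuple values compare cleanly.
    rhs = rhs if isinstance(rhs, tuple) else (rhs,)
    return (head, rhs)


ordered = sorted(rules, key=rule_key)

for head, rhs in ordered:
    print(head, '-->', rhs if isinstance(rhs, str) else ' '.join(rhs))
```

This sorts alphabetically by left-hand side; a different key would be needed to reproduce the top-down order of the tree, which the set does not record.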

Solution 1 (4, accepted): 2017-06-25 09:04:56
Solution 2 (3): 2017-06-22 21:27:11
