简体   繁体   中英

How to extract derivation rules from a bracketed parse tree?

I have a lot of parse trees like this:

( S ( NP-SBJ ( PRP I  )  )  ( INODE@S ( VP ( VBP have  )  ( NP ( DT a  )  ( INODE@NP ( NN savings  )  ( NN account  )  )  )  )  ( . .  )  )  )

for a sentence like this: "I have a savings account ."

I need to extract all derivation rules from these trees. The derivation rules like:

S -> NP-SBJ INODE@S
NP-SBJ -> PRP 
PRP -> I
INODE@S -> VP NP
and so on.

Is there any prepared code (preferably in java) or pseudo code for this purpose?

Edit:

I think this problem is very general and common in many areas. The simplified problem is to find each parent and it's children from a parenthesis tree.

Step 1: Parse the bracketed string to create an AST

You may not have thought about it this way, but the string itself is defined by a context-free grammar:

Node :== '(' String Node* ')' |
         '(' String String ')'

Our first step is to use a recursive descent parser for this grammar to generate an abstract syntax tree defined by the following class:

class Node {
  string component;
  List<Node> children = new ArrayList<Node>();
  string word;
}

First, tokenize the bracketed string and put the tokens in a queue. I think that string.split("\\\\s+") should work, since all of the parentheses and strings are separated by spaces.

Node parse(Queue<string> tokens) throws ParseException {
  Node n = new Node();
  if (!tokens.remove().equals("(")) {
    throw new ParseException();
  }
  n.component = tokens.remove()
  if (n.component.equals("(") || n.component.equals(")")) {
    throw new ParseException();
  }
  if (tokens.element().equals("(")) {
    while (tokens.element().equals("(")) {
      Node child = parse(tokens);
      n.childen.add(child);
    }
  } else if (!tokens.element().equals(")")) {
    n.word = tokens.remove();
  } else {
    // we weren't expecting a close-paren yet
    throw new ParseException();
  }
  if (!tokens.remove.equals(")")) {
    throw new ParseException();
  }
  return n;
}

Step 2: Traverse the AST to construct the rules.

This step is performed using the pseudocode that Ira Baxter posted.

For each interior node N do:
    Use N's children C1, C2, ... Ck to generate a rule  "N = C1 C2 .. Ck". 
Eliminate duplicate rules.

For the purposes of this algorithm, an interior node is one where word == null , or where children is not empty. The "For each interior node N" step can be performed by either preorder or postorder traversal of the tree.

So let's define a rule class.

class Rule {
  string left;
  List<String> right = new ArrayList();

  // define the equals and hashcode methods appropriately.
  // We'll need them because we're inserting this class into
  // a HashSet.
}

Let's define a generic tree traversal function

interface Visitor {
  void handle(Node n);
}

void traverse(Node n, Visitor v) {
  v.handle(n);
  for (Node child: n.children) {
    traverse(child, v);
  }
}

And let's define the visitor that constructs and deduplicates rules

class RuleBuilder implements Visitor {
  Set<Rule> rules = new HashSet<Rule>;

  public void handle(Node n) {
    if (n.word != null) {
      return;
    }
    Rule r = new Rule();
    r.left = n.component;
    for (Node child: n.children) {
      r.right.add(child.component);
    }
    rules.add(r);
  }
}

Tying it together

Queue<string> tokens = new LinkedList(Arrays.asList(string.split("\\s+")));
Node ast = parse(tokens);
RuleBuilder ruleCollector = new RuleBuilder();
traverse(ast, ruleCollector)

The rules you need are in ruleCollector.rules

I wrote this for python. I believe you can read it as a pseudocode. I will edit the post for Java later . I added the Java implementation later.

import re

# grammar repository 
grammar_repo = []

s = "( S ( NP-SBJ ( PRP I  )  )  ( INODE@S ( VP ( VBP have  )  ( NP ( DT a  )  ( INODE@NP ( NN savings  )  ( NN account  )  )  )  )  ( . . )  )  )"
# clean the string (make sure there is no double space in it)
s = s.replace("   ", " ").replace("  ", " ")

# find inner parenthesis (no-need-to-parse-grammar) or (lhs rhs) format
simple_grammars = re.findall("\([^\(\)]*\)", s)
# repeat until you cannot find any ( lhs rhs ) grammar
while len(simple_grammars) > 0:
    # add all simple grammar into grammar repository
    # replace them with their head
    for simple_grammar in simple_grammars:
        grammar = simple_grammar.split(" ")
        # '(' == grammar[0] and ')' == grammar[-1]
        lhs = grammar[1]
        rhs = grammar[2:-1]
        grammar_repo.append((lhs, rhs))

        s = s.replace(simple_grammar, lhs)

    simple_grammars = re.findall("\([^\(\)]*\)", s)

In short, start from simplest grammar you can find and replace them with their left-hand-side and continue. eg find (PRP I) save it, then replace it with PRP . repeat until you find all the syntax.

Update: The Java implementation is a bit different but it's the same idea. The full code is here: http://ideone.com/0eE8bd

PrintStream ps = new PrintStream(System.out);
ArrayList grammarRepo = new ArrayList();
String item, grammar;
String str = "( S ( NP-SBJ ( PRP I  )  )  ( INODE@S ( VP ( VBP have  )  ( NP ( DT a  )  ( INODE@NP ( NN savings  )  ( NN account  )  )  )  )  ( . . )  )  )";
// cleanup double spaces
while (str.contains("  ")){
    str = str.replaceAll("  ", " ");
}
// find no parenthesis zones!
Matcher m = Pattern.compile("\\([^\\(\\)]*\\)").matcher(str);

// loop until nothing left:
while (m.find()) {
    item = m.group();
    // make the grammar:
    grammar = item.substring(1, item.length()-1).trim().replaceFirst(" ", " -> ");

    if (!grammarRepo.contains(grammar)) {
        grammarRepo.add(grammar);
        ps.print(grammar + "\n");
    }

    str = str.replace(item, grammar.split(" -> ")[0]);
    m = Pattern.compile("\\([^\\(\\)]*\\)").matcher(str);
}

the output:

PRP -> I
NP-SBJ -> PRP
VBP -> have
DT -> a
NN -> savings
NN -> account
INODE@NP -> NN NN
NP -> DT INODE@NP
VP -> VBP NP
. -> .
INODE@S -> VP .
S -> NP-SBJ INODE@S

Many parsers build ASTs that don't match the exact shape of the grammar.

In these cases, it should be clear that you cannot regenerate the original grammar from the AST. As a consequence, this is not problem with a widely avialable solution.

In the rare case where AST exactly matches the grammar, you can do better. (You can do this with a nonmatching AST to get a rough approximation of the grammar):

For each interior node N do:
    Use N's children C1, C2, ... Ck to generate a rule  "N = C1 C2 .. Ck". 
Eliminate duplicate rules.

This provides a subset of the grammar, that covers the instance tree you have. If a grammar rule was not used by the parser on this instance, it won't appear in the set. You might need to run lots of instance trees through this process to get good coverage on the rules, but you can't be sure at any point that you have the complete set.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM