
Chomsky-normal-form grammar extraction from a parse tree

I am trying to extract the Chomsky Normal Form (CNF) grammar productions of a sentence from its parse tree:

(ROOT
  (S
    (NP (DT the) (NNS kids))
    (VP (VBD opened)
      (NP (DT the) (NN box))
      (PP (IN on)
        (NP (DT the) (NN floor)))))) 

I put the whole tree into a string named S and then:

from nltk import Tree

tree = Tree.fromstring(S)
tree.chomsky_normal_form()  # converts the tree in place
for p in tree.productions():
    print(p)

The output is

(1) NN -> 'box'
(2) PP -> IN NP
(3) DT -> 'the'
(4) ROOT -> S
(5) NP -> DT NN
(6) VBD -> 'opened'
(7) VP|<NP-PP> -> NP PP
(8) VP -> VBD VP|<NP-PP>
(9) NP -> DT NNS
(10) NN -> 'floor'
(11) IN -> 'on'
(12) NNS -> 'kids'
(13) S -> NP VP

But some of the productions (numbers 7 and 8) don't seem to be in CNF! What is the problem?

VP|<NP-PP> is a single nonterminal symbol. The vertical bar does not denote alternatives, as it would in traditional grammar notation. Rather, NLTK uses it to record where the symbol came from, i.e. "this new nonterminal was created under VP to cover the children NP and PP." NLTK introduced this symbol, and the rule that expands it, while converting your tree to Chomsky Normal Form.
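
To see where such a name can come from, here is a minimal pure-Python sketch of right-factored binarization that imitates the naming pattern visible in the output above. It is not NLTK's actual code; the function name binarize and the tuple representation of rules are made up for illustration.

```python
def binarize(lhs, rhs):
    """Split lhs -> rhs into rules with at most two symbols on the right.

    Hypothetical helper: mimics the 'PARENT|<SIBLING-SIBLING>' naming
    seen in the output, but is not NLTK's implementation.
    """
    rules = []
    head = lhs
    rest = list(rhs)
    while len(rest) > 2:
        first, rest = rest[0], rest[1:]
        # The new symbol records the original parent and the remaining siblings.
        new_head = "{}|<{}>".format(lhs, "-".join(rest))
        rules.append((head, (first, new_head)))
        head = new_head
    rules.append((head, tuple(rest)))
    return rules


for left, right in binarize("VP", ["VBD", "NP", "PP"]):
    print(left, "->", " ".join(right))
```

For VP -> VBD NP PP this prints the two rules from the question, and the name VP|<NP-PP> is just a label built from the parent and the leftover children.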

Take a look at the productions of the tree, pre-CNF:

ROOT -> S
S -> NP VP
NP -> DT NNS
DT -> 'the'
NNS -> 'kids'
VP -> VBD NP PP ***
VBD -> 'opened'
NP -> DT NN
DT -> 'the'
NN -> 'box'
PP -> IN NP
IN -> 'on'
NP -> DT NN
DT -> 'the'
NN -> 'floor'

Specifically, look at the rule VP -> VBD NP PP (marked *** above), which is NOT in CNF: the RHS of a CNF production must be either exactly two nonterminal symbols or a single terminal, and this rule has three nonterminals.
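
As a rough illustration of that condition (two nonterminals or one terminal on the RHS), the check below works on rules written as plain strings. The helper names and the convention that terminals are the quoted symbols are assumptions of this sketch, not part of NLTK.

```python
# Terminals are written with quotes, as in the listing above; everything
# else is treated as a nonterminal. This convention is an assumption of
# the sketch, not something NLTK requires.
def is_terminal(symbol):
    return symbol.startswith("'") and symbol.endswith("'")


def is_cnf_rule(rule):
    """Return True if 'LHS -> RHS' has two nonterminals or one terminal on the right."""
    _, rhs = rule.split(" -> ")
    symbols = rhs.split()
    if len(symbols) == 2:
        return not any(is_terminal(s) for s in symbols)
    return len(symbols) == 1 and is_terminal(symbols[0])


for rule in ["S -> NP VP", "DT -> 'the'", "VP -> VBD NP PP", "VP -> VBD VP|<NP-PP>"]:
    print(rule, "=>", is_cnf_rule(rule))
```

Only VP -> VBD NP PP fails the check; the binarized replacement VP -> VBD VP|<NP-PP> passes because VP|<NP-PP> counts as one symbol.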

Taken together, the two rules (7): VP|<NP-PP> -> NP PP and (8): VP -> VBD VP|<NP-PP> in your question are equivalent to the original rule VP -> VBD NP PP .

Expanding VP with rule (8) gives:

VBD VP|<NP-PP>

VP|<NP-PP> is in turn the LHS of the newly created rule (7), so expanding it gives:

VBD NP PP
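
The same two-step expansion can be written out in a few lines of plain Python. The rules dictionary and the expand_helpers function are illustrative names only, and treating any symbol containing "|<" as a helper symbol is an assumption based on the naming seen in the output.

```python
# The two rules from the question, keyed by left-hand side.
rules = {
    "VP": ["VBD", "VP|<NP-PP>"],
    "VP|<NP-PP>": ["NP", "PP"],
}


def expand_helpers(symbols):
    """Replace helper symbols (those containing '|<') by their right-hand sides."""
    result = []
    for symbol in symbols:
        if "|<" in symbol and symbol in rules:
            result.extend(expand_helpers(rules[symbol]))
        else:
            result.append(symbol)
    return result


print(expand_helpers(rules["VP"]))
```

The result is the RHS of the original three-child rule, VBD NP PP.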

If you isolate that rule, you can inspect its LHS and confirm that it is indeed a single symbol:

>>> tree.chomsky_normal_form()
>>> prod = tree.productions()
>>> x = prod[7]  # VP|<NP-PP> -> NP PP
>>> x.lhs().symbol()  # a single symbol
'VP|<NP-PP>'
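
If you would rather check every production than a single one, something along these lines should work. It is a sketch that assumes NLTK is installed and that S holds the tree string from the question; the variable name helper is made up here.

```python
from nltk import Tree
from nltk.grammar import Nonterminal

S = """(ROOT
  (S
    (NP (DT the) (NNS kids))
    (VP (VBD opened)
      (NP (DT the) (NN box))
      (PP (IN on)
        (NP (DT the) (NN floor))))))"""

tree = Tree.fromstring(S)
tree.chomsky_normal_form()  # modifies the tree in place

for p in tree.productions():
    # Every left-hand side is one Nonterminal object, whatever its name looks like.
    assert isinstance(p.lhs(), Nonterminal)
    # After binarization no right-hand side has more than two symbols.
    assert len(p.rhs()) <= 2

# Productions whose LHS is a symbol introduced by the conversion.
helper = [p for p in tree.productions() if "|" in p.lhs().symbol()]
print(helper)
```

Filtering on "|" in the symbol name relies on the naming seen in the output above, so it is a convenience for this tree rather than a general test.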
