NLTK fcfg grammar parser out of index

I'm new to NLTK. I'm trying to transform "show me the movies" into a simple SQL SELECT statement, "SELECT title FROM films".

I believe the sentence is (VP + NP) with VP(V + PRO) and NP(DET + N). However, I have no doubt the .fcfg grammar I'm setting up is incorrect: I'm getting the following IndexError on the line "answer = trees[0].label()['SEM']", because trees is empty.

How can I correct the .fcfg?

IndexError: list index out of range

Process finished with exit code 1

Grammar (parser3.fcfg)

% start S
S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
VP[SEM=(?v + ?pro)] -> V[SEM=?v] PRO[SEM=?pro]
NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
Det[SEM=''] -> 'the'
PRO[SEM=''] -> 'me'
N[SEM='title FROM films'] -> 'movies'
V[SEM='SELECT'] -> 'show'

Python code

from nltk import load_parser
cp = load_parser('parser3.fcfg')
query = 'show me the movies'
trees = list(cp.parse(query.split()))
print(trees)
answer = trees[0].label()['SEM']
answer = [s for s in answer if s]
q = ' '.join(answer)
print(q)

To debug the grammar, start small and grow your rules.

Let's start with an underspecified VP and feature-structured V and N:

from nltk import grammar, parse

g = """
VP -> V N
V[SEM='SELECT'] -> 'show'
N[SEM='title FROM films'] -> 'movies'
"""

my_grammar = grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(my_grammar)
trees = parser.parse('show movies'.split())
print(list(trees))

[out]:

[Tree(VP[], [Tree(V[SEM='SELECT'], ['show']), Tree(N[SEM='title FROM films'], ['movies'])])]

Now, let's add the determiner.

g = """
VP -> V NP
NP[SEM=(?det + ?n)] -> DT[SEM=?det] N[SEM=?n]
DT[SEM=''] -> 'the'
V[SEM='SELECT'] -> 'show'
N[SEM='title FROM films'] -> 'movies'
"""

my_grammar = grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(my_grammar)
trees = parser.parse('show the movies'.split())
print(list(trees))

[out]:

[Tree(VP[], [Tree(V[SEM='SELECT'], ['show']), Tree(NP[SEM=(, title FROM films)], [Tree(DT[SEM=''], ['the']), Tree(N[SEM='title FROM films'], ['movies'])])])]

Then we add the pronoun.

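We want to parse the sentence "show me the movies" as

S[ VP[show me] NP[the movies] ]

so we have to change our TOP rule to S -> VP NP. The rule in the question, S -> NP VP, expects the noun phrase first, so it cannot match a sentence that begins with the verb.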

g = """
S -> VP NP
VP[SEM=(?v + ?pro)] -> V[SEM=?v] PRO[SEM=?pro]
NP[SEM=(?det + ?n)] -> DT[SEM=?det] N[SEM=?n]
V[SEM='SELECT'] -> 'show'
PRO[SEM=''] -> 'me'
DT[SEM=''] -> 'the'
N[SEM='title FROM films'] -> 'movies'
"""

my_grammar = grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(my_grammar)
trees = parser.parse('show me the movies'.split())
print(list(trees))

[out]:

[Tree(S[], [Tree(VP[SEM=(SELECT, )], [Tree(V[SEM='SELECT'], ['show']), Tree(PRO[SEM=''], ['me'])]), Tree(NP[SEM=(, title FROM films)], [Tree(DT[SEM=''], ['the']), Tree(N[SEM='title FROM films'], ['movies'])])])]
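
Rule order matters here because a context-free rule matches its right-hand side strictly left to right. A minimal sketch without NLTK, using hypothetical helper names (expand, matches) and hand-assigned word categories, that flattens each candidate S rule into a category sequence and compares it with the sentence. It only covers this small, non-recursive grammar and is not a parser:

```python
# Hand-assigned categories for the example sentence (mirrors the lexical rules above).
LEXICON = {'show': 'V', 'me': 'PRO', 'the': 'DT', 'movies': 'N'}
# Phrase rules from the grammar, with the features left out.
PHRASES = {'VP': ['V', 'PRO'], 'NP': ['DT', 'N']}


def expand(rhs):
    """Flatten a list of phrase names into the word categories they cover."""
    flat = []
    for symbol in rhs:
        flat.extend(PHRASES.get(symbol, [symbol]))
    return flat


def matches(rhs, sentence):
    """True if the rule's expanded right-hand side equals the sentence's categories."""
    return expand(rhs) == [LEXICON[w] for w in sentence.split()]


sentence = 'show me the movies'
print(matches(['NP', 'VP'], sentence))  # rule order used in the question
print(matches(['VP', 'NP'], sentence))  # rule order used in this answer
```

Only the VP-before-NP order lines up with "show me the movies", which is why the TOP rule is written as S -> VP NP from here on.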

Here comes the mystery

Our TOP rule is underspecified for now, but if we specify SEM on both the left-hand side (LHS) and the right-hand side (RHS), the sentence no longer parses:

g = """
S[SEM=(?vp + WHERE + ?np)] -> VP[SEM=?vp] NP[SEM=?np]

VP[SEM=(?v + ?pro)] -> V[SEM=?v] PRO[SEM=?pro]
NP[SEM=(?det + ?n)] -> DT[SEM=?det] N[SEM=?n]

V[SEM='SELECT'] -> 'show'
PRO[SEM=''] -> 'me'
DT[SEM=''] -> 'the'
N[SEM='title FROM films'] -> 'movies'
"""

my_grammar = grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(my_grammar)
trees = parser.parse('show me the movies'.split())
print(list(trees))

Even if we remove the WHERE semantics, it still doesn't parse:

g = """
S[SEM=(?vp + ?np)] -> VP[SEM=?vp] NP[SEM=?np]

VP[SEM=(?v + ?pro)] -> V[SEM=?v] PRO[SEM=?pro]
NP[SEM=(?det + ?n)] -> DT[SEM=?det] N[SEM=?n]

V[SEM='SELECT'] -> 'show'
PRO[SEM=''] -> 'me'
DT[SEM=''] -> 'the'
N[SEM='title FROM films'] -> 'movies'
"""

my_grammar = grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(my_grammar)
trees = parser.parse('show me the movies'.split())
print(list(trees))

[out]:

[]
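
An empty list like this is what produces the IndexError in the question: trees[0] is evaluated on a list with no elements. A minimal sketch of a guard, using a hypothetical helper named first_or_none (not part of NLTK) and an empty list standing in for the parser output, that makes the "no parse" case explicit instead of crashing:

```python
def first_or_none(items):
    """Return the first element of an iterable, or None if it is empty."""
    # next() with a default avoids the IndexError that items[0] raises on [].
    return next(iter(items), None)


# Stand-in for list(parser.parse(...)) when the grammar rejects the sentence.
trees = []

best = first_or_none(trees)
if best is None:
    message = "no parse found: check the grammar rules"
else:
    message = "parsed"
print(message)
```

Checking for None before reading the SEM feature turns a grammar problem into a readable message rather than a traceback.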

But if we specify only the RHS, it parses:

g = """
S -> VP[SEM=?vp] NP[SEM=?np]

VP[SEM=(?v + ?pro)] -> V[SEM=?v] PRO[SEM=?pro]
NP[SEM=(?det + ?n)] -> DT[SEM=?det] N[SEM=?n]

V[SEM='SELECT'] -> 'show'
PRO[SEM=''] -> 'me'
DT[SEM=''] -> 'the'
N[SEM='title FROM films'] -> 'movies'
"""

my_grammar = grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(my_grammar)
trees = parser.parse('show me the movies'.split())
print(list(trees))

[out]:

[Tree(S[], [Tree(VP[SEM=(SELECT, )], [Tree(V[SEM='SELECT'], ['show']), Tree(PRO[SEM=''], ['me'])]), Tree(NP[SEM=(, title FROM films)], [Tree(DT[SEM=''], ['the']), Tree(N[SEM='title FROM films'], ['movies'])])])]

It also parses when we specify features only on the LHS, though the output shows the variables in S[SEM=...] left unbound:

g = """
S[SEM=(?vp + WHERE + ?np)] -> VP NP

VP[SEM=(?v + ?pro)] -> V[SEM=?v] PRO[SEM=?pro]
NP[SEM=(?det + ?n)] -> DT[SEM=?det] N[SEM=?n]

V[SEM='SELECT'] -> 'show'
PRO[SEM=''] -> 'me'
DT[SEM=''] -> 'the'
N[SEM='title FROM films'] -> 'movies'
"""

my_grammar = grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(my_grammar)
trees = parser.parse('show me the movies'.split())
print(list(trees))

[out]:

[Tree(S[SEM=(?vp+WHERE+?np)], [Tree(VP[SEM=(SELECT, )], [Tree(V[SEM='SELECT'], ['show']), Tree(PRO[SEM=''], ['me'])]), Tree(NP[SEM=(, title FROM films)], [Tree(DT[SEM=''], ['the']), Tree(N[SEM='title FROM films'], ['movies'])])])]

Let's recap

We can specify features on the non-terminals, as we have done for NP and VP, so what makes the TOP rule (i.e. S -> VP NP) special?

What if we hack the grammar and add a unary branch above the feature-bearing rule?

g = """
S -> SP
SP[SEM=(?vp + WHERE + ?np)] -> VP[SEM=?vp] NP[SEM=?np]

VP[SEM=(?v + ?pro)] -> V[SEM=?v] PRO[SEM=?pro]
NP[SEM=(?det + ?n)] -> DT[SEM=?det] N[SEM=?n]

V[SEM='SELECT'] -> 'show'
PRO[SEM=''] -> 'me'
DT[SEM=''] -> 'the'
N[SEM='title FROM films'] -> 'movies'
"""

my_grammar = grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(my_grammar)
trees = parser.parse('show me the movies'.split())
print(list(trees))

It parsed!

[out]:

[Tree(S[], [Tree(SP[SEM=(SELECT, , WHERE, , title FROM films)], [Tree(VP[SEM=(SELECT, )], [Tree(V[SEM='SELECT'], ['show']), Tree(PRO[SEM=''], ['me'])]), Tree(NP[SEM=(, title FROM films)], [Tree(DT[SEM=''], ['the']), Tree(N[SEM='title FROM films'], ['movies'])])])])]
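
The SEM value on SP can then be flattened into a query string the same way the question's code does, by dropping empty strings and joining the rest. A minimal sketch, with a plain tuple standing in for the SEM value printed above (an assumption; the real value comes from the SP node's label) and a hypothetical helper named sem_to_query:

```python
# Stand-in for the SEM value of SP printed above (assumed to iterate as strings).
sem = ('SELECT', '', 'WHERE', '', 'title FROM films')


def sem_to_query(parts):
    """Drop empty fragments and join the rest with single spaces."""
    return ' '.join(s for s in parts if s)


# Joined exactly as the grammar produced it.
print(sem_to_query(sem))
# The same fragments with the WHERE piece filtered out.
print(sem_to_query(s for s in sem if s != 'WHERE'))
```

With this grammar the joined string contains a stray WHERE, so the rule that builds the sentence semantics would need to drop it, for example with SEM=(?vp + ?np), to reach the "SELECT title FROM films" target from the question.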

Someone should raise this as a question or issue on the NLTK GitHub repo. It might be a special feature that protects the TOP rule, or it might be a bug =)
