How to match integers in NLTK CFG?

Question

If I want to define a grammar in which one of the tokens matches an integer, how can I achieve that using NLTK's string CFG?

For example:

S -> SK SO FK
SK -> 'SELECT'
SO -> '\d+'
FK -> 'FROM'

Answer 1

Add a number category (NUM) to the grammar and use it in a noun phrase, like this:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '10'
""")

sent = 'I shot 3 elephants'.split()
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)

[out]:

(S (NP I) (VP (V shot) (NP (NUM 3) (N elephants))))

But note that this only handles the numbers listed explicitly in the NUM rule (0 to 10 here). To accept any integer, try collapsing every integer into a single placeholder token, e.g. '#NUM#', before parsing:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")

sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in sent]

parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)

[out]:

(S (NP I) (VP (V shot) (NP (NUM #NUM#) (N elephants))))
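One caveat with the masking step: str.isdigit() also returns True for some non-ASCII characters such as superscript digits, which int() cannot convert. The sketch below is a stricter variant that uses a regular expression on plain ASCII digits. The helper name mask_integers and the INTEGER pattern are made up for illustration and are not part of NLTK.

```python
import re

# Assumption: an "integer" here means one or more ASCII digits, no sign.
INTEGER = re.compile(r"[0-9]+")

def mask_integers(tokens, placeholder="#NUM#"):
    """Return (masked_tokens, originals) with integers replaced by the placeholder."""
    masked, originals = [], []
    for tok in tokens:
        if INTEGER.fullmatch(tok):
            masked.append(placeholder)
            originals.append(tok)  # remember the real value, in sentence order
        else:
            masked.append(tok)
    return masked, originals

masked, originals = mask_integers("I shot 333 elephants".split())
print(masked)     # ['I', 'shot', '#NUM#', 'elephants']
print(originals)  # ['333']
```

The masked list can be passed to parser.parse() exactly like sent above, and originals keeps the values needed to restore the numbers afterwards.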

To put the original numbers back, keep them in a list and replace the placeholders in the tree string one at a time, in order:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")

original_sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in original_sent]
numbers = [i for i in original_sent if i.isdigit()]

parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    treestr = str(tree)
    for n in numbers:
        treestr = treestr.replace('#NUM#', n, 1)
    print(treestr)

[out]:

(S (NP I) (VP (V shot) (NP (NUM 333) (N elephants))))
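The string replacement above produces a string rather than a Tree. If you need the Tree object itself, one option is to write the original tokens back into the leaves. Here is a minimal sketch applied to the grammar from the question, assuming that SO is rewritten to the placeholder; the function name parse_with_integers is hypothetical.

```python
import nltk

# The question's grammar, with SO matching the placeholder instead of '\d+'.
sql_grammar = nltk.CFG.fromstring("""
S -> SK SO FK
SK -> 'SELECT'
SO -> '#NUM#'
FK -> 'FROM'
""")

def parse_with_integers(grammar, tokens, placeholder="#NUM#"):
    """Parse tokens with integers masked, then restore them in each tree."""
    masked = [placeholder if t.isdigit() else t for t in tokens]
    parser = nltk.ChartParser(grammar)
    trees = []
    for tree in parser.parse(masked):
        tree = tree.copy(deep=True)  # do not mutate subtrees shared by the chart
        # Leaves come back left to right, so they line up with the tokens.
        for pos, original in zip(tree.treepositions("leaves"), tokens):
            if tree[pos] == placeholder:
                tree[pos] = original
        trees.append(tree)
    return trees

for tree in parse_with_integers(sql_grammar, "SELECT 42 FROM".split()):
    print(tree)
```

Because the leaves are matched by position, this also works when the sentence contains several different integers.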

Answer 2

A simple solution is to define a function that builds a parser for a given sentence and grammar. It handles integers by augmenting the grammar on each call with productions for the integers that occur in the sentence. Note that sent must be a list of tokens, and the base grammar has to refer to the nonterminal INT somewhere for the added productions to be used. Here is an example function:

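import nltk

def name_parser(G, sent):
    # Collect the integer tokens that occur in this sentence.
    ints = [i for i in sent if i.isdigit()]
    # Copy the base productions and add one INT -> '<integer>' rule per integer.
    lproductions = list(G.productions())
    lproductions.extend([nltk.grammar.Production(nltk.grammar.Nonterminal('INT'), [i]) for i in ints])
    # Build the augmented grammar and parse the sentence with it.
    lgrammar = nltk.grammar.CFG(G.start(), lproductions)
    parser = nltk.ChartParser(lgrammar)
    for tree in parser.parse(sent):
        print(tree)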
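As a usage sketch, here is the same idea applied to the grammar from the question. It assumes the base grammar routes SO through an INT nonterminal that has no productions of its own until the sentence is known. The function name parse_with_int_productions is hypothetical; it returns the trees instead of printing them and adds each distinct integer only once.

```python
import nltk
from nltk.grammar import CFG, Nonterminal, Production

# Assumption: the base grammar mentions INT but defines no rule for it.
base_grammar = CFG.fromstring("""
S -> SK SO FK
SK -> 'SELECT'
SO -> INT
FK -> 'FROM'
""")

def parse_with_int_productions(grammar, tokens):
    """Add INT -> '<n>' for every integer token, then parse the tokens."""
    ints = sorted({t for t in tokens if t.isdigit()})  # distinct integers only
    productions = list(grammar.productions())
    productions.extend(Production(Nonterminal("INT"), [n]) for n in ints)
    augmented = CFG(grammar.start(), productions)
    return list(nltk.ChartParser(augmented).parse(tokens))

for tree in parse_with_int_productions(base_grammar, "SELECT 42 FROM".split()):
    print(tree)
```

Compared with the placeholder approach, the integers stay in the tree as they are, at the cost of building a new grammar and parser for every sentence.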

2 answers

solution1
1 ACCEPTED 2015-02-07 11:24:26

solution2
0
