简体   繁体   中英

How to match integers in NLTK CFG?

If I want to define a grammar in which one of the tokens will match an integer, how can i achieve it using nltk's string CFG?

For example -

S -> SK SO FK
SK -> 'SELECT'
SO -> '\d+'
FK -> 'FROM'

Create a number phrase as such:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '10'
""")

sent = 'I shot 3 elephants'.split()
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)

[out]:

(S (NP I) (VP (V shot) (NP (NUM 3) (N elephants))))

But note that that can only handle single digit number. So let's try compressing integers into a single token-type, eg '#NUM#':

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")

sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in sent]

parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)

[out]:

(S (NP I) (VP (V shot) (NP (NUM #NUM#) (N elephants))))

To put the numbers back, try:

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I' | NUM N
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas' | 'elephants'
V -> 'shot'
P -> 'in'
NUM -> '#NUM#'
""")

original_sent = 'I shot 333 elephants'.split()
sent = ['#NUM#' if i.isdigit() else i for i in original_sent]
numbers = [i for i in original_sent if i.isdigit()]

parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    treestr = str(tree)
    for n in numbers:
        treestr = treestr.replace('#NUM#', n, 1)
    print(treestr)

[out]:

(S (NP I) (VP (V shot) (NP (NUM 333) (N elephants))))

A simple solution is to define a function which creates a parser given the sentence and grammar. This works for the integer problem by augmenting the grammar for each function call to include productions for the integers in the sentence. Here is an example function:

def name_parser(G,sent):
    ints = [i for i in sent if i.isdigit()]
    lproductions = list(G.productions())
    lproduction.extend([nltk.grammar.Production(nltk.grammar.Nonterminal('INT'),[i]) for i in ints])
    lgrammar = nltk.grammar.CFG(G.start(),lproductions)
    parser = nltk.ChartParser(lgrammar)
    for tree in parser.parse(sent):
        print(tree)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM