
How to use NLTK to generate sentences from an induced grammar?

I have a (large) list of parsed sentences (which were parsed using the Stanford parser), for example, the sentence "Now you can be entertained" has the following tree:

(ROOT
  (S
    (ADVP (RB Now))
    (, ,)
    (NP (PRP you))
    (VP (MD can)
      (VP (VB be)
        (VP (VBN entertained))))
    (. .)))

I am using this set of parse trees to induce a grammar with nltk:

import nltk

allProductions = []
# ... for each sentence tree t, add its productions to allProductions
allProductions += t.productions()

# Induce the grammar
S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, allProductions)
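For reference, here is the same induction step run end-to-end on the example tree above, as a self-contained snippet:

```python
import nltk
from nltk import Tree

# Parse tree for the example sentence, copied from above
t = Tree.fromstring("""
(ROOT
  (S
    (ADVP (RB Now))
    (, ,)
    (NP (PRP you))
    (VP (MD can)
      (VP (VB be)
        (VP (VBN entertained))))
    (. .)))
""")

allProductions = []
allProductions += t.productions()

S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, allProductions)
print(grammar)
```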

Now I would like to use grammar to generate new, random sentences. My hope is that since the grammar was learned from a specific set of input examples, the generated sentences will be semantically similar. Can I do this in nltk?

If I can't use nltk to do this, do any other tools exist that can take the (possibly reformatted) grammar and generate sentences?

In NLTK 2.0 you can use nltk.parse.generate to generate all possible sentences for a given grammar.
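For example, with a small toy grammar (the rules here are made up for illustration), generate() enumerates every derivation:

```python
from nltk import CFG
from nltk.parse.generate import generate

# Toy grammar for illustration only
toy = CFG.fromstring("""
S -> NP VP
NP -> 'you' | 'we'
VP -> 'can' 'be' 'entertained'
""")

for sentence in generate(toy, n=10):
    print(' '.join(sentence))
```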

This code defines a function which should generate a single sentence based on the production rules in a (P)CFG.

# This example uses choice to choose from possible expansions
from random import choice
from nltk.grammar import Nonterminal

# This function is based on _generate_all() in nltk.parse.generate,
# but samples one expansion at random instead of enumerating them all.
def generate_sample(grammar, items=None):
    if items is None:
        items = [grammar.start()]
    frags = []
    if len(items) == 1:
        if isinstance(items[0], Nonterminal):
            # This is where we need to make our changes:
            # pick a single production for this nonterminal at random
            prod = choice(grammar.productions(lhs=items[0]))
            frags.extend(generate_sample(grammar, prod.rhs()))
        else:
            frags.append(items[0])
    else:
        # Expand each symbol of the right-hand side in turn
        for item in items:
            frags.extend(generate_sample(grammar, [item]))
    return frags

To make use of the weights in your PCFG, you'll obviously want to use a better sampling method than choice(), which implicitly assumes all expansions of the current node are equiprobable.
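One straightforward replacement is random.choices, which accepts per-item weights. Here is a sketch, using a made-up two-rule PCFG for illustration:

```python
import random
from nltk import PCFG

# Made-up probabilistic grammar for illustration
pg = PCFG.fromstring("""
S -> 'a' S [0.4] | 'a' [0.6]
""")

def sample_production(grammar, lhs):
    # Draw one production for lhs, weighted by its probability
    prods = grammar.productions(lhs=lhs)
    weights = [p.prob() for p in prods]
    return random.choices(prods, weights=weights, k=1)[0]

random.seed(1)
print(sample_production(pg, pg.start()))
```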

First of all, if you generate random sentences, they may be syntactically correct, but they will probably be meaningless.

(It sounds a bit like what those MIT students did with their SCIgen program, which auto-generates scientific papers. Very interesting, btw.)

Anyway, I never did it myself, but it seems possible with nltk.bigrams; you may want to have a look at the section Generating Random Text with Bigrams.

You can also generate all subtrees of a current tree; I'm not sure if that is what you want either.

With an nltk Text object you can call generate() on it, which will "print random text, generated using a trigram language model." http://nltk.org/_modules/nltk/text.html

My solution to generate a random sentence from an existing nltk.CFG grammar:

import random

def generate_sample(grammar, prod, frags):
    if prod in grammar._lhs_index:
        # Nonterminal: pick one of its productions at random and expand it
        derivations = grammar._lhs_index[prod]
        derivation = random.choice(derivations)
        for d in derivation.rhs():
            generate_sample(grammar, d, frags)
    elif prod in grammar._rhs_index:
        # Terminal: append it to the output
        frags.append(str(prod))

And now it can be used:

frags = []
generate_sample(grammar, grammar.start(), frags)
print(' '.join(frags))

Inspired by the above, here's one which uses iteration instead of recursion.

import random

def rewrite_at(index, replacements, the_list):
    del the_list[index]
    the_list[index:index] = replacements

def generate_sentence(grammar):
    sentence_list = [grammar.start()]
    all_terminals = False
    while not all_terminals:
        all_terminals = True
        for position, symbol in enumerate(sentence_list):
            if symbol in grammar._lhs_index:
                all_terminals = False
                derivations = grammar._lhs_index[symbol]
                derivation = random.choice(derivations) # or weighted_choice(derivations) if you have a function for that
                rewrite_at(position, derivation.rhs(), sentence_list)
    return sentence_list
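The slice-assignment trick in rewrite_at replaces one symbol with a whole right-hand side in place; repeated here as a standalone check:

```python
def rewrite_at(index, replacements, the_list):
    # Replace the single element at `index` with the sequence `replacements`
    del the_list[index]
    the_list[index:index] = list(replacements)

symbols = ['NP', 'VP', '.']
rewrite_at(1, ['MD', 'VB'], symbols)
print(symbols)  # ['NP', 'MD', 'VB', '.']
```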

Or if you want the tree of the derivation, use this one:

import random

from nltk.tree import Tree

def tree_from_production(production):
    return Tree(production.lhs(), production.rhs())

def leaf_positions(the_tree):
    return [the_tree.leaf_treeposition(i) for i in range(len(the_tree.leaves()))]

def generate_tree(grammar):
    initial_derivations = grammar._lhs_index[grammar.start()]
    initial_derivation = random.choice(initial_derivations)  # or weighted_choice(initial_derivations) if you have that function
    running_tree = tree_from_production(initial_derivation)
    all_terminals = False
    while not all_terminals:
        all_terminals = True
        for position in leaf_positions(running_tree):
            node_label = running_tree[position]
            if node_label in grammar._lhs_index:
                all_terminals = False
                derivations = grammar._lhs_index[node_label]
                derivation = random.choice(derivations)  # or weighted_choice(derivations) if you have that function
                running_tree[position] = tree_from_production(derivation)
    return running_tree
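generate_tree relies on Tree.leaf_treeposition to address the frontier nodes of the running tree; a small example of what those positions look like:

```python
from nltk.tree import Tree

t = Tree('S', [Tree('NP', ['you']), Tree('VP', ['can', 'sleep'])])
positions = [t.leaf_treeposition(i) for i in range(len(t.leaves()))]
print(positions)            # [(0, 0), (1, 0), (1, 1)]
print([t[p] for p in positions])  # ['you', 'can', 'sleep']
```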

Here's a weighted_choice function for NLTK PCFG production rules to use with the above, adapted from Ned Batchelder's answer on weighted choice functions in general:

def weighted_choice(productions):
    prods_with_probs = [(prod, prod.prob()) for prod in productions]
    total = sum(prob for prod, prob in prods_with_probs)
    r = random.uniform(0, total)
    upto = 0
    for prod, prob in prods_with_probs:
        if upto + prob >= r:
            return prod
        upto += prob
    assert False, "Shouldn't get here"
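A quick sanity check (weighted_choice is repeated so the snippet runs standalone, with a made-up two-rule grammar): sampling many times should roughly reproduce the rule probabilities.

```python
import random
from nltk import PCFG

# Made-up grammar: 'a' with probability 0.8, 'b' with 0.2
pg = PCFG.fromstring("""
S -> 'a' [0.8] | 'b' [0.2]
""")

def weighted_choice(productions):
    total = sum(prod.prob() for prod in productions)
    r = random.uniform(0, total)
    upto = 0
    for prod in productions:
        if upto + prod.prob() >= r:
            return prod
        upto += prod.prob()
    assert False, "Shouldn't get here"

random.seed(0)
prods = pg.productions(lhs=pg.start())
hits = sum(weighted_choice(prods).rhs() == ('a',) for _ in range(10000))
print(hits / 10000)  # roughly 0.8
```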
