
Dynamically Created Grammar in NLTK

I am working on a project using NLTK and have run into an issue with grammar generation. I've looked through a few related questions here, but none of them line up with mine.

I have explored using NLTK's CFG and PCFG.fromstring(str) with a __ or a ViterbiParser respectively, but I want to be able to send a function a raw string and get back its tagged tokens. I have also tried nltk.pos_tag(nltk.word_tokenize(str)) in the hope that I could manually generate grammar trees from training data. That gets me close, but it fails on the string "My dog also likes eating sausage.":

>>> nltk.pos_tag(nltk.word_tokenize("My dog also likes eating sausage."))
[('My', 'PRP$'), ('dog', 'NN'), ('also', 'RB'), ('likes', 'VBZ'), 
('eating', 'JJ'), ('sausage', 'NN'), ('.', '.')]

The word eating is tagged as an adjective (JJ), not as a verb (VBG).
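For contrast, this is what hard-coding a grammar string looks like with PCFG.fromstring and ViterbiParser (a toy grammar invented for illustration, not one from the question):

```python
import nltk

# A hand-written toy PCFG -- exactly the kind of hard-coded grammar
# string the question hopes to avoid. The probabilities for each
# nonterminal's right-hand sides must sum to 1.0.
grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [1.0]
    VP -> V NP [1.0]
    Det -> 'the' [1.0]
    N -> 'dog' [0.5] | 'sausage' [0.5]
    V -> 'likes' [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the dog likes the sausage".split()):
    print(tree)
```

Any sentence containing a word outside the grammar's terminals simply fails to parse, which is why a hand-written grammar does not scale to raw input strings.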

Does there exist a grammar string/object/module which I can use without having to hard-code a grammar string in NLTK? If not, is there a way in NLTK to generate a trained grammar without doing so manually? I don't want to reinvent the wheel if I don't have to.

I have found a potential solution, after going down a rabbit hole researching what SND's comment suggested. I was not aware of taggers other than the default one. What I ended up doing was training a tagger on a built-in corpus (I will try different corpora later) using a series of taggers, each backing off to the previous one. This is wrapped in a class for later portability, and the result is saved using pickle:

import nltk
from nltk.corpus import brown
from pickle import dump, load

CUTOFF = 2


class MyTagger:

    def __init__(self, path: str = None):
        self.isTrained = False
        self.path = path
        self.tagger = None  # set by train() or loaded from the pickle below
        if path:
            try:
                with open(path, 'rb') as src:
                    self.tagger = load(src)
            except Exception:  # broad Exception intentional; code not complete
                print("Pickle dump could not be loaded at path:", path)

    def train(self, corpus) -> float:
        sents = corpus.tagged_sents(corpus.fileids())
        split = int(len(sents) * 0.9)
        training_data = sents[:split]  # first 90% for training
        testing_data = sents[split:]   # held-out 10% for evaluation
        t0 = nltk.DefaultTagger('NN')
        t1 = nltk.UnigramTagger(training_data, cutoff=CUTOFF, backoff=t0)
        t2 = nltk.BigramTagger(training_data, cutoff=CUTOFF, backoff=t1)
        t3 = nltk.TrigramTagger(training_data, cutoff=CUTOFF, backoff=t2)
        self.tagger = t3
        self.isTrained = True
        if self.path:
            self.save(self.path)
        return self.evaluate(testing_data)

    def evaluate(self, corpus: list) -> float:
        return self.tagger.evaluate(corpus)

    def tag(self, text) -> list:
        if not self.tagger:
            raise RuntimeError("No tagger available: call train() or load a saved pickle first")
        if isinstance(text, str):
            return self.tagger.tag(text.split())
        elif isinstance(text, list) and isinstance(text[0], str):
            return self.tagger.tag(text)

    def save(self, path):
        with open(path, 'wb') as out:
            dump(self.tagger, out, -1)

Usage looks something like this:

i = input("What to tag?  ")
tagger = MyTagger("brown_2.pickle")
try:
    print("Accuracy:", tagger.train(brown))
except ZeroDivisionError:
    pass
print(tagger.tag(i))
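The backoff chain itself can be seen in isolation on a tiny hand-tagged corpus (both sentences invented for illustration), with no corpus download needed:

```python
import nltk

# Two made-up tagged sentences stand in for a real corpus.
train = [
    [('the', 'DT'), ('dog', 'NN'), ('likes', 'VBZ'),
     ('eating', 'VBG'), ('sausage', 'NN')],
    [('the', 'DT'), ('cat', 'NN'), ('likes', 'VBZ'),
     ('sleeping', 'VBG')],
]

# Each tagger falls back to the previous one for unseen contexts.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)

print(t2.tag("the dog likes sleeping".split()))
# 'sleeping' comes out VBG here because the toy training data says
# so; words never seen in training fall through to the 'NN' default.
```

This is the same DefaultTagger → UnigramTagger → BigramTagger pattern as in the class above, minus the TrigramTagger, the cutoff, and the pickling.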
