I am working on a project using NLTK, and am having an issue with grammar generation. I've looked through a few other questions on here, but I didn't see any that line up with mine.
I have explored using NLTK's CFG and PCFG.fromstring(str) with a __ or a ViterbiParser respectively, but I want to be able to send a function a raw string and get back its tokens. I have also tried nltk.pos_tag(nltk.word_tokenize(str)) in the hope that I could manually generate grammar trees from training data. This gets me close, but fails on the string "My dog also likes eating sausage.":
>>> nltk.pos_tag(nltk.word_tokenize("My dog also likes eating sausage."))
[('My', 'PRP$'), ('dog', 'NN'), ('also', 'RB'), ('likes', 'VBZ'),
('eating', 'JJ'), ('sausage', 'NN'), ('.', '.')]
The word eating is tagged as an adjective (JJ), not a verb (VBG).
Does there exist a grammar string/object/module which I can use without having to hard-code a grammar string in NLTK? If not, is there a way in NLTK to generate a trained grammar without doing so manually? I don't want to reinvent the wheel if I don't have to.
I have found a potential solution to the problem, following a rabbit hole I went down after researching what SND's comment suggested. I was not aware of taggers other than the default one. What I ended up doing was training a tagger on a built-in corpus (I will try different corpora later) using a series of taggers, each backing off to the previous one. This is wrapped into a class for portability, and the trained tagger is saved using pickle:
import nltk
from nltk.corpus import brown
from pickle import dump, load

CUTOFF = 2

class MyTagger:
    def __init__(self, path: str = None):
        self.isTrained = False
        self.path = path
        self.tagger = None
        if path:
            try:
                with open(path, 'rb') as src:
                    self.tagger = load(src)
                self.isTrained = True
            except Exception:  # Broad Exception intentional; code not complete
                print("Pickle dump could not be loaded at path:", path)

    def train(self, corpus) -> float:
        sents = corpus.tagged_sents(corpus.fileids())
        # Train on the first 90% of the sentences, test on the last 10%
        split = int(len(sents) * 0.9)
        training_data = sents[:split]
        testing_data = sents[split:]
        # Chain of taggers, each backing off to the previous (simpler) one
        t0 = nltk.DefaultTagger('NN')
        t1 = nltk.UnigramTagger(training_data, cutoff=CUTOFF, backoff=t0)
        t2 = nltk.BigramTagger(training_data, cutoff=CUTOFF, backoff=t1)
        t3 = nltk.TrigramTagger(training_data, cutoff=CUTOFF, backoff=t2)
        self.tagger = t3
        self.isTrained = True
        if self.path:
            self.save(self.path)
        return self.evaluate(testing_data)

    def evaluate(self, corpus: list) -> float:
        return self.tagger.evaluate(corpus)

    def tag(self, text) -> list:
        if not self.tagger:
            raise RuntimeError("No tagger has been trained or loaded")
        if isinstance(text, str):
            return self.tagger.tag(text.split())
        elif isinstance(text, list) and isinstance(text[0], str):
            return self.tagger.tag(text)

    def save(self, path):
        with open(path, 'wb') as out:
            dump(self.tagger, out, -1)
Usage looks something like this:
i = input("What to tag? ")
tagger = MyTagger("brown_2.pickle")
try:
    print("Accuracy:", tagger.train(brown))
except ZeroDivisionError:
    pass
print(tagger.tag(i))
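For intuition, the backoff chain above can be sketched without NLTK at all. This is a toy illustration, not NLTK's implementation: the class names mirror NLTK's, but the lookup table is hand-filled with made-up entries standing in for counts learned from a corpus. A unigram tagger consults its table and defers to the default tagger for unseen words, just as t1 defers to t0 above.

```python
# Toy backoff tagging: a unigram tagger falling back to a default tag.
# The table below is invented for the demo, not learned from data.
class DefaultTagger:
    def __init__(self, tag):
        self.default = tag        # tag assigned to every word

    def tag_word(self, word):
        return self.default

class UnigramTagger:
    def __init__(self, table, backoff):
        self.table = table        # word -> most frequent training tag
        self.backoff = backoff    # consulted when the word is unseen

    def tag_word(self, word):
        return self.table.get(word, self.backoff.tag_word(word))

    def tag(self, words):
        return [(w, self.tag_word(w)) for w in words]

t0 = DefaultTagger("NN")
t1 = UnigramTagger({"My": "PRP$", "likes": "VBZ", "eating": "VBG"}, backoff=t0)
print(t1.tag("My dog likes eating".split()))
# -> [('My', 'PRP$'), ('dog', 'NN'), ('likes', 'VBZ'), ('eating', 'VBG')]
```

Here "dog" is unseen, so it falls through to the default NN; "eating" is in the table, so it gets VBG instead of the JJ that tripped up nltk.pos_tag.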