
How to incrementally train an NLTK classifier

I am working on a project to classify snippets of text using the Python NLTK module and its NaiveBayesClassifier. I am able to train on corpus data and classify another set of data, but I would like to feed additional training information into the classifier after the initial training.

If I'm not mistaken, there doesn't appear to be a way to do this, since the NaiveBayesClassifier.train method takes a complete set of training data. Is there a way to add to the training data without feeding in the original featureset again?

I'm open to suggestions including other classifiers that can accept new training data over time.

There are two options that I know of:

1) Periodically retrain the classifier on the combined data. You'd accumulate new training data in a corpus that already contains the original training data, then every few hours retrain and reload the classifier. This is probably the simplest solution.

2) Externalize the internal model, then update it manually. The NaiveBayesClassifier can be created directly by giving it a label_probdist and a feature_probdist. You could create these separately, pass them in to a NaiveBayesClassifier, then update them whenever new data comes in. The classifier would use this new data immediately. You'd have to look at the train method for details on how to update the probability distributions.
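To illustrate why the second option is feasible, here is a minimal standalone sketch that does not use NLTK at all. It keeps the raw label and feature counts that a naive Bayes model is built from, so new samples only increment counters and the original featuresets never need to be replayed. The class name IncrementalNaiveBayes, the alpha smoothing parameter, and the word_features helper are all hypothetical names chosen for this example, and the smoothing is simplified, so probabilities will not match NLTK's exactly.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over dict featuresets that keeps raw counts so it can be updated."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha                          # additive smoothing constant (assumption)
        self.label_counts = Counter()               # label -> number of samples seen
        self.feature_counts = defaultdict(Counter)  # (label, fname) -> value -> count
        self.feature_values = defaultdict(set)      # fname -> set of values seen

    def update(self, labeled_featuresets):
        """Add (featureset, label) pairs by incrementing the stored counts."""
        for featureset, label in labeled_featuresets:
            self.label_counts[label] += 1
            for fname, fval in featureset.items():
                self.feature_counts[label, fname][fval] += 1
                self.feature_values[fname].add(fval)

    def _log_prob(self, label, featureset):
        n_label = self.label_counts[label]
        total = sum(self.label_counts.values())
        logp = math.log(n_label / total)            # log P(label)
        for fname, fval in featureset.items():
            if fname not in self.feature_values:
                continue                            # ignore features never seen in training
            counts = self.feature_counts.get((label, fname), Counter())
            bins = len(self.feature_values[fname]) + 1  # extra bin for "feature absent"
            logp += math.log((counts[fval] + self.alpha)
                             / (n_label + self.alpha * bins))
        return logp

    def classify(self, featureset):
        """Return the label with the highest smoothed log-probability."""
        return max(self.label_counts, key=lambda lbl: self._log_prob(lbl, featureset))


def word_features(text):
    # Hypothetical feature extractor: one boolean feature per word.
    return {word: True for word in text.split()}


clf = IncrementalNaiveBayes()
clf.update([(word_features('training test totally tubular'), 't')])
clf.update([(word_features('super speeding special sport'), 's')])  # added later

print(clf.classify(word_features('tubular test')))   # t
print(clf.classify(word_features('super special')))  # s
```

With NLTK itself, the same idea would presumably mean keeping the frequency counts around and rebuilding label_probdist and feature_probdist from them after each update, as the train method does; that mapping is an assumption and should be checked against the NLTK source.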

I'm just learning NLTK, so please correct me if I'm wrong. This is using the Python 3 branch of NLTK, which might be incompatible.

There is an update() method on the NaiveBayesClassifier instance from TextBlob (textblob.classifiers, which wraps NLTK's classifier), and it appears to add to the training data:

from textblob.classifiers import NaiveBayesClassifier

train = [
    ('training test totally tubular', 't'),
]

cl = NaiveBayesClassifier(train)
cl.update([('super speeding special sport', 's')])

print('t', cl.classify('tubular test'))
print('s', cl.classify('super special'))

This prints out:

t t
s s

As Jacob said, the second method is the right way, and hopefully someone will write the code for it.

See:

https://baali.wordpress.com/2012/01/25/incrementally-training-nltk-classifier/
