
NLTK NaiveBayesClassifier is extremely slow in Python?

I'm using the NLTK NaiveBayesClassifier for sentiment analysis. The whole thing is incredibly slow. I've even tried saving my trained classifier so I don't have to retrain each time, but I notice no difference in speed/time.

To save:

import cPickle
f = open('my_classifier.pickle', 'wb')
cPickle.dump(classifier, f)
f.close()

To load later:

import cPickle
f = open('my_classifier.pickle', 'rb')
classifier = cPickle.load(f)
f.close()

What else can I do to improve the speed? It takes 6 seconds to analyse a sentence; I would like it under 1 second (I'm running this on a website).

*Update: I've now switched to saving/loading with cPickle instead of pickle and the time has dropped to 3 seconds!

NLTK is a teaching toolkit; it's not really optimized for speed. If you want a fast naive Bayes classifier, use the one from scikit-learn. There's a wrapper for it in NLTK (although straight scikit-learn will still be faster).
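For concreteness, here is a minimal sketch of what such a model could look like (the documents, labels, and step names below are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data; substitute your own documents and labels.
docs = ['great movie', 'terrible plot', 'excellent acting', 'poor script']
labels = ['pos', 'neg', 'pos', 'neg']

# CountVectorizer turns raw text into sparse bag-of-words vectors;
# MultinomialNB is scikit-learn's multinomial naive Bayes.
clf = Pipeline([('vect', CountVectorizer()),
                ('nb', MultinomialNB())])
clf.fit(docs, labels)

print clf.predict(['an excellent movie'])  # e.g. ['pos']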

Furthermore, scikit-learn models can be loaded quickly if you use memory mapping. First, train the model and store it with

# Let "clf" be your classifier, usually a Pipeline of CountVectorizer
# and MultinomialNB
from sklearn.externals import joblib
joblib.dump(clf, SOME_PATH, compress=0)  # turn off compression

and load it with

clf = joblib.load(SOME_PATH, mmap_mode='r')

This also allows sharing the model between worker processes cheaply.

If it's still too slow, then make sure you process batches of documents instead of one at a time. That can be orders of magnitude faster.
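As a sketch of the batching point, assuming clf is a fitted Pipeline like the one above:

sentences = ['first review text', 'second review text', 'third review text']

# Slow: one predict() call per document; each call pays the full
# vectorization and dispatch overhead.
labels = [clf.predict([s])[0] for s in sentences]

# Fast: a single vectorized call over the whole batch.
labels = clf.predict(sentences)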

Disclaimer: I wrote much of the naive Bayes in scikit-learn and the NLTK scikit-learn wrapper code.

I guess that the pickle save format just saves the training data, and the model is re-calculated every time you load it.

You shouldn't reload the classifier every time you classify a sentence. Can you write the web service in such a way that it can process more than one request at a time?

I've never used ASP.NET or IIS. From looking around, it seems possible to configure IIS to use FastCGI by installing an extension, and the configuration instructions for it are documented. How to write your Python script so that it is compatible with FastCGI is documented as well.

If you really are pulling in 15,000 features to analyze maybe a dozen words, most of the features won't be used. This suggests using some sort of disk-based database for the features instead, and pulling in only the ones you need. Even for a long sentence and an inefficient database, 4 seeks x 50 words is still way less than what you see now: maybe hundreds of milliseconds in the worst case, but certainly not multiple seconds.

Look at anydbm with an NDBM or GDBM back-end for a start, then maybe consider other back-ends depending on familiarity and availability.
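As a sketch of what building such a store could look like with anydbm (Python 2; the file name and scores below are illustrative):

import anydbm

# dbm values must be strings, so the scores are stored as str
# and parsed back to int on lookup.
scores = {'good': 1, 'bad': -1, 'excellent': 1, 'poor': -1, 'great': 1}

db = anydbm.open('sentiments.db', 'c')  # 'c' creates the file if needed
for word, score in scores.items():
    db[word] = str(score)
db.close()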


Your follow-up comments seem to suggest a basic misunderstanding of what you are doing and/or how things are supposed to work. Let's make a simple example with five words in the lexicon.

# training
import pickle

d = { 'good': 1, 'bad': -1, 'excellent': 1, 'poor': -1, 'great': 1 }
c = classifier(d)  # stand-in for whatever builds your classifier
with open("classifier.pickle", "wb") as f:
    pickle.dump(c, f)


sentences = ['I took a good look', 'Even his bad examples were stunning']

# classifying, stupid version
for sentence in sentences:
    with open("classifier.pickle", "rb") as f:
        c = pickle.load(f)
    sentiment = c(sentence)
    # basically, for word in sentence.split(): if word in d: sentiment += d[word]
    print sentiment, sentence

# classifying, slightly less stupid version
with open("classifier.pickle", "rb") as f:
    c = pickle.load(f)
# FastCGI init_end here
for sentence in sentences:
    sentiment = c(sentence)
    print sentiment, sentence

The stupid version appears to be what you are currently experiencing. The slightly less stupid version loads the classifier once, and then runs it on each of the input sentences. This is what FastCGI will do for you: you can do the loading part in the process start-up once, and then have a service running which runs it on input sentences as they come in. This is resource-efficient but a bit of work, because converting your script to FastCGI and setting up the server infrastructure is a hassle. If you expect heavy use, it's definitely the way to go.
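As an illustration of that load-once pattern, here is a minimal WSGI sketch (FastCGI servers can host WSGI applications; the callable classifier c is the toy one from above):

import cPickle

# Runs once, at process start-up.
with open('classifier.pickle', 'rb') as f:
    c = cPickle.load(f)

def application(environ, start_response):
    sentence = environ.get('QUERY_STRING', '')  # toy input handling
    sentiment = c(sentence)                     # model already in memory
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['%s %s' % (sentiment, sentence)]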

But observe that only two features out of the five in the model are actually ever needed. Most of the words in the sentences do not have a sentiment score, and most of the words in the sentiments database are not required to calculate a score for these inputs. So a database implementation would instead look something like (rough pseudocode for the DBM part)

with opendbm("sentiments.db") as d:
    for sentence in sentences:
        sentiment = 0
        for word in sentence.split():
            try:
                sentiment += d[word]
            except KeyError:
                pass
        print sentiment, sentence

The cost per transaction is higher, so it is less optimal than the FastCGI version, which only loads the whole model into memory at start-up; but it does not require you to keep state or set up the FastCGI infrastructure, and it is a lot more efficient than the stupid version which loads the entire model for each sentence.

(In reality, for a web service without FastCGI, you would effectively have the opendbm inside the for instead of the other way around.)
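For reference, here is a concrete Python 2 rendering of that pseudocode, reading from the anydbm store sketched earlier (file name illustrative):

import anydbm

sentences = ['I took a good look', 'Even his bad examples were stunning']

d = anydbm.open('sentiments.db', 'r')  # read-only; built once, up front
for sentence in sentences:
    sentiment = 0
    for word in sentence.split():
        try:
            sentiment += int(d[word])  # dbm values are strings
        except KeyError:
            pass                       # word carries no sentiment score
    print sentiment, sentence
d.close()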
