
scikit-learn classifier loading issue: "Vocabulary wasn't fitted or is empty!" when using transform

I am working through the scikit-learn tutorial at https://www.kaggle.com/c/word2vec-nlp-tutorial.

I deviated slightly from the tutorial by splitting it into two files: one trains the classifier and saves it to a file, and the other loads the classifier and predicts sentiment on a test file. The original program calls transform on the vectorizer, but when I do that in the second file I get the error:

Vocabulary wasn't fitted or is empty! at the line:
test_data_features = vectorizer.transform(clean_test_reviews)

I also have to initialize a new vectorizer object in this file, because the original vectorizer was created in the training file. If I change the line to fit_transform, the program runs and writes out a file with labels as expected. However, I worry that I have made a logic error by learning the vocabulary on the test set and then building the feature array from it. Below is the code that loads the classifier, prepares the test array, predicts, and writes the results to a file. Other answers I have seen say to just load the pickle file and call predict, but I am not sure how to get clean_test_reviews into the right data structure to pass to predict. Any help is appreciated. Thanks!

# Load the classifier

forest = joblib.load(classifier)  # classifier is the name of the saved classifier file, e.g. 'filename.pkl'


# Read the test data
test = pd.read_csv(infile, header=0, delimiter="\t", \
quoting=3 ) #infile is testData.tsv

# Verify that there are 25,000 rows and 2 columns
print "Test shape(Rows, Columns of Data):", test.shape

# Create an empty list and append the clean reviews one by one
num_reviews = len(test["review"])    
clean_test_reviews = []

print "Cleaning and parsing the test set...\n"
for i in xrange(0,num_reviews):
    if( (i+1) % 1000 == 0 ):
        print "Review %d of %d\n" % (i+1, num_reviews)
    clean_review = review_to_words( test["review"][i] )
    clean_test_reviews.append( clean_review )

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)
# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
print "Test data feature shape:", test_data_features.shape

# Take a look at the words in the vocabulary
vocab = vectorizer.get_feature_names()
print vocab

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)

# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )

# Use pandas to write the comma-separated output file
output.to_csv( outfile, index=False, quoting=3 ) # "Bag_of_Words_model.csv",

You are right to worry about fitting your CountVectorizer on the test set. If you call fit() on two different datasets, you get two incompatible vectorizers with different vocabulary dicts, so the feature columns the classifier was trained on no longer line up with the columns it sees at prediction time. Instead, use pickle or joblib to save the fitted vectorizer to a file, just as you are currently saving the classifier.
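As a rough illustration of that advice, here is a minimal Python 3 sketch. The toy review lists, the temporary file path, and the variable names are assumptions made for the example and are not part of the tutorial. It shows that refitting on the test text yields a different vocabulary, and that a pickled vectorizer can be restored and used with transform alone.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the cleaned training and test reviews (hypothetical data).
train_reviews = ["great movie loved it", "terrible movie hated it"]
test_reviews = ["loved the acting", "hated the plot"]

# Fit on the training text only; this is the step that learns the vocabulary.
vectorizer = CountVectorizer(analyzer="word", max_features=5000)
vectorizer.fit(train_reviews)

# A vectorizer fitted on the test text learns a different word-to-column mapping.
refit = CountVectorizer(analyzer="word", max_features=5000).fit(test_reviews)
vocabularies_differ = vectorizer.vocabulary_ != refit.vocabulary_

# Training script: persist the fitted vectorizer next to the classifier.
vectorizer_path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
with open(vectorizer_path, "wb") as f:
    pickle.dump(vectorizer, f)

# Prediction script: restore it and call transform, not fit_transform.
with open(vectorizer_path, "rb") as f:
    loaded_vectorizer = pickle.load(f)

test_data_features = loaded_vectorizer.transform(test_reviews).toarray()
print(vocabularies_differ, test_data_features.shape)
```

The resulting array has one column per word learned from the training text, which is the layout the saved classifier expects in forest.predict.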

Based on David Maust's answer above, I was able to fix it. In the first file, dump the vectorizer's vocabulary like this:

joblib.dump(vectorizer.vocabulary_, dictionary_file_path) #dictionary_file_path is something like "./Vectorizer/vectorizer.pkl"

Note the trailing underscore on the vectorizer.vocabulary_ attribute. In the loading file, load the vocabulary and rebuild the vectorizer like this:

vocabulary_to_load = joblib.load(dictionary_file_path)

loaded_vectorizer = CountVectorizer(vocabulary=vocabulary_to_load)

Now call transform on loaded_vectorizer instead of fit_transform.
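For reference, the vocabulary-only approach can be sketched end to end as follows. This is a minimal Python 3 example under stated assumptions: the toy reviews and the temporary path are hypothetical, and only the vocabulary dict is saved, so any other CountVectorizer settings (such as analyzer or stop_words) have to be repeated by hand when the vectorizer is rebuilt.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the cleaned training and test reviews (hypothetical data).
train_reviews = ["great movie loved it", "terrible movie hated it"]
test_reviews = ["loved the acting", "hated the plot"]

# Training file: fit the vectorizer and dump only its learned vocabulary.
vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_data_features = vectorizer.fit_transform(train_reviews).toarray()

dictionary_file_path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
joblib.dump(vectorizer.vocabulary_, dictionary_file_path)

# Loading file: rebuild a vectorizer around the saved vocabulary.
vocabulary_to_load = joblib.load(dictionary_file_path)
loaded_vectorizer = CountVectorizer(analyzer="word", vocabulary=vocabulary_to_load)

# With a fixed vocabulary, transform works without calling fit first.
test_data_features = loaded_vectorizer.transform(test_reviews).toarray()

# Sanity check: the rebuilt vectorizer reproduces the original column layout.
expected = vectorizer.transform(test_reviews).toarray()
same_as_original = bool((test_data_features == expected).all())
print(test_data_features.shape, same_as_original)
```

Saving the whole fitted vectorizer avoids having to repeat its settings, while saving only the vocabulary keeps the file to a plain dict; either way the test features end up with the same columns as the training features.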
