scikit-learn classifier loading problem: "Vocabulary wasn't fitted or is empty!" when using transform

Question

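I am working through the scikit-learn tutorial at https://www.kaggle.com/c/word2vec-nlp-tutorial.

I deviated slightly from the tutorial by splitting it into two files: one trains the classifier and saves it to a file, and the other loads the classifier and predicts on the test file. The original program calls transform on the vectorizer, but I get this error: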

Vocabulary wasn't fitted or is empty! at the line :
test_data_features = vectorizer.transform(clean_test_reviews)

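I also need to initialize the vectorizer object in this file, because the vectorizer lived in the previous file. If I change the line to fit_transform, the program runs and prints out the file with labels as expected. I am worried, though, that this introduces a logic error by learning the vocab on the test set and then fitting the array. Below is the code that loads the classifier, prepares the test array, predicts the results, and writes them to a file. Other answers I have seen just load the pickle file and try to predict, but I am not sure how to convert clean_test_reviews into the correct data structure and then pass it to predict. Any help is appreciated. Thank you!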

##load the classifier

forest = joblib.load(classifier)  # classifier is the path to the saved classifier, e.g. 'filename.pkl'


# Read the test data
test = pd.read_csv(infile, header=0, delimiter="\t", \
quoting=3 ) #infile is testData.tsv

# Verify that there are 25,000 rows and 2 columns
print "Test shape(Rows, Columns of Data):", test.shape

# Create an empty list and append the clean reviews one by one
num_reviews = len(test["review"])    
clean_test_reviews = []

print "Cleaning and parsing the test set...\n"
for i in xrange(0,num_reviews):
    if( (i+1) % 1000 == 0 ):
        print "Review %d of %d\n" % (i+1, num_reviews)
    clean_review = review_to_words( test["review"][i] )
    clean_test_reviews.append( clean_review )

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  


vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
print "Test data feature shape:", test_data_features.shape

# Take a look at the words in the vocabulary
vocab = vectorizer.get_feature_names()
print vocab

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)

# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )

# Use pandas to write the comma-separated output file
output.to_csv( outfile, index=False, quoting=3 ) # "Bag_of_Words_model.csv",

Answer 1

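You are right to be concerned about fitting the CountVectorizer on the test set. With CountVectorizer, if you call fit() on two different data sets, you get two incompatible vectorizers with different vocabularies. Instead, you should use pickle or joblib to save the vectorizer to a file alongside the classifier, which you are already saving.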
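As a rough illustration of this advice, here is a minimal sketch in Python 3 (the question's code is Python 2). The toy reviews, labels, file names, and temporary directory are all hypothetical stand-ins, and it assumes the standalone joblib package is installed. It saves the fitted vectorizer next to the classifier, reloads both, and calls only transform on the test data.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the cleaned training and test reviews.
train_reviews = ["great movie loved it", "terrible plot bad acting",
                 "loved the acting", "bad movie"]
train_labels = [1, 0, 1, 0]
test_reviews = ["loved it", "bad plot and unseen words"]

# --- training script: fit on the training data only, then save both objects ---
vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_features = vectorizer.fit_transform(train_reviews)
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(train_features, train_labels)

model_dir = tempfile.mkdtemp()  # hypothetical location for the saved files
vectorizer_path = os.path.join(model_dir, "vectorizer.pkl")
forest_path = os.path.join(model_dir, "forest.pkl")
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(forest, forest_path)

# --- prediction script: load both objects and transform without refitting ---
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_forest = joblib.load(forest_path)
test_features = loaded_vectorizer.transform(test_reviews)
predictions = loaded_forest.predict(test_features)

# For contrast: a vectorizer refitted on the test set learns a different vocabulary,
# so its columns would not line up with what the forest was trained on.
refit_vectorizer = CountVectorizer(analyzer="word", max_features=5000).fit(test_reviews)
vocabularies_match = refit_vectorizer.vocabulary_ == vectorizer.vocabulary_

print(test_features.shape, predictions, vocabularies_match)
```

Words in the test set that never appeared in training are simply ignored by transform, so the test matrix keeps the same number of columns as the training matrix.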

Answer 2

Following David Maust's answer above, I was able to fix it. In the first file, dump the vectorizer's vocabulary like this:

joblib.dump(vectorizer.vocabulary_, dictionary_file_path) #dictionary_file_path is something like "./Vectorizer/vectorizer.pkl"

Note the trailing underscore on the vectorizer.vocabulary_ attribute. In the loading file, load the vectorizer like this:

vocabulary_to_load =joblib.load(dictionary_file_path)

loaded_vectorizer = CountVectorizer(vocabulary=vocabulary_to_load)

...and now use the loaded vectorizer to transform the test data.
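The round trip described in this answer can be sketched end to end as follows, in Python 3. The toy reviews and the temporary file path are hypothetical, and the standalone joblib package is assumed. Only the vocabulary dictionary is saved, so any non-default constructor settings (analyzer, stop_words, and so on) would need to be repeated when rebuilding the vectorizer.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the cleaned reviews.
train_reviews = ["great movie loved it", "terrible plot bad acting"]
test_reviews = ["loved the acting", "bad bad movie"]

# First file: fit on the training reviews and dump only the learned vocabulary.
vectorizer = CountVectorizer(analyzer="word", max_features=5000)
vectorizer.fit(train_reviews)
dictionary_file_path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
joblib.dump(vectorizer.vocabulary_, dictionary_file_path)

# Loading file: rebuild a vectorizer around the saved vocabulary.
vocabulary_to_load = joblib.load(dictionary_file_path)
loaded_vectorizer = CountVectorizer(analyzer="word", vocabulary=vocabulary_to_load)

# With a fixed vocabulary, transform works without calling fit first.
test_data_features = loaded_vectorizer.transform(test_reviews).toarray()

# Sanity check: the rebuilt vectorizer yields the same columns as the original.
expected = vectorizer.transform(test_reviews).toarray()
print((test_data_features == expected).all())
```

Pickling the whole fitted vectorizer, as Answer 1 suggests, avoids having to repeat the constructor settings; saving just the vocabulary keeps the file to a plain dictionary.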

Answer 1
2 2016-02-04 17:04:40

Answer 2
2 2016-02-05 16:09:03
