Keep a model made with TF-IDF for predicting new content using scikit-learn for Python

This is a sentiment analysis model that uses TF-IDF for feature extraction. I want to know how I can save this model and reuse it. I tried saving it the way shown below, but when I load it, apply the same pre-processing to the test text, and call fit_transform on it, I get an error saying the model expected X features but got Y.

This is how I saved it:

import joblib

filename = "model.joblib"
joblib.dump(model, filename)

And this is the code for my TF-IDF model:

import pandas as pd
import re
import nltk
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
from nltk.corpus import stopwords

processed_text = ['List of pre-processed text'] 
y = ['List of labels']
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(processed_text).toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

text_classifier = BernoulliNB()
text_classifier.fit(X_train, y_train)

predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

Edit: to show exactly where each line goes. After:

tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))

add:

import joblib

tfidf_obj = tfidfconverter.fit(processed_text)  # this fitted vectorizer is what will be used again
joblib.dump(tfidf_obj, 'tf-idf.joblib')

Then do the rest of the steps. You also need to save the classifier after training, so after:

text_classifier.fit(X_train, y_train)

put joblib.dump(text_classifier, "classifier.joblib"). Now, when you want to predict any text, load both saved objects:

tf_idf_converter = joblib.load("tf-idf.joblib")
classifier = joblib.load("classifier.joblib")

Now, given a list of sentences to predict:

sent = []  # fill with the pre-processed sentences to classify
classifier.predict(tf_idf_converter.transform(sent))

Print the result to get the list of predicted sentiments, one per sentence.
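The steps above can be put together as a minimal round-trip sketch. The toy sentences, labels, and the temporary directory are made up for illustration, and the vectorizer uses default settings rather than the max_features, min_df, and max_df values from the question, because min_df=5 would not work on a four-document corpus.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy data (hypothetical), standing in for processed_text and y.
train_texts = [
    "good movie great acting",
    "great film really good",
    "bad movie terrible acting",
    "terrible film really bad",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Fit the vectorizer once, on the training text only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

classifier = BernoulliNB()
classifier.fit(X_train, train_labels)

# Save both fitted objects; the classifier alone is not enough.
save_dir = tempfile.mkdtemp()
vectorizer_path = os.path.join(save_dir, "tf-idf.joblib")
classifier_path = os.path.join(save_dir, "classifier.joblib")
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(classifier, classifier_path)

# Later: load both and call transform, not fit or fit_transform.
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_classifier = joblib.load(classifier_path)

sent = ["really great movie", "terrible acting"]
new_features = loaded_vectorizer.transform(sent)
predictions = loaded_classifier.predict(new_features)
print(list(predictions))
```

Because the loaded vectorizer keeps the vocabulary learned during training, new_features has the same number of columns as X_train, which is what the classifier expects.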

You can first fit the TF-IDF vectorizer to your training set using:

tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
tfidf_obj = tfidfconverter.fit(processed_text)

Then store tfidf_obj, for instance using pickle or joblib, e.g.:

joblib.dump(tfidf_obj, filename)

Then load the saved tfidf_obj and apply only transform to your test set:

loaded_tfidf = joblib.load(filename)
test_new = loaded_tfidf.transform(X_test)
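To see why transform matters here, the following small sketch compares the number of feature columns produced by transform and by a fresh fit_transform on new text. The two toy corpora are invented for illustration and the vectorizer uses default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical), standing in for the training and new text.
train_texts = ["good movie great acting", "bad movie terrible acting"]
new_texts = ["great plot", "boring plot bad ending"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# transform reuses the training vocabulary, so the column count matches.
X_new_ok = vectorizer.transform(new_texts)

# fit_transform on the new text learns a new vocabulary, so the columns differ.
X_new_wrong = TfidfVectorizer().fit_transform(new_texts)

print(X_train.shape[1], X_new_ok.shape[1], X_new_wrong.shape[1])
```

The third number differs from the first two, which is the situation behind the "expected X features but got Y" error: a classifier trained on X_train cannot accept a matrix built from a different vocabulary.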
