How can I test my model using a different dataset in machine learning?
I'm new to machine learning and am building a small project using a CountVectorizer-based model. I split my data 80%/20%: 80% for training the model and 20% for testing it. The model works properly on the 20% test data, but can I also test it on a different dataset that is similar to the training dataset?
I am using joblib to dump and load my model.
from joblib import dump, load
dump(pipe, filename)  # save the fitted model to disk
loaded_model = load(filename)  # load it back from the same path
My question is: how can I directly test my model using a different dataset?
Yes, you can use the model to test similar datasets.
However, you must apply the same preprocessing steps that the model was trained with.
When you trained your model, it was fitted on input of a particular dimension, say an A×B matrix (A documents, B vocabulary features). When you have a new test sentence or a new dataset, you must first apply the same preprocessing; otherwise, prediction will fail with a dimension mismatch error.
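To illustrate the dimension point, here is a minimal sketch using made-up sentences (train_texts and new_texts are illustrative and not from the question). Calling transform on the already fitted vectorizer keeps the same number of columns as in training, while fitting a fresh vectorizer on the new data learns a different vocabulary and so, in general, a different width.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data (assumed for this sketch)
train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]
new_texts = ["the bird sang", "a dog sat"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_texts)  # learns the vocabulary from training text
X_new = cv.transform(new_texts)          # reuses that vocabulary; unseen words are ignored

# Fitting a second vectorizer on the new texts learns a different vocabulary,
# so its columns no longer line up with what the model was trained on.
X_wrong = CountVectorizer().fit_transform(new_texts)

print("train:", X_train.shape)
print("new (transform):", X_new.shape)
print("new (refit, mismatched):", X_wrong.shape)
```

Only X_new is safe to pass to a model trained on X_train.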
Example:
Suppose you have the following CountVectorizer object:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
Then you must first fit it on your training dataset, for example:
X = dataframe['text_column_name']
X = cv.fit_transform(X) # Fit the Data
Once this is done, whenever you have a new sentence, say
test_sentence = "this is a test sentence"
then you must use the cv object in the following manner:
model_input = cv.transform([test_sentence]).toarray()
and then you can make predictions:
model.predict(model_input)
This method must be followed even if the new dataset you wish to test is in a data frame or some other file format: call cv.transform (not fit_transform) on the new text before predicting.
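For the joblib setup in the question, the whole flow might look like the sketch below. It assumes that pipe is a scikit-learn Pipeline that contains the CountVectorizer together with a classifier, in which case the loaded pipeline applies the fitted vectorizer itself and can be given raw text directly. The toy sentences, the labels, the choice of LogisticRegression, and the temporary file path are all assumptions made for illustration.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Illustrative training data (assumed for this sketch)
train_texts = ["good movie", "great film", "bad movie", "terrible film"]
train_labels = ["pos", "pos", "neg", "neg"]

# Vectorizer and classifier bundled together, so preprocessing travels with the model
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(train_texts, train_labels)

# Save and reload, as in the question (the path here is a temporary file)
filename = os.path.join(tempfile.mkdtemp(), "model.joblib")
dump(pipe, filename)
loaded_model = load(filename)

# A different labeled dataset with the same kind of text
new_texts = ["great movie", "terrible movie", "good film"]
new_labels = ["pos", "neg", "pos"]

predictions = loaded_model.predict(new_texts)  # raw text in, labels out
accuracy = accuracy_score(new_labels, predictions)
print("accuracy on the new dataset:", accuracy)
```

If the saved object is only the classifier and not a pipeline, the fitted CountVectorizer would have to be saved and loaded as well, and its transform called on the new text before predict. A column of a data frame can be passed in place of the new_texts list in the same way.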