How to test my trained model with a different dataset in machine learning

Question

Hello, I am very new to Python and machine learning, and I am running into an issue. After splitting my data and finishing training and testing my model, I now need to test it on a completely different dataset.

Below is how I trained and tested the model:

import sklearn.naive_bayes
# Using a Naive Bayes classifier model
nb_model = sklearn.naive_bayes.MultinomialNB()
nb_model.fit(X_train_v, y_train)
# Predicted class and class probabilities for each test sample
y_pred_class = nb_model.predict(X_test_v)
y_pred_probs = nb_model.predict_proba(X_test_v)

What would I need to adjust in order to run a new dataset through the trained model?

Thank you for your time and your help!

Answer 1

Specifically and functionally speaking, your new dataset should have the same number of features as the training data.

If x_train.shape gives (752, 8), then you know it has 752 samples and 8 features.

After your model has been trained on it, model.n_features_in_ will give you 8 (this attribute is available in scikit-learn 0.24 and later).
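As a minimal sketch of that check, assuming scikit-learn 0.24 or later (where fitted estimators expose `n_features_in_`): the random arrays below are hypothetical stand-ins for the real training data, and the three class labels are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the training data: 752 samples, 8 count-like features
rng = np.random.default_rng(0)
x_train = rng.integers(0, 100, size=(752, 8))
y_train = rng.integers(0, 3, size=752)  # three made-up class labels

model = MultinomialNB()
model.fit(x_train, y_train)

print(x_train.shape)         # (752, 8)
print(model.n_features_in_)  # 8
```

Comparing `new_data.shape[1]` against `model.n_features_in_` before calling `predict` is a cheap way to catch a mismatched dataset early.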

Your model is now able to predict outputs from data with 8 features:

import numpy as np
# 10 randomly generated samples with 8 features
new_dataset_1 = np.random.randint(0, 100, size=(10, 8))
new_pred_1 = model.predict(new_dataset_1)
# > array([47, 15,  2, 81, 99, 63, 53, 55, 24, 47])
new_pred_1.shape
# > (10, )  # One predicted class per sample

If you try to predict from data that has any other number of features, it will fail:

# 10 randomly generated samples with 9 features
new_dataset_2 = np.random.randint(0, 100, size=(10, 9))
new_pred_2 = model.predict(new_dataset_2)
# > ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
# with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 8 is different from 9)

In other cases, there might be ways to get the same number of features, but it all depends on the hypothesis, on the kind of data, or on the model being tested.
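One common case where this works is text classification. The `_v` suffix on `X_train_v` in the question suggests the features came from a vectorizer, though that is an assumption, as is the choice of `CountVectorizer` here. Under that assumption, a sketch of the usual approach is to fit the vectorizer once on the training text and then call `transform()` (not `fit_transform()`) on the new dataset, so both matrices share the same columns. The toy texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the original training text
train_texts = ["free prize now", "meeting at noon",
               "win a free prize", "lunch meeting today"]
train_labels = ["spam", "ham", "spam", "ham"]

# Fit the vectorizer on the training text only; this fixes the vocabulary
vectorizer = CountVectorizer()
X_train_v = vectorizer.fit_transform(train_texts)

nb_model = MultinomialNB()
nb_model.fit(X_train_v, train_labels)

# Hypothetical new dataset: reuse the fitted vectorizer with transform(),
# so the new matrix has the same columns as the training matrix
new_texts = ["free lunch today", "claim your prize"]
X_new_v = vectorizer.transform(new_texts)

new_pred_class = nb_model.predict(X_new_v)
new_pred_probs = nb_model.predict_proba(X_new_v)

print(X_train_v.shape[1] == X_new_v.shape[1])  # True
```

Words in the new text that were not seen during training (such as "claim" above) are simply ignored by `transform()`, which is why the column count stays the same.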

Of course, this is just an illustration, and it doesn't make any sense to predict on randomly generated data. Your new data should instead represent something that is related to the training data.

For example, you could reasonably try to predict the reproductive rate of fire ants from Austria with a model that you trained on the reproductive rate of fire ants from Germany.
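If the new, related dataset also comes with true labels, testing the trained model on it could look roughly like the sketch below. The `make_counts` helper and all of the data are hypothetical and exist only to make the example self-contained; `n_features_in_` assumes scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

def make_counts(n_samples):
    """Hypothetical generator: class 1 has higher counts in the first 4 features."""
    y = rng.integers(0, 2, size=n_samples)
    X = rng.poisson(lam=5, size=(n_samples, 8))
    X[:, :4] += rng.poisson(lam=5, size=(n_samples, 4)) * y[:, None]
    return X, y

X_train, y_train = make_counts(500)  # stands in for the original training data
X_new, y_new = make_counts(100)      # stands in for the separate, related dataset

model = MultinomialNB().fit(X_train, y_train)

# Guard against a feature-count mismatch before predicting
if X_new.shape[1] != model.n_features_in_:
    raise ValueError(
        f"expected {model.n_features_in_} features, got {X_new.shape[1]}"
    )

y_new_pred = model.predict(X_new)
new_accuracy = accuracy_score(y_new, y_new_pred)
print("accuracy on the new dataset:", new_accuracy)
```

The model is not refitted at any point: the new dataset only goes through `predict`, so the score reflects how well the already-trained model transfers to it.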

Solution 1
0 2021-06-08 07:14:33
