
train model with dataset A and test with dataset B

In this example I have a hypothetical balanced dataset containing several attributes about college students and one target attribute indicating whether they passed their exam (0 = fail, 1 = pass). I have created and fit a GBM model (xgboost, via its scikit-learn API) on 75% of my original dataset (~18,000 records) and am seeing 80% accuracy on my holdout set (~4,700 records), with 91.6% precision with respect to students who failed the exam.

At this point, I would very much like to use 100% of this dataset as training data and use a new set of 2,000 student records (also balanced) as test data. I want to make predictions for dataset B based on training with dataset A. Ultimately, I would like to offer these predictions to my boss/superior as a way to validate my work, and then begin feeding new data to my model in order to predict how future students might perform on that exam. I am currently stuck on how to use my entire original dataset as training material and the entire new dataset as testing material.

I have attempted to use

X = original data minus target feature
y = original data target feature only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.00001, random_state=0)

and

N = new data minus target feature
z = new data target feature only
N_train, N_test, z_train, z_test = train_test_split(N, z, test_size=0.999, random_state=0)

to create my test and train variables. I then attempt to fit the model and pass the new records to it using:

# Fit model with original X and y data
xg_class.fit(X_train, y_train)

# Generate predictions for the new records in N_test
new_preds = xg_class.predict(N_test)

I'm not getting any errors, but my output is FAR lower than my initial results from splitting dataset A.

Accuracy (75%/25% split of dataset A):  79%
Precision (75%/25% split of dataset A): 91.1% TP / 71.5% TN

Accuracy (99% trained dataset A, tested dataset B): 45%
Precision (99% trained dataset A, tested dataset B): 18.7% TP / 62.4% TN
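For reference, per-class accuracy and precision figures like the ones above can be computed with `sklearn.metrics`. A minimal sketch with toy stand-in labels (the real `z_test` and `new_preds` arrays would go in their place):

```python
from sklearn.metrics import accuracy_score, precision_score

# Toy stand-ins for the true labels and predictions on dataset B
z_true = [1, 0, 1, 1, 0, 0, 1, 0]
preds  = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(z_true, preds)
# precision_score with pos_label=1 treats "pass" as the positive class;
# pos_label=0 gives precision with respect to students who failed
prec_pass = precision_score(z_true, preds, pos_label=1)
prec_fail = precision_score(z_true, preds, pos_label=0)
print(acc, prec_pass, prec_fail)
```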

Is this due to the disparity in size of one or both of my datasets, or is this to be expected? From what I'm reading, this could be a methodology issue arising from using two distinct datasets for training and testing. However, if that were the case, I don't see what the point of building a model would even be, since it could never be fed new data with any reasonable expectation of success. I obviously don't believe that to be true, but my searching hasn't turned up any info on how to perform this part of model evaluation. If anyone could offer some general insight, it would be appreciated.
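One common culprit when a model that looked good on a holdout set collapses on a genuinely new dataset is a shift in the feature distributions between dataset A and dataset B. A crude but quick sanity check is to compare per-feature summary statistics across the two datasets. An illustrative sketch with a hypothetical numeric feature (the values and the two-standard-deviation threshold are just assumptions for the example):

```python
import statistics

# Toy stand-ins for one numeric feature in each dataset
feature_a = [5.0, 6.5, 7.0, 5.5, 6.0, 7.5]   # dataset A (training)
feature_b = [2.0, 3.0, 2.5, 3.5, 2.0, 3.0]   # dataset B (new students)

mean_a, mean_b = statistics.mean(feature_a), statistics.mean(feature_b)
std_a = statistics.stdev(feature_a)

# Crude flag: means differing by more than ~2 training standard deviations
shifted = abs(mean_a - mean_b) > 2 * std_a
print(mean_a, mean_b, shifted)
```

If a feature the model leans on heavily looks very different in B than in A, poor performance on B is expected behavior rather than a bug in the evaluation code.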

Turns out part one of my question has an easy answer: do not use train_test_split(). You assign your particular algorithm to a variable (e.g. model) and then fit it with all of the data, in the same manner as with train_test_split:

model.fit(X, y)

You then pass in the new data (for example, N as the feature data and z as the labels):

new_predictions = model.predict(N)
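Putting those two steps together, a minimal end-to-end sketch of the train-on-A, test-on-B workflow follows. LogisticRegression and the synthetic data here are stand-ins (xgboost's XGBClassifier exposes the same fit/predict interface, so it drops in the same way):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Dataset A: 100% of it becomes training data (features X, target y)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Dataset B: an entirely separate set of records (features N, labels z)
N = rng.normal(size=(50, 3))
z = (N[:, 0] + N[:, 1] > 0).astype(int)

# LogisticRegression stands in for XGBClassifier; no train_test_split needed
model = LogisticRegression()
model.fit(X, y)                      # train on all of dataset A

new_predictions = model.predict(N)   # predict on all of dataset B
print(accuracy_score(z, new_predictions))
```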

The second part of my question still eludes me.
