
How to use a train.csv, test.csv and ground_truth.csv in a machine learning model? (cross validation / python)

Up to now I had only one dataset (df.csv). So far I have used a validation size of 20% and train_test_split for a normal regression model.

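import pandas as pd
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation module

df = pd.read_csv('df.csv')
array = df.values
X = array[:, 0:26]   # first 26 columns are the features
Y = array[:, 26]     # 27th column is the regression target
validation_size = 0.20
seed = 7
# Hold out 20% of the rows for validation
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
num_folds = 10
num_instances = len(X_train)
# Current scikit-learn names this scorer 'neg_mean_squared_error'
scoring = 'neg_mean_squared_error'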

When I have three separate datasets (train.csv / test.csv / ground_truth.csv), how should I handle them? Of course, I use train.csv first, then test.csv and finally ground_truth.csv. But how should I work these different datasets into my model?

When you perform cross-validation, the train and test data are essentially the same dataset, split in different ways so that the model is always scored on rows it was not fit on, which helps to detect overfitting. The number of folds is the number of parts the set is split into, and therefore the number of different train/test splits that are tried.

For example, 5-fold cross-validation splits the training set into 5 pieces, and each time 4 of them are used for training and 1 for testing. So in your case, you have the following options:
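As a rough illustration of the 5-fold split described above, here is a minimal sketch using scikit-learn's `KFold`. The 10-sample toy array and the variable name `fold_sizes` are made up for the example and are not from the original post.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, purely illustrative
X = np.arange(10).reshape(-1, 1)

# 5 folds: each round trains on 4 pieces and tests on the remaining 1
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train on {len(train_idx)} samples, test on {len(test_idx)}")
```

Every sample ends up in the test piece exactly once across the 5 rounds.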

Either perform cross-validation only on the training set and then check against the test set and the ground truth (fitting is done only on the training set, so if everything is done correctly the error on the test set and on the ground truth ought to be similar), or combine training and test into a larger and possibly more representative dataset and then check on the ground truth.
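The first option might look roughly like the sketch below. It rests on several assumptions that the post does not confirm: that ground_truth.csv holds the true target values for the rows of test.csv, in the same order; that the files follow the same 26-feature layout as df.csv; and that `LinearRegression` is an acceptable placeholder model. Synthetic arrays stand in for the three files so that the snippet runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-ins for the three files (assumption: in practice these
# would be loaded with pd.read_csv('train.csv') etc., with ground_truth.csv
# holding the true targets for the rows of test.csv, in the same order).
rng = np.random.default_rng(7)
coef = rng.normal(size=26)
X_train = rng.normal(size=(200, 26))
Y_train = X_train @ coef + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 26))
Y_truth = X_test @ coef + rng.normal(scale=0.1, size=50)

model = LinearRegression()  # placeholder model, not from the original post

# Step 1: cross-validate on the training set only
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cv_scores = cross_val_score(model, X_train, Y_train, cv=kfold,
                            scoring='neg_mean_squared_error')
cv_mse = -cv_scores.mean()  # scorer returns negated MSE, so flip the sign

# Step 2: fit on all training data, then score once on the held-out test rows
model.fit(X_train, Y_train)
test_mse = mean_squared_error(Y_truth, model.predict(X_test))

print(f"CV MSE: {cv_mse:.4f}, test MSE: {test_mse:.4f}")
```

If the two errors are of similar size, the cross-validation estimate was a reasonable guide to performance on unseen data. A test error much larger than the cross-validation error would suggest overfitting or a mismatch between the files.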
