How to create train dataset and test dataset separately in sklearn?

I have a fixed training dataset file train.csv and a separate test dataset file test.csv. I know the train_test_split() method in sklearn can do the splitting, but I want to create the two datasets separately, each one built entirely from its own file.

I have tried

from sklearn import tree
from sklearn.model_selection import train_test_split

# X, Y and X_, Y_ below are the training and test samples/labels (DataFrames)
trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0)
# test_size=1.0 is not an accepted parameter value
trainX_, testX_, trainY_, testY_ = train_test_split(X_, Y_, test_size=1.0)
# ...
dtree = tree.DecisionTreeClassifier(criterion="gini")
dtree.fit(trainX, trainY)
# ...
Y_pred = dtree.predict(testX_)

and then used trainX, trainY for training and testX_, testY_ for prediction.
However, train_test_split() does not accept test_size=1.0, so this fails.

So what's the right way to create training and test datasets separately?

The purpose of train_test_split is to create both a train set and a test set by random sampling. If you want to use all of X_, y_ as a holdout set to test on, you don't need to split it at all; at most you would split X, y. If you already have two files, you can just call dtree.fit(X, y) and dtree.score(X_, y_), assuming you're happy that both sets are accurate, random samples of the data.
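As a rough illustration of the approach above, here is a minimal sketch that builds the two datasets from two separate sources and never calls train_test_split. The column names (f1, f2, label), the helper split_features_labels, and the tiny in-memory CSV contents are assumptions made up for the example; with real files you would pass "train.csv" and "test.csv" to pd.read_csv instead.

```python
import io

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# In-memory stand-ins for train.csv and test.csv so the sketch runs on its own.
# The columns f1, f2 and label are hypothetical, not taken from the question.
TRAIN_CSV = io.StringIO(
    "f1,f2,label\n"
    "0,0,0\n"
    "0,1,0\n"
    "1,0,1\n"
    "1,1,1\n"
)
TEST_CSV = io.StringIO(
    "f1,f2,label\n"
    "0,1,0\n"
    "1,0,1\n"
)

LABEL_COL = "label"  # assumed name of the target column


def split_features_labels(df, label_col=LABEL_COL):
    """Return (features, labels) for one DataFrame (hypothetical helper)."""
    return df.drop(columns=[label_col]), df[label_col]


# Each dataset comes entirely from its own source; nothing is re-split.
train_df = pd.read_csv(TRAIN_CSV)  # real use: pd.read_csv("train.csv")
test_df = pd.read_csv(TEST_CSV)    # real use: pd.read_csv("test.csv")

X, y = split_features_labels(train_df)
X_, y_ = split_features_labels(test_df)

# Fit on the full training file, evaluate on the full test file.
dtree = DecisionTreeClassifier(criterion="gini", random_state=0)
dtree.fit(X, y)
y_pred = dtree.predict(X_)
accuracy = dtree.score(X_, y_)
print(f"holdout accuracy: {accuracy:.2f}")
```

The test file must have the same feature columns, in the same order, as the training file; otherwise predict and score will raise an error or give meaningless results.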

