
Creating train/test CSV files for a 10-fold cross-validation experiment

I have a CSV file (main.csv) with a unique ID column whose values match my image file names (without the .jpg extension).

I want to do 10-fold cross-validation and create train and test CSVs such that the test CSV for each fold contains only 10 percent of the original CSV.

Is there a straightforward, ready-made way to do this?

Basically, I want the resulting train and test CSV files to have exactly the same column names as the original, built so that I can run 10-fold cross-validation with them (i.e., rows randomly shuffled, with 10% selected for testing in each fold).

I don't mind using pandas in Python, or R.

I am not planning to use Scikit-learn for cross-validation because I am using my own manual code, which is why I need the split train and test CSVs for each of the folds.

Perhaps you are looking for this:

from sklearn.model_selection import train_test_split
# X holds the feature (independent) columns from the CSV file; y is the target variable to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

The test_size parameter sets the fraction of the data held out as test data; test_size=0.33 holds out 33%. All remaining rows become training data. For a 10% test set, as in the question, you would pass test_size=0.1.

X_train.to_csv(file_name, encoding='utf-8', index=False)

This saves X_train, the 67% of rows not held out for testing, to a CSV file.

y_train.to_csv(file_name, encoding='utf-8', index=False)

This saves y_train, the target values for the same 67% of rows, to a CSV file. Use a different file_name than the one used for X_train so that the first file is not overwritten.
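The question asks for train and test CSVs that keep the same columns as the original file. One way to get that from train_test_split, shown here only as a sketch, is to pass the whole DataFrame instead of separate X and y. The toy DataFrame stands in for main.csv, and the `label` column and the output file names are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for main.csv; the "label" column is an assumption.
df = pd.DataFrame({
    "ID": [f"img_{i:03d}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})

# Splitting the DataFrame itself keeps every column in both parts.
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

# Hypothetical output location and file names.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.csv")
test_path = os.path.join(out_dir, "test.csv")
train_df.to_csv(train_path, encoding="utf-8", index=False)
test_df.to_csv(test_path, encoding="utf-8", index=False)

print(len(train_df), len(test_df))
```

With a real file, `df = pd.read_csv("main.csv")` would replace the toy DataFrame.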

You can change the random_state value each time you run the code and save the files, so that you get a different shuffle each time. The number itself does not signify anything; it is only a seed for the random shuffle and split, and the same value always reproduces the same split. (If we knew the logic behind each number, it would not be a random split anymore! :))

After this you can apply your manual k-fold procedure.
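Test sets drawn with different random_state values can overlap, so they are not folds in the strict sense. If every row should land in exactly one test CSV, one option is to shuffle the row positions once and cut them into 10 disjoint chunks. The following is a minimal sketch with pandas and NumPy, not a definitive implementation: the function name `write_kfold_csvs`, the file naming pattern, and the toy DataFrame standing in for main.csv are all hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_kfold_csvs(df, out_dir, n_splits=10, seed=42):
    """Write train_fold{k}.csv / test_fold{k}.csv for k = 1..n_splits.

    Rows are shuffled once, then cut into n_splits disjoint test folds.
    Returns a list of (train_path, test_path) tuples.
    """
    os.makedirs(out_dir, exist_ok=True)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))         # row positions in random order
    folds = np.array_split(shuffled, n_splits)  # nearly equal, disjoint chunks

    paths = []
    for k, test_pos in enumerate(folds, start=1):
        is_test = np.zeros(len(df), dtype=bool)
        is_test[test_pos] = True
        train_path = os.path.join(out_dir, f"train_fold{k}.csv")
        test_path = os.path.join(out_dir, f"test_fold{k}.csv")
        # Both files keep all original columns; only the rows differ.
        df[~is_test].to_csv(train_path, encoding="utf-8", index=False)
        df[is_test].to_csv(test_path, encoding="utf-8", index=False)
        paths.append((train_path, test_path))
    return paths


# Toy stand-in for main.csv; the "label" column is an assumption.
df = pd.DataFrame({
    "ID": [f"img_{i:03d}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})
fold_paths = write_kfold_csvs(df, tempfile.mkdtemp(), n_splits=10, seed=42)
print(len(fold_paths))
```

Each pair of files can then be read back with `pd.read_csv` inside the manual cross-validation loop. Changing `seed` gives a different assignment of rows to folds.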

