Creating train/test CSV files for a 10-fold cross-validation experiment

Question

I have a CSV file (main.csv) with a unique ID column. Each ID is also the name of one of my images (without the .jpg extension).

I want to do 10-fold cross-validation and create train and test CSVs such that the test CSV for each fold contains only 10 percent of the original CSV.

Is there a straightforward, ready-made way to do this?

Basically, I want my final train and test CSV files to have exactly the same column names, but built so that I can perform 10-fold cross-validation with them (i.e. rows randomly shuffled, with 10% selected for testing).

I don't mind using pandas in Python, or R.

I am not planning to use scikit-learn for the cross-validation itself because I am using my own manual code, which is why I need separate train and test CSVs for each of the folds.

Answer 1

Perhaps you are looking for this:

from sklearn.model_selection import train_test_split
# X holds the feature (independent) columns from the CSV file; y is the target column to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

test_size=0.33 sets the fraction of rows held out as test data, here 33%. All remaining rows become training data. For the 10% test set in the question, use test_size=0.1.

X_train.to_csv(file_name, encoding='utf-8', index=False)

This code saves the X_train data to a CSV file. With test_size=0.33, that is the remaining 67% of the rows.

y_train.to_csv(file_name, encoding='utf-8', index=False)

This code saves the y_train data, the matching 67% of the target values, to a CSV file.
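Since the question asks for train and test CSVs with identical columns, it may be simpler to split the whole table rather than X and y separately. The following is a minimal sketch of that idea using pandas alone (DataFrame.sample instead of train_test_split). The helper name split_once, the output file names, and the stand-in DataFrame with a "label" column are assumptions made for illustration; in practice df would come from pd.read_csv("main.csv").

```python
import pandas as pd

def split_once(df, test_frac=0.1, seed=42):
    """Return (train, test) DataFrames with the same columns as df.

    Hypothetical helper; assumes df has a unique index.
    """
    # Sample the test rows without replacement; the seed makes the split repeatable.
    test = df.sample(frac=test_frac, random_state=seed)
    # Every row that was not sampled becomes training data.
    train = df.drop(test.index)
    return train, test

# Stand-in for pd.read_csv("main.csv"); the columns here are made up.
df = pd.DataFrame({
    "ID": [f"img_{i:03d}" for i in range(50)],
    "label": [i % 2 for i in range(50)],
})

train, test = split_once(df)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
print(len(train), len(test))
```

Both files keep every column of the original table, including ID, so the image names stay attached to their rows.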

This way, you can change the random_state value in the code each time and save the files again, so that you get a different shuffle. The number itself does not signify anything: it is only a seed for the random number generator, so the same value always reproduces the same split. (If we knew the logic behind each number, it wouldn't be a random split anymore! :)) Note that splits drawn with different seeds are independent of each other, so their test sets can overlap. They are not the disjoint folds of K-fold cross-validation.

After this, you can apply your manual K-fold code.
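If what is needed is ten disjoint test sets, each about 10% of main.csv, one way to get them without scikit-learn is to shuffle the row positions once and cut them into ten chunks. The sketch below assumes pandas and NumPy; the function name write_fold_csvs, the output directory "folds", the file naming scheme, and the stand-in data are all hypothetical choices, not something from the question.

```python
import os

import numpy as np
import pandas as pd

def write_fold_csvs(df, n_folds=10, seed=42, out_dir="folds"):
    """Write train_k.csv / test_k.csv for k = 0..n_folds-1; return the test parts."""
    os.makedirs(out_dir, exist_ok=True)
    # Shuffle the row positions once, then cut them into n_folds near-equal chunks.
    rng = np.random.default_rng(seed)
    positions = rng.permutation(len(df))
    chunks = np.array_split(positions, n_folds)

    test_parts = []
    for k, test_pos in enumerate(chunks):
        test = df.iloc[test_pos]
        # Training rows are all positions outside this fold's chunk
        # (setdiff1d returns them sorted, i.e. in original file order).
        train = df.iloc[np.setdiff1d(positions, test_pos)]
        train.to_csv(os.path.join(out_dir, f"train_{k}.csv"), index=False)
        test.to_csv(os.path.join(out_dir, f"test_{k}.csv"), index=False)
        test_parts.append(test)
    return test_parts

# Stand-in for pd.read_csv("main.csv"); the columns here are made up.
df = pd.DataFrame({
    "ID": [f"img_{i:03d}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})
parts = write_fold_csvs(df)
print([len(p) for p in parts])
```

Each row lands in exactly one test file, and every train/test pair has the same columns as the original table. When the row count is not divisible by the number of folds, np.array_split makes the first few chunks one row larger.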

Solution 1
0 2019-02-13 03:50:38