Creating train/test CSV files for a 10-fold cross-validation experiment
I have a CSV file (main.csv) with a unique ID column whose values also match my image file names (minus the .jpg extension).
I want to do 10-fold cross-validation and create train and test CSVs such that the test CSV for each fold contains only 10 percent of the original CSV.
Is there a straightforward (already implemented) way to do this?
Basically, I want my final train and test CSV files to have exactly the same column names, but built so that I can perform 10-fold cross-validation with them (i.e. randomly shuffled, with 10% selected for testing).
I don't mind using pandas in Python, or R.
I am not planning to use scikit-learn for the cross-validation itself because I am using my own manual code, which is why I need the split train and test CSVs for each fold.
Perhaps you are looking for this:
from sklearn.model_selection import train_test_split
# X holds the feature (independent) columns from the CSV file; y is the target variable to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
test_size=0.33
This parameter sets the fraction of the data held out as test data (here 33%). All other rows become training data. For a 10% test set, as in the question, use test_size=0.1.
X_train.to_csv(file_name, encoding='utf-8', index=False)
This code saves the X_train data, which is the remaining 67% of the rows (not the 33% test portion), to a CSV file.
y_train.to_csv(file_name, encoding='utf-8', index=False)
This code saves the matching y_train data, also 67% of the rows, to a CSV file. Use a different file_name for each call so that the second file does not overwrite the first.
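Since the question asks for train and test files with the same columns as main.csv, and without relying on scikit-learn, the same kind of hold-out split can be sketched with pandas alone. This is only a minimal illustration: the helper name holdout_split, the 10% default, and the output file names are assumptions, and it presumes the DataFrame has a unique index (the default after read_csv).

```python
import pandas as pd


def holdout_split(df, test_frac=0.1, seed=42):
    """Return (train_df, test_df); both keep every column of df."""
    # sample() draws test_frac of the rows at random, reproducibly for a given seed.
    test_df = df.sample(frac=test_frac, random_state=seed)
    # Every row that was not sampled becomes training data
    # (this relies on the index labels being unique).
    train_df = df.drop(test_df.index)
    return train_df, test_df


# Usage, assuming main.csv is in the working directory:
# df = pd.read_csv("main.csv")
# train_df, test_df = holdout_split(df)
# train_df.to_csv("train.csv", encoding="utf-8", index=False)
# test_df.to_csv("test.csv", encoding="utf-8", index=False)
```

Because the whole DataFrame is split rather than separate X and y, the ID column and all other columns end up in both files.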
This way, you can change the random_state value in the code each time and save the files, so that you get a different shuffle for each run. The number itself does not signify anything; it only seeds the random shuffle and split of the dataset. (Perhaps, if we knew the logic behind each number, it wouldn't be a random split anymore!! :))
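After this you can apply your manual K-fold procedure. Keep in mind that repeating random splits with different random_state values produces test sets that can overlap, so this is repeated hold-out rather than strict 10-fold cross-validation, where every row appears in exactly one test set.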
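If strict, non-overlapping folds are wanted, one possible approach is to shuffle the row positions once, cut them into 10 chunks, and write one train/test CSV pair per chunk. The sketch below assumes pandas and numpy are available; the function name write_kfold_csvs, the output directory, and the train_fold{k}.csv / test_fold{k}.csv naming are hypothetical choices, not something from the original post.

```python
import os

import numpy as np
import pandas as pd


def write_kfold_csvs(df, out_dir, n_folds=10, seed=42):
    """Write train_fold{k}.csv and test_fold{k}.csv for k = 1..n_folds (n_folds >= 2)."""
    os.makedirs(out_dir, exist_ok=True)
    rng = np.random.default_rng(seed)
    # Shuffle the row positions once, then cut them into n_folds near-equal chunks.
    shuffled = rng.permutation(len(df))
    folds = np.array_split(shuffled, n_folds)
    for k, test_pos in enumerate(folds, start=1):
        # The k-th chunk is the test set; all other chunks form the training set.
        train_pos = np.concatenate([f for i, f in enumerate(folds, start=1) if i != k])
        df.iloc[test_pos].to_csv(os.path.join(out_dir, f"test_fold{k}.csv"), index=False)
        df.iloc[train_pos].to_csv(os.path.join(out_dir, f"train_fold{k}.csv"), index=False)
    return folds


# Usage, assuming main.csv is in the working directory:
# write_kfold_csvs(pd.read_csv("main.csv"), "folds")
```

Each row lands in exactly one test file, every file keeps the original column names, and each test file holds roughly 10% of the rows (exactly 10% when the row count is divisible by 10).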