
Understanding the dataset train/test split code

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

I am currently reading the book Hands-On ML and I am having some trouble with this code. I don't have much experience with Python, which may be part of the reason, but let me explain my confusion. In the book, the housing problem requires us to create strata so that the dataset has enough instances of each one. That is done with code I didn't copy here; the code I am showing creates the test and training sets using those income categories. The 1st and 2nd lines of code are clear; the 3rd is where I get lost. We create a split with 0.2 for test and 0.8 for train, but what exactly happens from then on? What is the for loop used for?

I have looked through a couple of pages for information but haven't found anything that made the situation clear, so I would really appreciate the help.

Thank you in advance for your answers.

The for loop simply takes the indices produced by the split and selects those rows of the original data to form the training and test sets.
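To make that concrete, here is a minimal sketch of what the loop iterates over. The `housing` frame below is a hypothetical stand-in for the book's data (100 rows, five equally sized income categories), not the real dataset. `split.split(...)` is a generator of `(train_index, test_index)` pairs, and with `n_splits=1` it yields exactly one pair, so the loop body runs once.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in for the book's housing data (assumption, not the real set).
housing = pd.DataFrame({
    "median_income": np.linspace(0.5, 10.0, 100),
    "income_cat": np.repeat([1, 2, 3, 4, 5], 20),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Materialise the generator to see what the for loop walks through:
# a list of (train_index, test_index) pairs of integer position arrays.
pairs = list(split.split(housing, housing["income_cat"]))
train_index, test_index = pairs[0]

# The indices are row positions, so iloc is used here; with a default
# RangeIndex, housing.loc[...] as in the question selects the same rows.
strat_train_set = housing.iloc[train_index]
strat_test_set = housing.iloc[test_index]

print(len(pairs), len(strat_train_set), len(strat_test_set))
print(strat_test_set["income_cat"].value_counts().sort_index())
```

Each income category should keep the same share in both sets, which is the point of stratifying.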

StratifiedShuffleSplit is a better fit if you are doing cross-validation, where you divide the data into training and test sets in K different ways and then average a result over the K iterations.
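As a rough sketch of that use case, assuming toy data from `make_classification` and a `LogisticRegression` model (both chosen only for illustration, neither comes from the question), the splitter can be passed as `cv` to `cross_val_score`, which returns one score per split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Hypothetical toy classification data, only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# K = 5 independent stratified 80/20 splits.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# One accuracy score per split, then the mean over the K iterations.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
mean_score = scores.mean()
print(scores, mean_score)
```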

n_splits must equal the K value, and in your case K is 1, which makes no sense for cross-validation. I think you would be better off using sklearn.model_selection.train_test_split, which makes more sense here.
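A minimal sketch of that alternative, again with a hypothetical stand-in for the `housing` data: `train_test_split` accepts a `stratify` argument, so a single stratified 80/20 split needs no loop.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the book's housing data (assumption, not the real set).
housing = pd.DataFrame({
    "median_income": np.linspace(0.5, 10.0, 100),
    "income_cat": np.repeat([1, 2, 3, 4, 5], 20),
})

# stratify= keeps the income_cat proportions the same in both sets.
strat_train_set, strat_test_set = train_test_split(
    housing,
    test_size=0.2,
    stratify=housing["income_cat"],
    random_state=42,
)

print(len(strat_train_set), len(strat_test_set))
```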


 