
Equivalent of R's createDataPartition in Python

I am trying to reproduce the behavior of R's createDataPartition function in Python. I have a machine learning dataset with a boolean target variable. I would like to split my dataset into a training set (60%) and a testing set (40%).

If I split it totally at random, my target variable won't be properly distributed between the two sets.

I achieve it in R using:

inTrain <- createDataPartition(y=data$repeater, p=0.6, list=F)
training <- data[inTrain,]
testing <- data[-inTrain,]

How can I do the same in Python?

PS: I am using scikit-learn as my machine learning library, together with pandas.

In scikit-learn, you can use the tool train_test_split:

# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn import datasets

# Use Age and Weight to predict a value for the food someone chooses
# (note the double brackets to select multiple columns as a DataFrame)
X_train, X_test, y_train, y_test = train_test_split(table[['Age', 'Weight']],
                                                    table['Food Choice'],
                                                    test_size=0.25)

# Another example using the sklearn pre-loaded datasets:
iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
X, y = X_iris[:, :2], y_iris
X_train, X_test, y_train, y_test = train_test_split(X, y)

This splits the data into:

  • inputs for training
  • inputs for the evaluation data
  • outputs for the training data
  • outputs for the evaluation data

respectively. You can also pass the keyword argument test_size=0.25 to control the fraction of the data held out for testing.
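The four return values and the effect of `test_size` can be seen in a minimal sketch; the toy arrays here are placeholders, not the question's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(16).reshape(8, 2)  # 8 samples, 2 features
y = np.arange(8)                 # 8 matching labels

# test_size=0.25 holds out 25% of the rows (2 of 8) for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
```

Note that plain train_test_split shuffles at random; it does not stratify unless you also pass the stratify argument.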

To split a single dataset, you can use a call like this to get 40% test data:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> data = np.arange(700).reshape((100, 7))
>>> training, testing = train_test_split(data, test_size=0.4)
>>> print(len(data))
100
>>> print(len(training))
60
>>> print(len(testing))
40

The correct answer is sklearn.model_selection.StratifiedShuffleSplit:

Stratified ShuffleSplit cross-validator

Provides train/test indices to split data into train/test sets.

This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
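The answer above names StratifiedShuffleSplit without showing it in use. A minimal sketch follows; the toy arrays and the 80/20 class balance are illustrative assumptions, not from the question's data:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced boolean target: 80 False, 20 True
X = np.arange(200).reshape(100, 2)
y = np.array([False] * 80 + [True] * 20)

# One stratified 60/40 split, analogous to p=0.6 in createDataPartition
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# The True/False ratio (0.2) is preserved in both subsets
print(y_train.mean(), y_test.mean())  # 0.2 0.2
```

Because the splitter is a cross-validator, `split()` yields index arrays rather than the data itself, which mirrors the index-based style of the R snippet in the question.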

The answer provided is not correct. Apparently there is no single function in Python that performs stratified sampling, rather than random sampling, the way createDataPartition does in R.

As mentioned in the comments, the selected answer does not preserve the class distribution of the data. The scikit-learn docs point out that if stratification is required, StratifiedShuffleSplit should be used. The same effect can be achieved with the train_test_split method by passing your target array to the stratify option.

>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split

>>> X, y = datasets.load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)

>>> # show counts of each type after split
>>> print(np.unique(y, return_counts=True))
(array([0, 1, 2]), array([50, 50, 50], dtype=int64))
>>> print(np.unique(y_test, return_counts=True))
(array([0, 1, 2]), array([16, 17, 17], dtype=int64))
>>> print(np.unique(y_train, return_counts=True))
(array([0, 1, 2]), array([34, 33, 33], dtype=int64))
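Putting it together, the original R snippet can be mirrored on a pandas DataFrame. The toy `data` frame and its `repeater` column below are stand-ins for the question's real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the question's DataFrame with a boolean 'repeater' target
data = pd.DataFrame({
    "x": range(10),
    "repeater": [True, False] * 5,
})

# p=0.6 in createDataPartition corresponds to train_size=0.6 here;
# stratify=data['repeater'] preserves the True/False ratio in both parts
training, testing = train_test_split(
    data, train_size=0.6, stratify=data["repeater"], random_state=1
)
print(len(training), len(testing))  # 6 4
```

Passing the whole DataFrame (rather than separate X and y) keeps the result closest to R's `data[inTrain,]` / `data[-inTrain,]` idiom: you get two DataFrames back directly.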
