Python sklearn train_test_split(): how to set which data is taken for training?
For the following scikit-learn function: train_test_split()
Is it possible to tell the function where to set the split of the data?
Or in other words:
Can I tell the function that X_train, y_train should be on the left or right side of the split point, and that X_test, y_test should be on the other side?
(And does the splitting really work this way, or are arbitrary rows of the input data simply taken until the split ratio is met?)
If it is not possible to tell the function which data should be taken for training and testing, is there an equivalent alternative for this use case?
From the scikit-learn documentation: "Split arrays or matrices into random train and test subsets."
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
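You can also turn off shuffling. The data is then split in its original order: the first rows go to the training set and the remaining rows to the test set.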
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
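As a minimal sketch of what this means for paired X and y, the toy data from the documentation example can be split with shuffle=False. The choice of test_size=2 (an absolute row count) is an assumption made here only to keep the result easy to follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same toy data as in the documentation example.
X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# shuffle=False keeps the original row order: the training rows are the
# leading block, the test rows are the trailing block.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, shuffle=False)

print(y_train)  # labels of the first three rows
print(y_test)   # labels of the last two rows
```

Note that this always puts the training data first. To train on the trailing rows instead, the arrays would have to be reordered or sliced by hand.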
The solution using KFold would look like:
import numpy as np
from sklearn.model_selection import KFold

# Toy data: X has 10 rows. y has 20 entries, but only indices 0-9 are used below.
X = np.arange(20).reshape((10, 2))
y = np.arange(20)
print(X)
print(y)

# Without shuffling, KFold uses consecutive blocks of rows as the test set.
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X):
    print("--")
    print("TRAIN size: {0:5d} from: {1:5d} to: {2:5d}".format(train_index.size, train_index[0], train_index[-1]))
    print("TEST size: {0:5d} from: {1:5d} to: {2:5d}".format(test_index.size, test_index[0], test_index[-1]))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
This results in:
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]
 [18 19]]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
--
TRAIN size:     9 from:     1 to:     9
TEST size:     1 from:     0 to:     0
--
TRAIN size:     9 from:     0 to:     9
TEST size:     1 from:     1 to:     1
--
TRAIN size:     9 from:     0 to:     9
TEST size:     1 from:     2 to:     2
--
TRAIN size:     9 from:     0 to:     9
TEST size:     1 from:     3 to:     3
--
TRAIN size:     9 from:     0 to:     9
TEST size:     1 from:     4 to:     4
--
TRAIN size:     9 from:     0 to:     9
TEST size:     1 from:     5 to:     5
--
TRAIN size:     9 from:     0 to:     9
TEST size:     1 from:     6 to:     6
--
TRAIN size:     9 from:     0 to:     9
TEST size:     1 from:     7 to:     7
--
TRAIN size:     9 from:     0 to:     9
TEST size:     1 from:     8 to:     8
--
TRAIN size:     9 from:     0 to:     8
TEST size:     1 from:     9 to:     9
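If a single fixed split is all that is needed, plain slicing at a chosen row index may be a simpler alternative to KFold. The sketch below is only an illustration and not part of scikit-learn: split_at and its train_side parameter are hypothetical names introduced here, and y is assumed to hold one label per row of X.

```python
import numpy as np


def split_at(X, y, split_point, train_side="left"):
    """Hypothetical helper: split X and y at a row index.

    train_side selects whether the rows before ("left") or from
    ("right") split_point are used for training; the rest is the test set.
    """
    X_left, X_right = X[:split_point], X[split_point:]
    y_left, y_right = y[:split_point], y[split_point:]
    if train_side == "left":
        return X_left, X_right, y_left, y_right
    if train_side == "right":
        return X_right, X_left, y_right, y_left
    raise ValueError("train_side must be 'left' or 'right'")


X = np.arange(20).reshape((10, 2))
y = np.arange(10)  # assumption: one label per row of X

# Train on the rows from index 3 onward, test on the first three rows.
X_train, X_test, y_train, y_test = split_at(X, y, 3, train_side="right")
print(y_train)
print(y_test)
```

Because the split point is an explicit index, the caller decides exactly which rows end up on each side, which is what the question asks for.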