
Create random train-test split of defined proportion while maintaining exclusivity of one attribute in each set

I have multiple sets of different lengths and I wish to randomly sort these sets into two supersets such that:

  1. Any one set appears in only one superset, and
  2. The sum of the lengths of all sets in a superset is as close as possible to a defined proportion of the sum of the lengths of all sets.

Example:

Given the following sets:

         Set1  Set2  Set3  Set4  Set5  Set6
Length    1     2     3     4     5     6

These are some possible supersets:

Target Proportion   Superset1                                           Superset2
50% - 50%           (set1,set2,set3,set4) (1+2+3+4) Total length = 10   (set5,set6) (5+6) Total length = 11
50% - 50%           (set4,set6) (4+6) Total length = 10                 (set1,set2,set3,set5) (1+2+3+5) Total length = 11
60% - 40%           (set2,set5,set6) (2+5+6) Total length = 13          (set1,set3,set4) (1+3+4) Total length = 8
90% - 10%           (set2) (2) Total length = 2                         (set1,set3,set4,set5,set6) (1+3+4+5+6) Total length = 19

In reality my sets have lengths in the thousands, but I have used small values for simplicity of illustration.

The purpose of this task is to split a dataset into training and test sets for machine learning in Python with scikit-learn. Usually I would use the train-test split function included with scikit-learn, but it is (to my knowledge) inadequate in this case: rather than a random split of all rows, I need a split in which no row in the training data shares a set with any row in the test data (here, the 'set' of any given row is determined by one of the columns in the dataset).

So far I have simply used the scikit-learn train-test split function on the sets themselves instead of on the actual data rows, but depending on the lengths of the sets, the resulting training and test sets can obviously end up way off the desired proportion.
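One way to get closer to the target proportion while still assigning whole sets to one side or the other is a simple greedy heuristic: shuffle the sets, then hand each set to whichever side is currently furthest below its target share of the total length. This is only a sketch, not part of the question or its accepted answer; `proportional_group_split` and its parameters are made-up names, and a greedy pass is a heuristic, not guaranteed optimal.

```python
import random

def proportional_group_split(set_lengths, train_frac=0.8, seed=0):
    """Randomly assign whole sets to train/test so that the train side's
    total length lands close to train_frac of the overall total.

    set_lengths: dict mapping set name -> length (number of rows).
    """
    rng = random.Random(seed)
    names = list(set_lengths)
    rng.shuffle(names)  # randomise which sets end up on which side

    total = sum(set_lengths.values())
    targets = {"train": train_frac * total, "test": (1 - train_frac) * total}
    sums = {"train": 0, "test": 0}
    assignment = {"train": [], "test": []}

    for name in names:
        # Give the set to the side currently furthest below its target.
        side = max(targets, key=lambda s: targets[s] - sums[s])
        assignment[side].append(name)
        sums[side] += set_lengths[name]

    return assignment, sums

# Using the toy sets from the example above:
lengths = {"set1": 1, "set2": 2, "set3": 3, "set4": 4, "set5": 5, "set6": 6}
split, sums = proportional_group_split(lengths, train_frac=0.5)
```

Every set appears on exactly one side, and with thousands of rows per set the achieved proportion should track the target reasonably well; for an exact optimum this becomes a variant of the (NP-hard) partition problem.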

As an analogy, say I have a list of house prices alongside square footage, garden size, and distance to the nearest school, and I have the same list for various different countries. Eventually the task is to predict house prices for countries where we do not have any house price data but do have all the other data. So in order to evaluate the performance of our prediction algorithm, the training set must contain entirely different countries from the testing set.

I'm drawing a bit of a blank on how to actually achieve this; presumably there is a name for this general problem, but I am unsure what to search for.

Any help or pointers greatly appreciated.

I was able to get this behaviour by using https://github.com/Yoyodyne-Data-Science/GroupStratifiedShuffleSplit . The author describes it thus:

generates stratified and grouped cross validation folds生成分层和分组的交叉验证折叠

This creates the desired split proportion while also ensuring that the groups I need to keep separate remain separate in the split.
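For comparison, scikit-learn also ships a built-in GroupShuffleSplit that guarantees group exclusivity in a random split. Note that its test_size is interpreted as a proportion of groups, not of rows, so on its own it can drift from the target row proportion in the same way described in the question; this is just a minimal sketch with toy data, not the linked library's API.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 21 rows; 'groups' records which set each row belongs to.
groups = np.repeat(
    ["set1", "set2", "set3", "set4", "set5", "set6"], [1, 2, 3, 4, 5, 6]
)
X = np.arange(len(groups)).reshape(-1, 1)  # dummy feature column

gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=groups))

# Every set lands entirely on one side of the split.
assert not set(groups[train_idx]) & set(groups[test_idx])
```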
