I have a dataset that I want to split based on the values of a column. At every iteration, the training set should include all rows except those belonging to 2 of the values, which are held out as the test set.
As an example, we have column x with values a, b, c, d, e and f.
At the moment I am selecting the two values manually, but since I want to try every possible combination, I am not sure how best to do that.
train = df.loc[~df['x'].isin(['a','b'])]
test = df.loc[df['x'].isin(['a','b'])]
How do I change this code to consider all possible combinations?
I would also like to print each combination so I can see which values were used for the training and test sets.
Not tested, but how about using itertools.combinations, like:
import itertools

for holdouts in itertools.combinations(df['x'].unique(), 2):
    print(holdouts)  # the two values held out for the test set
    train = df[~df['x'].isin(holdouts)]
    test = df[df['x'].isin(holdouts)]
You could save an evaluation by computing the mask once, mask = df['x'].isin(holdouts), and then selecting with df[~mask] and df[mask].
Note that .loc isn't necessary for indexing with a boolean mask.
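Putting these suggestions together, here is a minimal sketch that computes the mask once per combination and keeps every split for later inspection. The toy DataFrame, the `value` column and the `splits` list are assumptions made for illustration; only the column `x` and its six values come from the question.

```python
import itertools

import pandas as pd

# Hypothetical toy data standing in for the real DataFrame from the question.
df = pd.DataFrame({
    "x": ["a", "b", "c", "d", "e", "f", "a", "c"],
    "value": range(8),
})

splits = []  # hypothetical container: (holdouts, train, test) per combination
for holdouts in itertools.combinations(df["x"].unique(), 2):
    # Evaluate the membership test once and reuse it for both subsets.
    mask = df["x"].isin(holdouts)
    train, test = df[~mask], df[mask]
    splits.append((holdouts, train, test))
    print(f"test values: {holdouts}, train rows: {len(train)}, test rows: {len(test)}")
```

With six distinct values there are 15 pairs, so the loop runs 15 times. If the DataFrame is large, it may be preferable to train and evaluate inside the loop rather than storing every split.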
itertools.combinations should work.