
Train test split for ensuring all categories are included in train set

Let's say there are some 20 categorical columns in the data, each with a different set of unique categorical values. A train/test split has to be done, and one needs to ensure that all unique categories are included in the train set. How can this be done? I have not tried it yet, but should all these columns be passed to the stratify argument?

Yes. That's correct.

For demonstration, I'm using the Melbourne Housing Dataset.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data and keep only the columns used in this demonstration
Meta = pd.read_csv('melb_data.csv')
Meta = Meta[["Rooms", "Type", "Method", "Bathroom"]]
print(Meta.head())

# Category proportions in the full data set
print("\nBefore split -- Method feature distribution\n")
print(Meta.Method.value_counts(normalize=True))
print("\nBefore split -- Type feature distribution\n")
print(Meta.Type.value_counts(normalize=True))

# Stratify on the combination of both categorical columns
train, test = train_test_split(Meta, test_size=0.2, stratify=Meta[["Method", "Type"]])

# Category proportions in the train split
print("\nAfter split -- Method feature distribution\n")
print(train.Method.value_counts(normalize=True))
print("\nAfter split -- Type feature distribution\n")
print(train.Type.value_counts(normalize=True))

Output

   Rooms Type Method  Bathroom
0      2    h      S       1.0
1      2    h      S       1.0
2      3    h     SP       2.0
3      3    h     PI       2.0
4      4    h     VB       1.0

Before split -- Method feature distribution

S     0.664359
SP    0.125405
PI    0.115169
VB    0.088292
SA    0.006775
Name: Method, dtype: float64

Before split -- Type feature distribution

h    0.695803
u    0.222165
t    0.082032
Name: Type, dtype: float64

After split -- Method feature distribution

S     0.664396
SP    0.125368
PI    0.115151
VB    0.088273
SA    0.006811
Name: Method, dtype: float64

After split -- Type feature distribution

h    0.695784
u    0.222202
t    0.082014
Name: Type, dtype: float64
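
The proportions above show that the distributions are preserved, but the question asks for something narrower: that no category is missing from the train set. A minimal sketch of a direct check follows. The helper name missing_categories and the toy data are my own illustrations, not part of pandas, scikit-learn, or the housing data set.

```python
import pandas as pd

def missing_categories(full, train, columns):
    """Return {column: set of values present in `full` but absent from `train`}."""
    missing = {}
    for col in columns:
        absent = set(full[col].dropna().unique()) - set(train[col].dropna().unique())
        if absent:
            missing[col] = absent
    return missing

# Toy data standing in for the housing frame (values are illustrative only).
toy = pd.DataFrame({
    "Method": ["S", "S", "SP", "PI", "VB", "SA"],
    "Type":   ["h", "u", "h", "t", "h", "u"],
})
toy_train = toy.iloc[:4]  # deliberately leaves out the rows holding "VB" and "SA"

gaps = missing_categories(toy, toy_train, ["Method", "Type"])
print(gaps)  # an empty dict would mean every category is covered
```

On the real data the call would look like missing_categories(Meta, train, ["Method", "Type"]), and an empty result means the train split covers every category of those columns.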

You want all categories from all categorical variables to be in your train split.

Using:

train, test = train_test_split(Meta, test_size = 0.2, stratify=Meta[["Method", "Type"]])

stratifies on every combination of Method and Type, so the categories end up represented in both the train split and the test split. This is more than what you want.

Note that the more categorical variables you stratify on, the more likely it is that some combination of categories has only one record. When that happens, the split is not performed and an error is raised.

Error message:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
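
If stratifying on many columns fails this way, one possible workaround is to do a plain random split and then move a row from test to train for each category the train set lacks. The following is only a sketch under assumptions: the function name split_with_all_categories is hypothetical, the DataFrame index is assumed to be unique, and the toy data is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_all_categories(df, columns, test_size=0.2, random_state=0):
    """Random split, then repair: every value of `columns` ends up in train.

    Assumes df has a unique index so that single rows can be moved by label.
    """
    train, test = train_test_split(df, test_size=test_size, random_state=random_state)
    for col in columns:
        absent = set(df[col].dropna().unique()) - set(train[col].dropna().unique())
        for value in absent:
            # A value missing from train must live in test; move one such row over.
            idx = test.index[test[col] == value][0]
            train = pd.concat([train, test.loc[[idx]]])
            test = test.drop(index=idx)
    return train, test

# Toy data with rare categories ("SP", "SA", "t") that a random split can miss.
toy = pd.DataFrame({
    "Method": ["S"] * 8 + ["SP", "SA"],
    "Type":   ["h"] * 5 + ["u"] * 4 + ["t"],
})
train, test = split_with_all_categories(toy, ["Method", "Type"])
print(sorted(train["Method"].unique()), sorted(train["Type"].unique()))
```

The trade-off is that the test split shrinks slightly and rare categories may not appear in it at all, which matches the original requirement (coverage in train only) but gives up the proportion-preserving behaviour of stratify.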
