Cannot perform StratifiedKFold

Question

I want to split my data into train and test sets (80/20) and then perform StratifiedKFold.

So let's take some data and split it 80/20 using train_test_split:

import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


X=df.drop(df.columns[[1]], axis=1)
y=np.array(df[1])
y[y=='M']=0
y[y=='B']=1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now, when I try to inspect the resulting folds, I get an error:

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.

I've read that this is a common error with this function, but I'm not sure how to solve it.

I saw that the same thing works on the iris data:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
print(X.shape) # initial dataset size
# (150, 4)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)

Here we do see the result. What am I doing differently that makes this function fail on my data?

Answer 1

You need to recast your y as an array of integers:

y = y.astype(int)

As far as I can tell, the reason is this: y started as an array of strings with dtype object, and assigning integers to its elements one mask at a time (first y=='M', then y=='B') replaces the values but leaves the array's dtype as object. scikit-learn cannot infer a target type from an object array of integers, so it reports 'unknown'.
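To make the dtype behaviour concrete, here is a minimal sketch using numpy only. The `labels` array is a hypothetical stand-in for `np.array(df[1])`, not the real data, and the names `y_int` and `labels` are illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-in for np.array(df[1]): string labels stored as dtype object.
labels = np.array(['M', 'B', 'B', 'M'], dtype=object)

# Same in-place replacement as in the question.
y = labels.copy()
y[y == 'M'] = 0
y[y == 'B'] = 1

# The values are now Python ints, but the container dtype has not changed.
print(y.dtype)

# Casting produces a new array with a real integer dtype.
y_int = y.astype(int)
print(y_int.dtype)
```

The first print should show that the array is still of dtype object, while the second shows an integer dtype after the cast.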

Answer 2

When you replace the values of df[1] with zeros and ones, the array keeps the object dtype it had when it held strings:

y=np.array(df[1])
y[y=='M']=0
y[y=='B']=1
y.dtype

dtype('O')

You can either pass the original string labels as they are:

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
kfold.get_n_splits(X,df[1])
5

Or build an integer target in a single step:

y = (df[1] == 'B').astype(int)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
kfold.get_n_splits(X,y)

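An object array that holds integers is something in between, and scikit-learn cannot determine its target type. Note that get_n_splits only returns n_splits; the target type is actually checked when kfold.split(X, y) is called, and both versions of y above pass that check.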
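As a rough end-to-end sketch of the second option, the following runs the split itself rather than only `get_n_splits`. The small synthetic `df` is an assumption standing in for the WDBC file (column 1 holds the 'M'/'B' labels), and `shapes` is an illustrative name. Because `X` is a DataFrame here, rows are selected by position with `.iloc`; `X[train]` would be interpreted as a column lookup.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-in for the WDBC frame: column 1 holds 'M'/'B' labels.
rng = np.random.default_rng(0)
df = pd.DataFrame({0: range(20), 1: ['M', 'B'] * 10, 2: rng.normal(size=20)})

X = df.drop(columns=[1])
y = (df[1] == 'B').astype(int)  # integer target built in one step

# This is the check StratifiedKFold relies on when splitting.
print(type_of_target(y))

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
shapes = []
for train, validation in kfold.split(X, y):
    # train and validation are positional row indices, so use .iloc.
    shapes.append((X.iloc[train].shape, X.iloc[validation].shape))
print(shapes)
```

With 20 rows and 5 folds, each validation fold should hold 4 rows and each training fold 16.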
