Cannot perform StratifiedKFold

I want to split my sample into train and test sets (80/20) and then perform StratifiedKFold.

So let's take some data and split it 80/20 using train_test_split:

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

# Column 1 holds the diagnosis label ('M' or 'B'); everything else is a feature
X=df.drop(df.columns[[1]], axis=1)
y=np.array(df[1])
y[y=='M']=0
y[y=='B']=1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now, when I try to look at the resulting folds, I get an error:

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.

I've read about it, and it seems to be a common error with this function, but I'm not sure how to solve it.

I saw that this works on the iris data:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
print(X.shape) # initial dataset size
# (150, 4)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)

And there we do see the result. What am I doing differently that makes this function fail on my data?

You need to recast your y as an array of integers:

y = y.astype(int)

I'm not entirely sure of the mechanism, but my guess is that because y started out as an array of strings and its elements were replaced one by one (first where y=='M', then where y=='B'), the individual values became integers while the array itself kept its original dtype.

After you replace the values of df[1] with zeros and ones, the array still has the dtype it had when it held strings (object in NumPy):

y=np.array(df[1])
y[y=='M']=0
y[y=='B']=1
y.dtype

dtype('O')
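
To see how scikit-learn classifies such a target, one can call type_of_target from sklearn.utils.multiclass, the helper whose result appears in the error message. The sketch below is only an illustration: the five-element labels array is a hypothetical stand-in for df[1], not the real data set.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-in for np.array(df[1]): an object array of 'M'/'B' labels.
# dtype=object mirrors what pandas returns for a column of strings.
labels = np.array(['M', 'B', 'B', 'M', 'B'], dtype=object)

# Same element-wise replacement as in the question
y = labels.copy()
y[y == 'M'] = 0
y[y == 'B'] = 1

# The values are now Python ints, but the container is still an object array
print(y.dtype)

kind_before = type_of_target(y)             # target type as scikit-learn sees it
kind_after = type_of_target(y.astype(int))  # target type after casting to integers
print(kind_before, kind_after)
```

On the scikit-learn versions I am aware of, the first call reports 'unknown' and the second reports 'binary', which matches the error in the question.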

You can either pass the original string labels directly:

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
kfold.get_n_splits(X,df[1])
5

Or build an integer target in one step:

y = (df[1] == 'B').astype(int)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
kfold.get_n_splits(X,y)

With something in between (integers stored in an object array), scikit-learn cannot determine the target type and reports 'unknown'.
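
Putting the pieces together, here is a minimal sketch of the workflow from the question: an 80/20 split followed by StratifiedKFold on the training part. It makes several assumptions. The data frame is a small synthetic stand-in for the WDBC file so that it runs offline, stratify=y is an optional addition that is not in the question, and rows are selected with .iloc because X is a DataFrame and split() yields positional indices (X[train] would be read as a column lookup). It also calls split() rather than get_n_splits(), since get_n_splits() only returns the configured number of folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the WDBC file (hypothetical data): 100 rows,
# with column 1 holding the 'M'/'B' labels as in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)))
df[1] = ['M'] * 40 + ['B'] * 60

X = df.drop(columns=[1])
y = (df[1] == 'B').astype(int)  # integer target: 1 for 'B', 0 for 'M'

# 80/20 split; stratify=y (an optional addition) keeps class ratios similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=8)

# Cross-validate on the training part only
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
fold_shapes = []
for train_idx, val_idx in kfold.split(X_train, y_train):
    # Positional indexing, because X_train is a DataFrame
    shapes = (X_train.iloc[train_idx].shape, X_train.iloc[val_idx].shape)
    fold_shapes.append(shapes)
    print(*shapes)
```

With the real data, the df construction would be replaced by the pd.read_csv call from the question; the rest should carry over unchanged.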
