
Python, machine learning - Perform a grid search on custom validation set

I am dealing with an unbalanced classification problem, where my negative class is 1000 times more numerous than my positive class. My strategy is to train a deep neural network on a balanced (50/50 ratio) training set (I have enough simulated samples), and then use an unbalanced (1/1000 ratio) validation set to select the best model and optimise the hyperparameters.

Since the number of hyperparameters is significant, I want to use scikit-learn's RandomizedSearchCV, i.e. a random grid search.

To my understanding, scikit-learn's GridSearchCV applies a metric to folds drawn from the training set to select the best set of hyperparameters. In my case, however, this means that the grid search will select the model that performs best against balanced training data, and not against more realistic unbalanced data.

My question would be: is there a way to grid search with the performances estimated on a specific, user-defined validation set?

As suggested in the comments, what you need is PredefinedSplit. It is described in the question here.

As for how it works, see the example given in the documentation:

import numpy as np
from sklearn.model_selection import PredefinedSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# This is what you need: one fold label per sample
test_fold = [0, 1, -1, 1]

ps = PredefinedSplit(test_fold)
print(ps.get_n_splits())
# OUTPUT
# 2

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# OUTPUT
# TRAIN: [1 2 3] TEST: [0]
# TRAIN: [0 2] TEST: [1 3]

As you can see, test_fold is a list with one entry per sample: entry i is the index of the test fold that sample i belongs to. A value of -1 marks samples that are never placed in a test (validation) set, so they are always part of the training set.

So in the above code, test_fold = [0, 1, -1, 1] means that the 1st validation set (the samples whose value in test_fold is 0) contains index 0, and the 2nd validation set (the samples whose value is 1) contains indices 1 and 3. Index 2 is marked -1, so it only ever appears in the training sets.
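To make the rule concrete, here is a small pure-Python sketch that mimics how fold labels turn into train/test index pairs. The helper name splits_from_test_fold is made up for illustration, and this is not scikit-learn's actual implementation, only a restatement of the behaviour described above.

```python
def splits_from_test_fold(test_fold):
    """Yield (train_indices, test_indices), one pair per distinct fold label.

    Hypothetical helper for illustration only: a label of -1 never forms
    a test fold, so those samples end up in every training set.
    """
    fold_ids = sorted({f for f in test_fold if f != -1})
    for fold_id in fold_ids:
        # Samples carrying this label form the test set of this split.
        test = [i for i, f in enumerate(test_fold) if f == fold_id]
        # Everything else, including the -1 samples, is used for training.
        train = [i for i, f in enumerate(test_fold) if f != fold_id]
        yield train, test


splits = list(splits_from_test_fold([0, 1, -1, 1]))
for train, test in splits:
    print("TRAIN:", train, "TEST:", test)
```

The two pairs it prints contain the same indices as the PredefinedSplit output shown earlier.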

But since you say that you already have X_train and X_test, if you want your validation set to come only from X_test, then you need to do the following:

import numpy as np
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

my_test_fold = []

# put -1 here, so these samples always stay in the training set
for i in range(len(X_train)):
    my_test_fold.append(-1)

# for all the following indices, assign 0, so they are put in the single validation (test) fold
for i in range(len(X_test)):
    my_test_fold.append(0)

# "..." stands for your estimator, parameter distributions and other options
clf = RandomizedSearchCV(..., cv=PredefinedSplit(test_fold=my_test_fold))

# Combine X_train and X_test into one array, in the same order as my_test_fold
clf.fit(np.concatenate((X_train, X_test), axis=0), np.concatenate((y_train, y_test), axis=0))
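For reference, a self-contained sketch of the whole flow might look like the following. Everything specific in it is an assumption chosen to keep the example small: the synthetic Gaussian data, the make_data helper, LogisticRegression standing in for the deep network, a 1/100 rather than 1/1000 class ratio, the candidate values for C, and average_precision as the scoring metric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

rng = np.random.default_rng(0)


def make_data(n_neg, n_pos):
    """Toy data (hypothetical): positives are shifted by +1.5 on each feature."""
    X = np.vstack([rng.normal(0.0, 1.0, (n_neg, 2)),
                   rng.normal(1.5, 1.0, (n_pos, 2))])
    y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    return X, y


X_train, y_train = make_data(200, 200)   # balanced training set
X_val, y_val = make_data(2000, 20)       # unbalanced validation set (toy ratio)

# -1 for training samples, 0 for the single validation fold.
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.zeros(len(X_val), dtype=int)])
ps = PredefinedSplit(test_fold)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
    n_iter=4,
    scoring="average_precision",  # assumed metric; pick one suited to your problem
    cv=ps,
    refit=False,                  # do not refit on training + validation data
    random_state=0,
)
search.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

# Retrain the selected configuration on the balanced training set only.
best_model = LogisticRegression(max_iter=1000, **search.best_params_)
best_model.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

One design point worth noting: with the default refit=True, the search refits the best estimator on all the data passed to fit, which here would include the validation samples. Setting refit=False and retraining on X_train alone, as sketched above, keeps the final model trained only on the balanced set.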
