I am dealing with an unbalanced classification problem, where my negative class is 1000 times more numerous than my positive class. My strategy is to train a deep neural network on a balanced (50/50 ratio) training set (I have enough simulated samples), and then use an unbalanced (1/1000 ratio) validation set to select the best model and optimise the hyperparameters.
Since the number of hyperparameters is significant, I want to use scikit-learn's RandomizedSearchCV, i.e. a randomized grid search.
To my understanding, scikit-learn's GridSearchCV evaluates a metric on cross-validation folds drawn from the training set to select the best set of hyperparameters. In my case, however, this means that the grid search will select the model that performs best against folds of a balanced training set, and not against more realistic unbalanced data.
My question is: is there a way to run the grid search with performance estimated on a specific, user-defined validation set?
As suggested in the comments, the thing you need is PredefinedSplit.
As for how it works, you can see the example given in the documentation:
import numpy as np
from sklearn.model_selection import PredefinedSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# This is what you need
test_fold = [0, 1, -1, 1]
ps = PredefinedSplit(test_fold)

ps.get_n_splits()
# OUTPUT: 2

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# OUTPUT:
# TRAIN: [1 2 3] TEST: [0]
# TRAIN: [0 2] TEST: [1 3]
As you can see, test_fold takes a list with one entry per sample, which defines how the data are split: a value of -1 marks samples that are never placed in a validation set. So in the code above, test_fold = [0, 1, -1, 1] says that the first validation set consists of the samples whose test_fold value is 0 (index 0), and the second of those whose value is 1 (indices 1 and 3).
But if you already have X_train and X_test, and you want your validation set to come only from X_test, then you need to do the following:
import numpy as np

my_test_fold = []

# -1 for every training sample, so they are never used for validation
for i in range(len(X_train)):
    my_test_fold.append(-1)

# 0 for every test sample, so together they form the single validation split
for i in range(len(X_test)):
    my_test_fold.append(0)

# PredefinedSplit indexes into the full data passed to fit, so combine
# X_train and X_test (and likewise y) into one array:
clf = RandomizedSearchCV(..., cv=PredefinedSplit(test_fold=my_test_fold))
clf.fit(np.concatenate((X_train, X_test), axis=0),
        np.concatenate((y_train, y_test), axis=0))
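Putting it all together, here is a minimal end-to-end sketch; the classifier, the parameter distributions, and the data shapes are illustrative assumptions, not taken from the question:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

rng = np.random.RandomState(0)

# Illustrative stand-ins: a balanced training set and an unbalanced validation set
X_train, y_train = rng.randn(200, 5), np.repeat([0, 1], 100)
X_val = rng.randn(100, 5)
y_val = np.array([1] * 5 + [0] * 95)  # ~5% positives, mimicking the imbalance

# -1: always in the training portion; 0: the single validation split
test_fold = np.concatenate([np.full(len(X_train), -1), np.full(len(X_val), 0)])

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 20], "max_depth": [2, 4, None]},
    n_iter=4,
    cv=PredefinedSplit(test_fold),
    scoring="average_precision",  # a metric better suited to unbalanced data
    random_state=0,
)
search.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))
print(search.best_params_)
```

Note that with scikit-learn's default refit=True, the winning hyperparameters are refit on the concatenated data (training plus validation); pass refit=False if you prefer to refit on the balanced training set yourself.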