Getting probabilities of best model for RandomizedSearchCV

Question

I'm using RandomizedSearchCV to find the best parameters with 10-fold cross-validation and 100 iterations. This works well. But now I would also like to get the predicted probabilities for each test data point (like predict_proba) from the best-performing model.

How can this be done?

I see two options. First, perhaps it is possible to get these probabilities directly from RandomizedSearchCV. Second, I could take the best parameters from RandomizedSearchCV and then run a 10-fold cross-validation again (with the same seed, so that I get the same splits) using these best parameters.

Edit: Is the following code correct for getting the probabilities of the best-performing model? X is the training data, y contains the labels, and model is my RandomizedSearchCV containing a Pipeline with missing-value imputation, standardization and an SVM.

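import numpy as np
from sklearn.model_selection import StratifiedKFold

cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = np.full((y.size, nrClasses), np.nan)  # one row of class probabilities per sample
best_model = model.fit(X, y).best_estimator_

for train, test in cv_outer.split(X, y):
    # Refit on the training fold, then predict the held-out fold
    probas_ = best_model.fit(X[train], y[train]).predict_proba(X[test])
    y_prob[test] = probas_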

Answer 1

If I understood correctly, you would like to get the individual scores of every sample in your test split for the split with the highest CV score. If that is the case, you have to use one of the CV generators that give you control over the split indices, such as those described here: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators

If you want to calculate scores for a new test sample with the best-performing model, the predict_proba() function of RandomizedSearchCV would suffice, provided that your underlying model supports it.

Example:

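import numpy
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
# svc is your classifier, created beforehand
scores = cross_val_score(svc, X, y, cv=skf, n_jobs=-1)
max_score_split = numpy.argmax(scores)  # index of the fold with the highest score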

Now that you know the highest score occurs at fold max_score_split, you can retrieve that split yourself and fit your model on it.

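# split() returns a generator, so turn it into a list before indexing;
# StratifiedKFold also needs y to stratify the folds
train_indices, test_indices = list(skf.split(X, y))[max_score_split]
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
model.fit(X_train, y_train)  # this is your model object, which should have been created before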

And finally get your predictions by:

model.predict_proba(X_test)

I haven't tested the code myself, but it should work with minor modifications.
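For the other case mentioned in this answer (scoring new samples with the best-performing model), here is a minimal sketch. It assumes the search is left at its default refit=True, so that the search object refits the best parameters on the full training data and can then be used directly for prediction. The synthetic data, the pipeline step names and the parameter values are placeholders of my own, not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption; replace with your own X and y).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Pipeline similar to the one described in the question (step names are made up).
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("svm", SVC(probability=True, random_state=0)),  # probability=True enables predict_proba
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "svm__C": [0.1, 1, 10, 100],
        "svm__gamma": ["scale", 0.01, 0.1],
    },
    n_iter=5,  # kept small so the sketch runs quickly
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_train, y_train)

# With refit=True (the default), this delegates to best_estimator_.
proba_new = search.predict_proba(X_new)
print(proba_new.shape)
```

Each row of proba_new holds the class probabilities for one new sample, with columns ordered as in search.classes_.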

Answer 2

You need to look in cv_results_. This gives you the score for each fold, along with the mean score, fitting time and so on.

If you want predict_proba() for each of the iterations, the way to do this would be to loop through the params given in cv_results_, re-fit the model for each of them, and then predict the probabilities, since the individual models are not cached anywhere, as far as I know.

best_params_ gives you the best-fit parameters, in case you want to train a model using only the best parameters next time.
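The loop described above could look roughly like the following sketch. It assumes a plain SVC as the estimator and uses synthetic data; the variable names (probas_per_candidate and so on) are my own. Each candidate is a fresh clone of the unfitted estimator, so no state leaks between iterations.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption; replace with your own X and y).
X, y = make_classification(n_samples=120, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

search = RandomizedSearchCV(
    SVC(probability=True, random_state=0),
    param_distributions={"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]},
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Re-fit one model per sampled parameter setting and collect its probabilities.
probas_per_candidate = []
for params in search.cv_results_["params"]:
    candidate = clone(search.estimator).set_params(**params)
    candidate.fit(X_train, y_train)
    probas_per_candidate.append(candidate.predict_proba(X_test))

print(len(probas_per_candidate))
```

The entries of probas_per_candidate are in the same order as the rows of cv_results_, so they can be matched against mean_test_score or rank_test_score.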
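As a sketch of how best_params_ could be used to get a probability for every data point, the example below combines it with cross_val_predict and method="predict_proba", which returns out-of-fold predictions: each sample is predicted by a model that did not see it during fitting. The data and parameter values are placeholders, and a plain SVC stands in for the full pipeline. Note that the same data was already used to pick the parameters, so scores computed from these probabilities will likely be somewhat optimistic.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in data (assumption; replace with your own X and y).
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

search = RandomizedSearchCV(
    SVC(probability=True, random_state=0),
    param_distributions={"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]},
    n_iter=4,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

# Fresh, unfitted estimator configured with the best parameters found.
best = clone(search.estimator).set_params(**search.best_params_)

# Out-of-fold class probabilities: one row per sample in X.
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = cross_val_predict(best, X, y, cv=cv_outer, method="predict_proba")
print(y_prob.shape)
```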

See cv_results_ in the information page http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
