
Sklearn RandomizedSearchCV, evaluate each random model

I want to try to optimize the parameters of a RandomForest regression model, in order to find the best trade-off between accuracy and prediction speed. My idea was to use a randomized search, and to evaluate the speed/accuracy of each of the tested random parameter configurations.

So, I prepared a parameter grid, and I can run k-fold CV on the training data:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    ## parameter grid for random search
    n_estimators = [1, 40, 80, 100, 120]
    max_features = ['auto', 'sqrt']
    max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
    max_depth.append(None)
    min_samples_split = [2, 5, 10]
    min_samples_leaf = [1, 2, 4]
    bootstrap = [True, False]
    random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

    rf = RandomForestRegressor()
    rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, n_jobs = -1)
    rf_random.fit(X_train, y_train)


I found the way to get the parameters of the best model by using:

    rf_random.best_params_

However, I wanted to iterate through all the random models, check their parameter values, evaluate them on the test set, and write the parameter values, accuracy and speed to an output dataframe, so something like:

    import time
    from sklearn.metrics import mean_squared_error

    for model in rf_random:
        start_time_base = time.time()
        y_pred = model.predict(X_test)  # evaluate the current random model on the test data
        pred_time = (time.time() - start_time_base) / X_test.shape[0]
        rmse = mean_squared_error(y_test, y_pred, squared=False)
        params = ...  # something to get the values of the parameters for this model

        # write to dataframe...

Is there a way to do that? Just to be clear, I'm asking about the iteration over models and parameters, not the writing-to-the-dataframe part :) Or should I go for a different approach altogether?

You can get the df you're looking to create, with model parameters and CV results, by calling rf_random.cv_results_, which you can put straight into a df: all_results = pd.DataFrame(rf_random.cv_results_).
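For instance, something along these lines (a minimal sketch assuming rf_random has already been fitted as in the question; the column names are the standard cv_results_ keys):

    import pandas as pd

    all_results = pd.DataFrame(rf_random.cv_results_)

    # mean_fit_time / mean_score_time are the average training / prediction
    # times over the CV folds, so they already give a rough speed measure
    summary = all_results[['params', 'mean_fit_time', 'mean_score_time',
                           'mean_test_score', 'rank_test_score']]
    print(summary.sort_values('rank_test_score').head())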

In practice this is usually considered a good measure of all the metrics you're looking for, so what you describe in the question is often unnecessary. However, if you want to go through with what you describe above (i.e. evaluate against a held-out test set rather than cross-validate), you can go through this df and define a model with each parameter combination in a loop:

    for i in range(len(all_results)):

        # rebuild the model from the i-th tested parameter combination
        model = RandomForestRegressor(
            n_estimators = all_results['param_n_estimators'][i],
            max_features = all_results['param_max_features'][i],
            # ... and so on for the remaining parameters in random_grid
        )

        model.fit(X_train, y_train)

        start_time_base = time.time()
        y_pred = model.predict(X_test)  # evaluate the current random model on the test data
        pred_time = (time.time() - start_time_base) / X_test.shape[0]

        # Evaluate predictions however you see fit
As the trained model is only kept for the best parameter combination in RandomizedSearchCV, you'll need to retrain the models in this loop.
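One way to avoid copying every parameter by hand is to unpack the params dicts stored in cv_results_. A rough sketch, assuming the same X_train/X_test/y_train/y_test split and the RMSE/timing measures from the question:

    import time
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    rows = []
    for params in all_results['params']:
        # each entry is a dict like {'n_estimators': 80, 'max_features': 'sqrt', ...},
        # so it can be unpacked straight into the constructor
        model = RandomForestRegressor(**params)
        model.fit(X_train, y_train)

        start = time.time()
        y_pred = model.predict(X_test)
        pred_time = (time.time() - start) / X_test.shape[0]

        rmse = mean_squared_error(y_test, y_pred, squared=False)
        rows.append({**params, 'rmse': rmse, 'time_per_sample': pred_time})

    results_df = pd.DataFrame(rows)

results_df then has one row per tested configuration, with its parameter values, test-set RMSE and per-sample prediction time.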
