
How does H2O select the best variables for GLM?

I put my predictor variables into the grid search below. As far as I understand, this grid search selects the best variables to use in the model and discards the others. However, I do not know which algorithm or selection metric it uses to pick them. Can somebody tell me how it decides which variables to keep and which to throw away?

The function:

  grid.f <-               h2o.grid(algorithm = "glm",                                     # Setting algorithm type
                                   grid_id = "grid.f",                                    # Id so retrieving information on iterations will be easier later
                                   x = predictors,                                        # Setting predictive features
                                   y = response,                                          # Setting target variable
                                   training_frame = data,                                 # Setting training set
                                   hyper_params = hyper_parameters,                       # Setting alpha values for iterations
                                   remove_collinear_columns = T,                          # Parameter to remove collinear columns
                                   lambda_search = T,                                     # Setting parameter to find optimal lambda value
                                   seed = p.seed,                                         # Setting to ensure replicable results
                                   keep_cross_validation_predictions = F,                 # Not saving cross-validation predictions
                                   compute_p_values = F,                                  # Not computing p-values of the coefficients
                                   family = family,                                       # Distribution type used
                                   standardize = T,                                       # Standardizing continuous variables
                                   nfolds = p.folds,                                      # Number of cross-validation folds
                                   #max_active_predictors = p.max,                         # Setting for number of features
                                   fold_assignment = "Modulo",                            # Specifying fold assignment type to use for cross-validation
                                   link = p.link)                                         # Link function for distribution

Even without grid search, H2O-3's GLM uses L1 regularization (aka "lasso"), which can shrink the coefficients of uninformative variables exactly to zero, effectively removing them from the model.
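For example, once the grid has finished you can look at which coefficients were shrunk to zero in the best model. This is only a sketch: it assumes the grid id `"grid.f"` from your code and sorts by residual deviance, and it needs a running H2O cluster.

    # Sort the grid's models by cross-validated residual deviance, best first
    sorted.grid <- h2o.getGrid("grid.f", sort_by = "residual_deviance", decreasing = FALSE)
    best.model  <- h2o.getModel(sorted.grid@model_ids[[1]])

    # Coefficients that the lasso penalty drove exactly to zero
    # were effectively dropped from the model
    coefs <- h2o.coef(best.model)
    names(coefs)[coefs == 0]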

Elastic net is the blending of the L1 (lasso) and L2 (ridge regression) penalties: alpha controls the mix between the two, and lambda controls the overall penalty strength.
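Concretely, the elastic net penalty added to the model's loss has the standard form (written here in generic notation, not H2O-specific symbols):

    lambda * ( alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2 )

With alpha = 1 you get pure lasso, which aggressively zeroes out variables; with alpha = 0 you get pure ridge, which shrinks coefficients but never eliminates them; and lambda_search = TRUE tries a sequence of lambda values and keeps the one that performs best.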

The GLM booklet is a good reference on the details.

