Can we give a custom metric for cross-validation with GLM in H2O?
I am trying to use h2o.glm to find the best penalty lambda via cross-validation. It is a multinomial model. However, I see that the search optimizes the multinomial deviance. Can I cross-validate on some other metric instead, such as the misclassification error?
The documentation mentions a parameter custom_metric_func, but its description is not clear to me. Is that metric used as the cross-validation score? If so, the documentation also states that it is only available in the Python API. Is that true?
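For reference, a minimal sketch of the kind of call in question (the frame and column names are placeholders):
library(h2o)
h2o.init()
# lambda is chosen by cross-validation, but the CV score is the multinomial deviance
fit <- h2o.glm(x = predictor_cols,       # placeholder vector of predictor names
               y = "class_label",        # placeholder multinomial response column
               training_frame = train,   # placeholder H2OFrame
               family = "multinomial",
               lambda_search = TRUE,     # search over a path of lambda values
               nfolds = 5)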
If you are working with h2o and do not want to leave the R interface, a suitable option is to fit the models with keep_cross_validation_models = TRUE, keep_cross_validation_predictions = TRUE. From those models you can compute the per-class misclassification error for each value in a sequence of lambdas. You can loop (or lapply) over the sequence, for example for (i in lambda_vector) { models[[i]] <- h2o.glm(..., lambda = i) }. Each fitted model carries a confusion matrix, so you can compute the per-class classification error however you like and build your own selection criterion. The custom metric only works in Python.
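As a sketch of that loop (lambda_vector, x_cols, y_col and train are placeholder names; the 'err' row of the cross-validation metrics summary holds the mean per-fold misclassification error):
lambda_vector <- 10^seq(-1, -5, length.out = 20)
models <- list()
cv_err <- numeric(length(lambda_vector))
for (i in seq_along(lambda_vector)) {
  # Fit one model per lambda, keeping the CV models and predictions
  models[[i]] <- h2o.glm(x = x_cols, y = y_col, training_frame = train,
                         family = "multinomial",
                         lambda = lambda_vector[i],
                         nfolds = 5,
                         keep_cross_validation_models = TRUE,
                         keep_cross_validation_predictions = TRUE)
  # Mean misclassification error across the CV folds
  cv_err[i] <- as.numeric(
    models[[i]]@model$cross_validation_metrics_summary["err", "mean"])
}
best_lambda <- lambda_vector[which.min(cv_err)]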
If you can use plain R: to fit a multinomial model with an elastic-net penalty, and if there is no particular reason to stay with h2o, I suggest the glmnet package, which provides cv.glmnet() with the options family = "multinomial" and type.measure = "class". This yields a multinomial model whose lambda is chosen by cross-validation on the classification error.
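A minimal sketch of that approach (x_mat is a numeric predictor matrix and y_fac a factor response; both names are placeholders):
library(glmnet)
# Cross-validated multinomial elastic net, scored on classification error
cvfit <- cv.glmnet(x_mat, y_fac,
                   family = "multinomial",
                   type.measure = "class",  # select lambda by misclassification error
                   alpha = 0.5)             # elastic-net mixing parameter
best_lambda <- cvfit$lambda.min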
Following @Diegolog's suggestion, I wrote h2o.glm_custom as a "replacement" for h2o.glm that keeps the cross-validation models, so that a custom selection criterion can be applied afterwards. My approach uses h2o.grid. I tried to keep all of h2o.glm's parameters, but simplified some of the defaults to avoid extra work.
h2o.glm_custom <- function(x,
y,
training_frame,
model_id = NULL,
validation_frame = NULL,
nfolds = 0,
seed = -1,
keep_cross_validation_models = TRUE,
keep_cross_validation_predictions = FALSE,
keep_cross_validation_fold_assignment = FALSE,
fold_assignment = "AUTO",
fold_column = NULL,
random_columns = NULL,
ignore_const_cols = TRUE,
score_each_iteration = FALSE,
offset_column = NULL,
weights_column = NULL,
family = "binomial",
rand_family = c("[gaussian]"),
tweedie_variance_power = 0,
tweedie_link_power = 1,
theta = 1e-10,
solver = "AUTO",
alpha = 0,
early_stopping = TRUE,
nlambdas = 100,
standardize = TRUE,
missing_values_handling = "MeanImputation",
plug_values = NULL,
compute_p_values = FALSE,
remove_collinear_columns = FALSE,
intercept = TRUE,
non_negative = FALSE,
max_iterations = -1,
objective_epsilon = -1,
beta_epsilon = 1e-04,
gradient_epsilon = -1,
link = "family_default",
rand_link = "[identity]",
startval = NULL,
calc_like = FALSE,
HGLM = FALSE,
prior = -1,
lambda_min_ratio = 0.01,
beta_constraints = NULL,
max_active_predictors = -1,
obj_reg = -1,
export_checkpoints_dir = NULL,
balance_classes = FALSE,
class_sampling_factors = NULL,
max_after_balance_size = 5,
max_hit_ratio_k = 0,
max_runtime_secs = 0,
custom_metric_func = NULL) {
# Find lambda_max
model <- h2o.glm(x,
y,
training_frame,
model_id,
validation_frame,
nfolds,
seed,
keep_cross_validation_models,
keep_cross_validation_predictions,
keep_cross_validation_fold_assignment,
fold_assignment,
fold_column,
random_columns,
ignore_const_cols,
score_each_iteration,
offset_column,
weights_column,
family,
rand_family,
tweedie_variance_power,
tweedie_link_power,
theta,
solver,
alpha,
NULL, # lambda
TRUE, # lambda_search
early_stopping,
1, # nlambdas
standardize,
missing_values_handling,
plug_values,
compute_p_values,
remove_collinear_columns,
intercept,
non_negative,
max_iterations,
objective_epsilon,
beta_epsilon,
gradient_epsilon,
link,
rand_link,
startval,
calc_like,
HGLM,
prior,
lambda_min_ratio,
beta_constraints,
max_active_predictors,
obj_reg = obj_reg,
export_checkpoints_dir = export_checkpoints_dir,
balance_classes = balance_classes,
class_sampling_factors = class_sampling_factors,
max_after_balance_size = max_after_balance_size,
max_hit_ratio_k = max_hit_ratio_k,
max_runtime_secs = max_runtime_secs,
custom_metric_func = custom_metric_func)
lambda_max <- model@model$lambda_best
# Perform grid search on lambda, with logarithmic scale
lambda_min <- lambda_max * lambda_min_ratio
grid <- exp(seq(log(lambda_max), log(lambda_min), length.out = nlambdas))
# Wrap each lambda in its own list so h2o.grid treats it as a separate grid point
grid_list <- lapply(sapply(grid, list), list)
hyper_parameters <- list(lambda = grid_list)
result <- h2o.grid('glm',
x = x,
y = y,
training_frame = training_frame,
nfolds = nfolds,
family = family,
alpha = alpha,
ignore_const_cols = ignore_const_cols,
hyper_params = hyper_parameters,
seed = seed)
}
The following function can then be used to select lambda based on the misclassification error:
get_cv_means <- function(grid_results) {
mean_errors <- lapply(grid_results@model_ids, function(id) {
model <- h2o.getModel(id)
lambda <- model@parameters$lambda
err <- as.numeric(model@model$cross_validation_metrics_summary['err', 'mean'])
data.frame(lambda = lambda, error = err)
})
dt <- data.table::rbindlist(mean_errors)
data.table::setkey(dt, lambda)
dt
}
Here is a complete example that uses these functions to choose lambda by cross-validation on the misclassification error:
library(h2o)
h2o.init()
path <- system.file("extdata", "prostate.csv", package= "h2o")
h2o_df <- h2o.importFile(path)
h2o_df$CAPSULE <- as.factor(h2o_df$CAPSULE)
lambda_min_ratio <- 0.000001
nlambdas <- 100
nfolds <- 20
result <- h2o.glm_custom(x = c("AGE", "RACE", "PSA", "GLEASON"),
y = "CAPSULE",
training_frame = h2o_df,
family = "binomial",
alpha = 1,
nfolds = nfolds,
lambda_min_ratio = lambda_min_ratio,
nlambdas = nlambdas,
early_stopping = TRUE)
tbl <- get_cv_means(result)
which gives:
> head(tbl)
lambda error
1: 2.222376e-07 0.2264758
2: 2.555193e-07 0.2394541
3: 2.937851e-07 0.2380508
4: 3.377814e-07 0.2595451
5: 3.883666e-07 0.2478443
6: 4.465272e-07 0.2595603
which can be plotted, and so on:
library(ggplot2)
ggplot() + geom_line(data = tbl[lambda < 0.00001], aes(x = lambda, y = error))
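As a follow-up sketch, the lambda with the smallest mean cross-validated misclassification error can be read straight off the same table (data.table syntax):
best_lambda <- tbl[which.min(error), lambda]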