Tidymodels package: general linear models (glm) and decision tree (bagged trees, boosted trees, and random forest) models in R

Question

Issue

I am attempting to undertake an analysis using the Tidymodels package in R. I am following this tutorial on decision tree learning in R:

Tutorial

https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/

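I have a data frame called FID (see below) where the dependent variable is Frequency (numeric) and the predictor variables are Year (numeric), Month (factor), Monsoon (factor), and Days (numeric).

I believe I have successfully followed the tutorial named "Tidymodels: Decision Tree Learning in R" by building bagged tree, random forest, and boosted tree models.

For this analysis, I would also like to construct a general linear model (glm) so that I can compare all of the models (i.e. the random forest, bagged tree, boosted tree, and general linear models) and establish which fits best. All models are subjected to 10-fold cross-validation to decrease the bias of overfitting.

Problem

Subsequently, I tried to adapt the code from the tutorial (see below) to fit a glm model, but I am unsure whether I have adapted the model appropriately. I am not sure whether this part of the glm R code is what produces the error message below when I attempt to produce the rmse values after the models have all been fitted:

Error message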

Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.

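In addition, I am unsure whether the R code in the recipe() function for these models is adequate or correct. This matters because the recipe defines the preprocessing steps applied before each model is fitted. I would like to know whether the recipe for the models can be improved.

If possible, could anyone help me understand the error message produced when fitting the glm model, and correct the recipe if necessary?

I am not an advanced R coder, and I have made a thorough attempt to find a solution by studying other Tidymodels tutorials; however, I do not understand what this error message means. I would therefore be deeply grateful to anyone who can help.

Many thanks in advance.

R code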

##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)

###########################################################
# Split the single dataset into two: a training set (3/4 of the rows)
# and a testing set
data_split <- initial_split(fid_df, prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)

###########################################################
##Produce the recipe
##Preprocessing
############################################################

rec <- recipe(Frequency ~ ., data = fid_df) %>% 
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
             set_engine("rpart", times = 10) #10 bootstrap resamples
                

##Create workflow
wflow_bag <- workflow() %>% 
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE)
                      )

###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(), 
                              importance = "permutation", 
                              verbose = TRUE) %>% 
                              set_mode("regression") 
                              
##Create Workflow

wflow_rf <- workflow() %>% 
               add_model(mod_rf) %>% 
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE)
             )

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>% 
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
                      set_mode("regression")

##Create workflow

wflow_boost <- workflow() %>% 
                  add_recipe(rec) %>% 
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE)
                       )

##############################################
##Evaluate the models
##############################################

collect_metrics(fit_bag) %>% 
        bind_rows(collect_metrics(fit_rf)) %>%
          bind_rows(collect_metrics(fit_boost)) %>% 
            bind_rows(collect_metrics(fit_glm)) %>% 
              dplyr::filter(.metric == "rmse") %>% 
                dplyr::mutate(model = c("bag", "rf", "boost")) %>% 
                 dplyr::select(model, everything()) %>% 
                    knitr::kable()

#### Error message
#> Error: Problem with `mutate()` input `model`.
#> x Input `model` can't be recycled to size 4.
#> ℹ Input `model` is `c("bag", "rf", "boost")`.
#> ℹ Input `model` must be size 4 or 1, not 3.
#> Run `rlang::last_error()` to see where the error occurred.

#####################################################
##Out-of-sample performance
#####################################################

# bagged trees
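# `split` must be the rsplit object created by initial_split(),
# which is named data_split above
final_fit_bag <- last_fit(
                     wflow_bag,
                       split = data_split)
# random forest
final_fit_rf <- last_fit(
                  wflow_rf,
                    split = data_split)
# boosted trees
final_fit_boost <- last_fit(
                      wflow_boost,
                          split = data_split)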

Data frame - FID

structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Monsoon = structure(c(2L,
2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L,
3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon",
"Second_Inter_Monsoon", "South_Monsson"), class = "factor"),
    Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8,
    33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37,
    41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31,
    28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30,
    7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26,
    29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")

Answer 1

I believe the error from fitting the linear model comes from how Month and Monsoon are related to each other. They are perfectly correlated:

library(tidyverse) 

fid_df <- structure(list(
  Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
           2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,
           2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017),
  Month = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
                      1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
                      1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L),
                    .Label = c("January", "February", "March", "April", "May", "June",
                               "July", "August", "September", "October", "November",
                               "December"),
                    class = "factor"),
  Monsoon = structure(c(2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L,
                        2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L,
                        2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L),
                      .Label = c("First_Inter_Monssoon", "North_Monsoon",
                                 "Second_Inter_Monsoon", "South_Monsson"),
                      class = "factor"),
  Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33,
                33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30,
                44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38),
  Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31,
           31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27,
           31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)
), row.names = c(NA, -36L), class = "data.frame")


fid_df %>%
  count(Month, Monsoon)
#>        Month              Monsoon n
#> 1    January        North_Monsoon 3
#> 2   February        North_Monsoon 3
#> 3      March First_Inter_Monssoon 3
#> 4      April First_Inter_Monssoon 3
#> 5        May        South_Monsson 3
#> 6       June        South_Monsson 3
#> 7       July        South_Monsson 3
#> 8     August        South_Monsson 3
#> 9  September        South_Monsson 3
#> 10   October Second_Inter_Monsoon 3
#> 11  November Second_Inter_Monsoon 3
#> 12  December        North_Monsoon 3
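The same relationship can be checked programmatically rather than by reading the table. This is a minimal base R sketch, assuming a toy reconstruction of the two columns; `monsoon_by_month` and `toy` are names introduced here for illustration, and the level spellings are copied from the data as posted.

```r
# Lookup from month to monsoon season, copied from the count() table above.
monsoon_by_month <- c(
  January = "North_Monsoon",       February = "North_Monsoon",
  March = "First_Inter_Monssoon",  April = "First_Inter_Monssoon",
  May = "South_Monsson",           June = "South_Monsson",
  July = "South_Monsson",          August = "South_Monsson",
  September = "South_Monsson",     October = "Second_Inter_Monsoon",
  November = "Second_Inter_Monsoon", December = "North_Monsoon"
)

# Toy data frame: three years of monthly rows, as in the posted data.
toy <- data.frame(
  Month = factor(rep(month.name, times = 3), levels = month.name)
)
toy$Monsoon <- factor(monsoon_by_month[as.character(toy$Month)])

# Number of distinct Monsoon levels observed within each Month.
levels_per_month <- tapply(
  toy$Monsoon, toy$Month,
  function(x) length(unique(x))
)

# TRUE when every Month maps to exactly one Monsoon level, i.e. Monsoon
# is fully determined by Month.
month_determines_monsoon <- all(levels_per_month == 1)
print(month_determines_monsoon)
```

When this check is true, the Monsoon dummy columns are sums of Month dummy columns, so they add no new information to a linear model.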

If you use variables like this in a linear model, the model cannot find estimates for both sets of coefficients:

lm(Frequency ~ ., data = fid_df) %>% summary()
#> 
#> Call:
#> lm(formula = Frequency ~ ., data = fid_df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15.0008  -3.9357   0.6564   2.9769  12.7681 
#> 
#> Coefficients: (3 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)                 -7286.9226  3443.9292  -2.116   0.0459 *
#> Year                            3.6155     1.7104   2.114   0.0461 *
#> MonthFebruary                  -3.2641     6.6172  -0.493   0.6267  
#> MonthMarch                      0.1006     6.5125   0.015   0.9878  
#> MonthApril                      0.4843     6.5213   0.074   0.9415  
#> MonthMay                       -4.0308    11.0472  -0.365   0.7187  
#> MonthJune                       1.0135    15.5046   0.065   0.9485  
#> MonthJuly                      -2.6910    13.6106  -0.198   0.8451  
#> MonthAugust                    -4.9307     6.6172  -0.745   0.4641  
#> MonthSeptember                 -1.7105     7.1126  -0.240   0.8122  
#> MonthOctober                   -7.6981     6.5685  -1.172   0.2538  
#> MonthNovember                  -8.7484     6.5493  -1.336   0.1953  
#> MonthDecember                  -1.6981     6.5685  -0.259   0.7984  
#> MonsoonNorth_Monsoon                NA         NA      NA       NA  
#> MonsoonSecond_Inter_Monsoon         NA         NA      NA       NA  
#> MonsoonSouth_Monsson                NA         NA      NA       NA  
#> Days                            1.1510     0.4540   2.535   0.0189 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.968 on 22 degrees of freedom
#> Multiple R-squared:  0.8135, Adjusted R-squared:  0.7033 
#> F-statistic: 7.381 on 13 and 22 DF,  p-value: 2.535e-05

Created on 2020-11-18 by the reprex package (v0.3.0.9001)

Since you have this information, I recommend using some domain knowledge to decide whether to use Month or Monsoon in the model, but not both.
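One way to see the effect of dropping either variable is to compare the number of columns in the design matrix with its rank. The sketch below is illustrative only: it rebuilds the predictors from the posted data under the hypothetical names `toy` and `monsoon_by_month`, and `design_rank()` is a helper defined here, not a function from any package.

```r
# Lookup from month to monsoon season (spellings as in the posted data).
monsoon_by_month <- c(
  January = "North_Monsoon",       February = "North_Monsoon",
  March = "First_Inter_Monssoon",  April = "First_Inter_Monssoon",
  May = "South_Monsson",           June = "South_Monsson",
  July = "South_Monsson",          August = "South_Monsson",
  September = "South_Monsson",     October = "Second_Inter_Monsoon",
  November = "Second_Inter_Monsoon", December = "North_Monsoon"
)

# Predictors rebuilt from the posted data frame.
toy <- data.frame(
  Year  = rep(2015:2017, each = 12),
  Month = factor(rep(month.name, times = 3), levels = month.name),
  Days  = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31,
            31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27,
            31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)
)
toy$Monsoon <- factor(monsoon_by_month[as.character(toy$Month)])

# Hypothetical helper: column count and numerical rank of a design matrix.
design_rank <- function(formula, data) {
  X <- model.matrix(formula, data)
  c(columns = ncol(X), rank = qr(X)$rank)
}

full         <- design_rank(~ Year + Month + Monsoon + Days, toy)
month_only   <- design_rank(~ Year + Month + Days, toy)
monsoon_only <- design_rank(~ Year + Monsoon + Days, toy)

# A design is rank deficient when rank < columns.
print(rbind(full, month_only, monsoon_only))
```

A rank lower than the column count corresponds to the "not defined because of singularities" coefficients in the `lm()` summary; a design with only one of the two variables should not show that gap.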

Answer 2

Answer following Julia Silge's suggestion

#split this single dataset into two: a training set and a testing set
data_split <- initial_split(Tidy_df)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)

###########################################################
##Produce the recipe

rec <- recipe(Frequency_Blue ~ ., data = Tidy_df) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
             set_engine("rpart", times = 10) #10 bootstrap resamples
                

##Create workflow
wflow_bag <- workflow() %>% 
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE)
                      )

###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(), 
                              importance = "permutation", 
                              verbose = TRUE) %>% 
                              set_mode("regression") 
                              
##Create Workflow

wflow_rf <- workflow() %>% 
               add_model(mod_rf) %>% 
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE)
             )

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>% 
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
                      set_mode("regression")

##Create workflow

wflow_boost <- workflow() %>% 
                  add_recipe(rec) %>% 
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE)
                       )

##############################################
##Evaluate the models
##############################################

collect_metrics(fit_bag) %>% 
        bind_rows(collect_metrics(fit_rf)) %>%
          bind_rows(collect_metrics(fit_boost)) %>% 
            bind_rows(collect_metrics(fit_glm)) %>% 
              dplyr::filter(.metric == "rmse") %>% 
                dplyr::mutate(model = c("bag", "rf", "boost", "glm")) %>% 
                 dplyr::select(model, everything()) %>% 
                    knitr::kable()

##rmse values for all models

|model |.metric |.estimator |      mean|  n|  std_err|
|:-----|:-------|:----------|---------:|--:|--------:|
|bag   |rmse    |standard   |  8.929936| 10| 1.544587|
|rf    |rmse    |standard   | 10.188710| 10| 1.144354|
|boost |rmse    |standard   |  9.249904| 10| 1.489482|
|glm   |rmse    |standard   | 11.348420| 10| 2.217807|
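When reading the table above, it may help to look at the standard errors as well as the means. The following base R sketch re-enters the reported values and flags which models fall within one standard error of the lowest mean RMSE. The one-standard-error comparison is a rough heuristic chosen here for illustration, not something proposed in the original posts, and `rmse_tbl` is a name introduced for the example.

```r
# Values copied from the rmse table above.
rmse_tbl <- data.frame(
  model   = c("bag", "rf", "boost", "glm"),
  mean    = c(8.929936, 10.188710, 9.249904, 11.348420),
  std_err = c(1.544587, 1.144354, 1.489482, 2.217807)
)

# Rough interval: mean plus or minus one standard error.
rmse_tbl$lower <- rmse_tbl$mean - rmse_tbl$std_err
rmse_tbl$upper <- rmse_tbl$mean + rmse_tbl$std_err

# Sort so the lowest mean RMSE comes first.
rmse_tbl <- rmse_tbl[order(rmse_tbl$mean), ]
best <- rmse_tbl[1, ]

# Flag models whose lower bound does not exceed the best model's upper bound.
rmse_tbl$within_one_se <- rmse_tbl$lower <= best$upper
print(rmse_tbl)
```

With only 36 observations and 10 folds, overlapping intervals suggest the ranking between models should be treated with caution.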

#####################################################
##Out-of-sample performance
#####################################################
# `split` must be the rsplit object created by initial_split(),
# which is named data_split above

# glm
final_fit_glm <- last_fit(
                     wflow_glm,
                        split = data_split)

# bagged trees
final_fit_bag <- last_fit(
                     wflow_bag,
                       split = data_split)
# random forest
final_fit_rf <- last_fit(
                  wflow_rf,
                    split = data_split)
# boosted trees
final_fit_boost <- last_fit(
                      wflow_boost,
                          split = data_split)
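The `mutate()` error in the question arises because four rows remain after filtering for rmse (one per model) while only three labels were supplied. Instead of typing the labels in row order, the label can be attached to each metrics table before binding. Below is a minimal sketch using `dplyr::bind_rows()` with its `.id` argument. The `toy_metrics()` helper and its numbers are hypothetical placeholders standing in for `collect_metrics(fit_*)` output; they are not results from this post.

```r
library(dplyr)

# Hypothetical stand-in for collect_metrics(): one row per metric.
# The numbers passed in below are illustrative placeholders only.
toy_metrics <- function(rmse, rsq) {
  tibble(
    .metric    = c("rmse", "rsq"),
    .estimator = "standard",
    mean       = c(rmse, rsq)
  )
}

# A named list: the names become the values of the `model` column.
metrics_list <- list(
  bag   = toy_metrics(9.0, 0.60),
  rf    = toy_metrics(10.0, 0.55),
  boost = toy_metrics(9.5, 0.58),
  glm   = toy_metrics(11.0, 0.50)
)

# bind_rows(.id = "model") labels every row with its list name, so the
# labels cannot fall out of step with the number of models.
rmse_tbl <- bind_rows(metrics_list, .id = "model") %>%
  filter(.metric == "rmse")

print(rmse_tbl)
```

In the real pipeline, the list entries would be `collect_metrics(fit_bag)`, `collect_metrics(fit_rf)`, and so on; adding or removing a model then only requires changing the list.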

Solution 1
1 Accepted 2020-11-18 20:58:08

Solution 2
1 2020-11-19 02:31:43
