
R Multilevel Prediction in Tidymodels with Imbalanced Nested Data

Dear All,

I hope I can call on your expertise regarding a prediction task in R/Tidymodels. I intend to predict injuries in runners. The daily/weekly training data on which the predictions are based is nested within the individual runners over a timeframe of a few months. This made me consider multilevel models, specifically multilevel binary logistic regression (MLBLR).

As the data is also very imbalanced, I further tried to resample via SMOTE. Because half of the runners did not incur any injuries and the other half mostly incurred only one, I am additionally uncertain whether this undertaking can succeed: there will be zero or one injury instance per runner in the training set to base the resampling on, and consequently no injury instance in the training set at all for runners whose only injury falls in the test set. This most likely makes SMOTE resampling impossible.

So far, I have tried to predict injuries manually via an MLBLR without resampling, only adapting the prediction probability threshold; because of the imbalance, this produced only negative predictions. Understandably, I did not manage to resample via SMOTE in this scenario. Should I rather look at other methods, e.g. undersampling non-injury instances, or are there any specific resampling procedures (preferably synthetic data creation) for multilevel data that take the nested structure into account?
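For concreteness, this is roughly what I mean by the manual approach (a minimal sketch, not my actual code, assuming lme4::glmer; the predictors, the 0.2 threshold, and the data frame names follow the reproducible example further down and are illustrative only):

library(lme4)

# Multilevel binary logistic regression with a random intercept per runner
# (illustrative formula; predictors chosen arbitrarily for the sketch).
manual_fit <- glmer(NewRRI ~ Distance + HR + (1 | Runner),
                    data = RunningData_train, family = binomial)

# Predicted injury probabilities on the test set, then a lowered
# classification threshold instead of the default 0.5.
pred_prob  <- predict(manual_fit, newdata = RunningData_test,
                      type = "response", allow.new.levels = TRUE)
pred_class <- factor(ifelse(pred_prob > 0.2, 1, 0), levels = c(0, 1))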

I further tried to implement multilevel modelling in the preferred Tidymodels workflow, as resampling is also made easy there. First, I looked at the "multilevelmod" package, which adds multilevel engines (e.g. lme4) to the workflow. Second, I tried to make use of the many-models structure by nesting the data by runner and then applying a model to each runner. Unfortunately, I only got the latter method working: for the former, I most likely used the "stan_glmer" engine incorrectly (Code 1); the latter I got working with mixed results via simple oversampling (Screenshot 2). Third, I am not sure whether I should additionally look at fitting generalized linear models using mixed models via the embed package in Tidymodels.
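In case it clarifies what I mean by the many-models approach, a minimal sketch (again not my actual code) would look roughly like this, fitting a plain glm() per runner with the column names from the reproducible example further down:

library(tidyverse)

# Sketch only: nest the training data by runner and fit a separate
# (non-multilevel) logistic regression to each runner's rows.
per_runner_fits <- RunningData_train %>%
  nest(data = -Runner) %>%
  mutate(fit = map(data, ~ glm(NewRRI ~ Distance + HR,
                               family = binomial(), data = .x)))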

I would be very grateful to hear your take on this, specifically how to approach this issue of implementing a multilevel model + resampling in the Tidymodels workflow. Thank you very much in advance.

Kind regards!

Multilevelmod: https://github.com/tidymodels/multilevelmod
Many Models: https://r4ds.had.co.nz/many-models.html
Embed: https://embed.tidymodels.org/articles/Applications/GLM.html

Code 1:

# Model specification:
mlbr_mod <- logistic_reg() %>%
  set_engine("stan_glmer")

# Recipe:
mlbr_mod_recipe <- recipe(NewRRI ~ ., data = RunningData_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_string2factor(Runner) %>%
  step_smote(NewRRI, over_ratio = 0.5)

# Workflow:
mlbr_mod_workflow <- workflow() %>%
  add_recipe(mlbr_mod_recipe) %>%
  add_model(mlbr_mod, formula = NewRRI ~ . - Runner + (1 | Runner))

# Fit the model:
mlbr_mod_workflow %>% fit(data = RunningData_train)

# Train on the training set and test on the test set using last_fit():
mlbr_last_fit <- mlbr_mod_workflow %>%
  last_fit(RunningData_splits,
           metrics = metric_set(bal_accuracy, accuracy, f_meas, precision,
                                roc_auc, sensitivity, recall, kap))

# Performance on the test set:
mlbr_metrics <- mlbr_last_fit %>% collect_metrics()
mlbr_metrics

The code fails at the step where I try to fit the model. There it gives the error message that it can't subset columns that don't exist:

x Column 'Patient' doesn't exist.

The input data is structured like this:
Runner - factor: A1, A1, A1, B1, B1, B1, C1, C1, C1, ... (= IDs)
NewRRI - factor: 0, 0, 1, 0, 0, 0, 0, 0, 0, ...
Distance - numeric: 340, 500, 734, 110, 389, 766, 833, 420, 1100, ...
HR - numeric: 120, 110, 130, 142, 98, 112, 104, 117, 130, ...
Gender - factor: Male, Female, Male, Male, Male, Female, Male, Female, Female, ...
Age - numeric: 23, 36, 56, 35, 67, 24, 52, 39, 29, ...
BMI - numeric: 18, 20, 21, 25, 23, 24, 21, 22, 20, ...
PreviousRRI - factor: 0, 0, 1, 0, 0, 1, 1, 0, 0, ...

Edit (Reproducible Example):

Df <- tibble::tribble(
   ~year_week,~Runner,~NewRRI,~Distance, ~HR, ~Gender,~Age,~BMI,~PreviousRRI,
   "2019-41", "M01"  ,      0,     5000, 120,  "Male",  23,  18,        1,   
   "2019-41", "M02"  ,      0,     6000, 125,"Female",  36,  20,        0,
   "2019-41", "M03"  ,      0,     8000, 130,  "Male",  56,  21,        0,
   "2019-42", "M01"  ,      0,     5500, 122,  "Male",  23,  18,        1,
   "2019-42", "M02"  ,      0,     7000, 128,"Female",  36,  20,        0,
   "2019-42", "M03"  ,      0,    15000, 132,  "Male",  56,  21,        0,
   "2019-43", "M01"  ,      1,     3000, 120,  "Male",  23,  18,        1,
   "2019-43", "M02"  ,      0,     9000, 127,"Female",  36,  20,        0,
   "2019-43", "M03"  ,      0,     9500, 131,  "Male",  56,  21,        0,
   "2019-44", "M01"  ,      0,    15000, 125,  "Male",  23,  18,        1,
   "2019-44", "M02"  ,      0,     9000, 127,"Female",  36,  20,        0,
   "2019-44", "M03"  ,      0,     9500, 131,  "Male",  56,  21,        0,
  ) %>%
  mutate(Gender = as.factor(Gender),
         PreviousRRI = as.factor(PreviousRRI),
         NewRRI = as.factor(NewRRI),
         Runner = as.factor(Runner))

library(tidyverse)
library(tidymodels)
library(multilevelmod)
library(themis)


Df <- Df %>% arrange(year_week)
Df_splits <- initial_time_split(Df, prop = 0.8)
RunningData_train <- training(Df_splits)
RunningData_test <- testing(Df_splits)

# Now apply original code 

I get the following error message:

Error: All columns selected for the step should be numeric
I am not sure what I need to change in the code to avoid this error message.

In case this does not work, the alternative many-models approach most likely also has its limits: the runners are nested with nest() and the respective models are mapped to each nested runner separately, so each model only sees that runner's few observations. I therefore guess this is not a viable strategy to imitate the multilevel structure either?

Lastly, I found this article on "SMOTE-NC / SMOTE-ENC", which should make applying the SMOTE algorithm possible, though most likely with some drawbacks, as the new synthetic data is added under existing Runner/Patient IDs:

https://arxiv.org/abs/2103.07612
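In case it helps, this is roughly how I imagine that would look as a recipe step (a sketch only, assuming a themis version that provides step_smotenc(); I have not checked whether it runs on data as small as the example):

library(themis)

# step_smotenc() handles nominal predictors directly, so no dummy step is
# needed before it. Note that the synthetic rows still carry existing
# Runner levels, which is the drawback mentioned above.
smotenc_recipe <- recipe(NewRRI ~ ., data = RunningData_train) %>%
  step_smotenc(NewRRI, over_ratio = 0.5)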

Thank you again for your help and consideration; it is highly appreciated. Kind regards!

It looks like the order of the steps in your recipe is causing the problem. step_dummy() comes first, so it changes 'Runner' into Runner_M02 and Runner_M03 before the later steps can use it.

Flipping the order will allow you to use that column. Whether you have enough samples for SMOTE is a different issue; maybe you can use step_upsample() or step_downsample() instead.

Also, it looks like your columns are already factors, so you don't actually need the step_string2factor() step at all.
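For illustration, one possible reordering (a sketch only, not tuned for the best sampling ratio): keep Runner out of step_dummy() so it stays available for the (1 | Runner) term, drop step_string2factor(), and use step_upsample() in place of step_smote():

library(themis)

reordered_rec <- recipe(NewRRI ~ ., data = RunningData_train) %>%
  step_dummy(all_nominal_predictors(), -Runner) %>%  # leave Runner intact for (1 | Runner)
  step_upsample(NewRRI, over_ratio = 1)              # simple oversampling instead of SMOTE

reordered_rec %>% prep() %>% juice() %>% glimpse()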

I've never used the stan_glmer engine, so I made this example showing bootstraps using lmer (and changed the outcome to a numeric variable).

model_spec <- linear_reg() %>% set_engine("lmer")

model_rec <- recipe(Distance ~ ., data = RunningData_train) %>% 
  step_dummy(all_nominal_predictors()) # %>%
  # step_smote(NewRRI, over_ratio = 0.5)  # SMOTE would work with a larger dataset

model_rec %>% prep() %>% juice() %>% glimpse()


mixed_model_wf <- workflow() %>%
  add_model(model_spec, formula =  Distance ~ . -Runner + (1|Runner)) %>%
  add_variables(outcomes = Distance, predictors = colnames(RunningData_train %>% dplyr::select(-Distance))) 


fit1 <- fit(mixed_model_wf, RunningData_train)

boots <- bootstraps(Df)

fit2_boots <- fit_resamples(mixed_model_wf, boots)

fit2_boots %>% collect_metrics()
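And for completeness, an untested sketch of how the same workflow pattern might be carried back to your binary outcome with the stan_glmer engine from multilevelmod (I haven't run this, so treat it as a starting point only):

library(multilevelmod)

# Hypothetical adaptation to the binary outcome (untested):
class_spec <- logistic_reg() %>% set_engine("stan_glmer")

class_wf <- workflow() %>%
  add_model(class_spec, formula = NewRRI ~ Distance + HR + (1 | Runner)) %>%
  add_variables(outcomes = NewRRI, predictors = c(Distance, HR, Runner))

class_fit <- fit(class_wf, RunningData_train)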
