
Producing a more efficient for loop

I have created a function which applies a Cox regression model to test data, creates survival functions based on covariates, and then predicts the survival probability 30 days from current time for each test observation.

The example below uses the lung dataset and works quite well. However, applied to my own data the processing time is prohibitive: for n = 60000 I stopped it after an hour, as that is not practical for what I intend to use the program for.

Looking at the code structure, is there an obvious way I can speed this up?

library(dplyr)
library(survival)
library(pec)

cox_model <- coxph(Surv(time, status) ~ sex, data = lung)

surv_preds <- function(model, query) {

  prediction <- vector(mode = "numeric", length = nrow(query))
  time <- 30

  # one call per row: survival probability at that row's time + 30 days
  for(i in 1:nrow(query)) {
    prediction[i] <- predictSurvProb(model, newdata = query[i, ],
                                     times = query[i, "time"] + time)
  }
  prediction
}

surv_preds(cox_model, lung)
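A note on why the loop is slow (my reading, not stated in the original post): each iteration makes a separate predictSurvProb call for a single row, so the fixed setup cost of a prediction is paid nrow(query) times. A quick, hypothetical check is to time repeated single-row calls and compare against the full loop:

# Hypothetical check (not from the original post): time 100 single-row calls.
# If n times the per-call cost roughly matches the loop's total runtime, the
# per-call overhead, not the arithmetic, is the bottleneck.
system.time(
  replicate(100, predictSurvProb(cox_model,
                                 newdata = lung[1, ],
                                 times = lung[1, "time"] + 30))
)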

An alternative route is the by_row function from purrrlyr, used alongside the packages already loaded in the question.

library(purrrlyr)

prediction <- lung %>%
  mutate(time = time + 30) %>%
  by_row(~predictSurvProb(cox_model, newdata = ., times = .$time)) %>%
  .$.out %>%
  unlist

It is tidier and produces the same result; however, early runs of microbenchmark (the call is sketched below) don't show an improvement in processing time.
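The benchmark call itself is not shown above; a minimal sketch of how the two approaches can be compared, with the purrrlyr pipeline wrapped in a helper function (purrr_preds is an assumed name, not from the original):

library(microbenchmark)

# Assumed wrapper around the purrrlyr pipeline above, so it can be
# benchmarked side by side with the loop-based surv_preds().
purrr_preds <- function(model, query) {
  query %>%
    mutate(time = time + 30) %>%
    by_row(~predictSurvProb(model, newdata = ., times = .$time)) %>%
    .$.out %>%
    unlist()
}

microbenchmark(
  surv_preds(cox_model, lung),
  purrr_preds(cox_model, lung),
  times = 100
)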

Results:

# Unit: seconds
#                         expr      min       lq     mean   median       uq      max neval
#  surv_preds(cox_model, lung) 1.531631 1.561518 1.594310 1.574664 1.591117 2.157002   100
#  (purrrlyr)                  1.841713 1.887438 1.921371 1.904740 1.926490 2.170205   100

This would still have both solutions running for over 24 hours on the hardware I'm using. Given that this answer no longer seems to solve your problem, and that I am unfamiliar with parallel processing options, I'm happy to delete it unless anyone finds value in keeping it here.
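For completeness (my addition, not part of the original answer): the row-by-row loop also parallelises naturally, since each prediction is independent. A sketch using the base parallel package, assuming a Unix-alike where mclapply can fork (on Windows a PSOCK cluster with parLapply would be needed):

library(parallel)

# A sketch only: split the rows across cores and run the same per-row
# prediction in parallel; surv_preds_parallel, horizon and cores are
# assumed names, not from the original post.
surv_preds_parallel <- function(model, query, horizon = 30, cores = 4) {
  probs <- mclapply(seq_len(nrow(query)), function(i) {
    predictSurvProb(model, newdata = query[i, , drop = FALSE],
                    times = query[i, "time"] + horizon)
  }, mc.cores = cores)
  unlist(probs)
}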

SOLVED!! In case it is of interest, I want to post the solution that I used. I managed to remove the need for the for loop entirely.

diag(predictSurvProb(cox_model,
                     newdata = lung,
                     times = lung[, "time"] + 30))

This gives me the output I require. The key is that predictSurvProb returns a matrix with one row per observation in newdata and one column per requested time, so its diagonal holds, for each observation, the survival probability 30 days on from that observation's current position on its own survival curve. These are the same values the loop produced, obtained in a single call.
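One practical caveat worth adding (my note, not in the original): a single vectorised call like this builds an n x n matrix before the diagonal is extracted, which for n = 60000 would need roughly 29 GB of memory. A sketch of a chunked variant that keeps the vectorised speed while bounding memory (surv_preds_chunked, horizon and block_size are assumed names):

# A sketch only: call predictSurvProb on blocks of rows so that each call
# builds a block_size x block_size matrix rather than an n x n one.
surv_preds_chunked <- function(model, query, horizon = 30, block_size = 1000) {
  n <- nrow(query)
  prediction <- numeric(n)
  for (start in seq(1, n, by = block_size)) {
    idx <- start:min(start + block_size - 1, n)
    block <- query[idx, , drop = FALSE]
    probs <- predictSurvProb(model, newdata = block,
                             times = block[, "time"] + horizon)
    # row i of probs is block observation i; its own time sits on the diagonal
    prediction[idx] <- diag(probs)
  }
  prediction
}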

The answer from @thc actually pointed me in the right direction in the end.

You don't need to predict one row at a time; you can do it all at once, e.g.:

cox_model <- coxph(Surv(time, status) ~ sex, data = lung)

surv_preds <- function(model, query) {

  prediction <- vector(mode = "numeric", length = nrow(query))
  time <- 30

  for(i in 1:nrow(query)) {
    prediction[i] <- predictSurvProb(model, newdata = query[i, ], times = query[i, "time"] + time)
  }
  prediction
}

surv_preds2 <- function(model, query) {

  time <- 30

  # a single vectorised call: returns an nrow(query) x nrow(query) matrix
  # (one row per observation, one column per requested time)
  prediction <- predictSurvProb(model, newdata = query,
                                times = query[, "time"] + time)
  prediction
}
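As a quick check (my addition, not part of the original answer), the diagonal of the matrix returned by surv_preds2 should reproduce the vector that the row-by-row surv_preds builds:

# Sanity check: each observation's own time sits on the diagonal of the
# matrix, so the two functions should agree element for element.
all.equal(surv_preds(cox_model, lung),
          unname(diag(surv_preds2(cox_model, lung))))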


microbenchmark(surv_preds(cox_model, lung), surv_preds2(cox_model, lung), times=5)

Results:

Unit: milliseconds
                         expr       min         lq      mean     median         uq        max neval cld
  surv_preds(cox_model, lung) 1017.5587 1031.58422 1056.7026 1062.30476 1072.33865 1099.72672     5   b
 surv_preds2(cox_model, lung)   30.3567   30.78582   35.7851   31.81206   33.00534   52.96559     5  a 
