I have created a function which applies a Cox regression model to test data, creates survival functions based on covariates, and then predicts the survival probability 30 days from current time for each test observation.
The example below uses the lung dataset and works quite well. However, applied to my own data the processing time is prohibitive: for n = 60,000 I stopped it after an hour, as that is not practical for what I intend to use the program for.
Looking at the code structure, is there an obvious way I can speed this up?
library(dplyr)
library(survival)
library(pec)
cox_model <- coxph(Surv(time, status) ~ sex, data = lung)
surv_preds <- function(model, query) {
  prediction <- vector(mode = "numeric", length = nrow(query))
  time <- 30
  # For each observation, predict its survival probability 30 days
  # past its own observed time
  for (i in 1:nrow(query)) {
    prediction[i] <- predictSurvProb(model, newdata = query[i, ], times = query[i, "time"] + time)
  }
  prediction
}
surv_preds(cox_model, lung)
An alternative route is the by_row function from purrrlyr, used alongside the packages in the question.
library(purrrlyr)
prediction <- lung %>%
  mutate(time = time + 30) %>%
  by_row(~predictSurvProb(cox_model, newdata = ., times = .$time)) %>%
  .$.out %>%
  unlist()
It is tidier and produces the same result; however, early microbenchmark runs show no improvement in processing time.
# Unit: seconds
# expr min lq mean median uq max neval
# surv_preds(cox_model, lung) 1.531631 1.561518 1.59431 1.574664 1.591117 2.157002 100
# (purrrlyr) 1.841713 1.887438 1.921371 1.90474 1.92649 2.170205 100
Extrapolated, both solutions would run for over 24 hours on the hardware I'm using. Since this answer no longer seems to solve your problem and I am unfamiliar with parallel processing options, I'm happy to delete it unless anyone finds value in keeping it here.
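For anyone who does want to try parallelising the loop, here is a minimal sketch using parallel::mclapply to run chunks of the query on separate cores. The core count and chunking are illustrative assumptions, and mclapply relies on forking, so this runs on Unix-like systems only.

```r
library(parallel)

surv_preds_par <- function(model, query, cores = 4) {
  time <- 30
  # Split row indices into one chunk per core
  idx <- split(seq_len(nrow(query)), cut(seq_len(nrow(query)), cores))
  res <- mclapply(idx, function(i) {
    q <- query[i, ]
    # Same per-row prediction as the loop version, within each chunk
    sapply(seq_len(nrow(q)), function(j)
      predictSurvProb(model, newdata = q[j, ], times = q[j, "time"] + time))
  }, mc.cores = cores)
  unlist(res, use.names = FALSE)
}
```

This only divides the existing per-row cost across cores, so the speed-up is at best linear in the number of cores.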
SOLVED! In case it is of interest, here is the solution I used. It removes the need for the for loop entirely.
predictSurvProb(cox_model,
newdata = lung,
times = lung[ , "time"] + 30)[1, ]
This gives me the output I require. The key is that I index the first row and all its columns from the resulting matrix. This code uses the unique survival function estimate for each observation to predict the survival probability 30 days from the observation's current position on the curve.
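For reference, this is the shape of the object that predictSurvProb returns in the lung example above: a matrix with one row per newdata observation and one column per requested time, so element [i, j] is the predicted survival probability of observation i at times[j].

```r
p <- predictSurvProb(cox_model,
                     newdata = lung,
                     times = lung[, "time"] + 30)
dim(p)  # nrow(lung) rows (observations) x nrow(lung) columns (times)
```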
The answer from @thc actually pointed me in the right direction in the end.
You don't need to predict one row at a time; you can do it all at once. E.g.:
cox_model <- coxph(Surv(time, status) ~ sex, data = lung)
surv_preds <- function(model, query) {
  prediction <- vector(mode = "numeric", length = nrow(query))
  time <- 30
  for (i in 1:nrow(query)) {
    prediction[i] <- predictSurvProb(model, newdata = query[i, ], times = query[i, "time"] + time)
  }
  prediction
}
surv_preds2 <- function(model, query) {
  time <- 30
  # One vectorised call: returns a matrix with one row per observation
  # and one column per requested time
  prediction <- predictSurvProb(model, newdata = query, times = query[, "time"] + time)
  prediction
}
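If you need one probability per observation, matching the vector the loop version returns, note that element [i, i] of the matrix is observation i evaluated at its own time + 30, so the diagonal recovers the loop's output. A minimal sketch:

```r
surv_preds3 <- function(model, query) {
  time <- 30
  # Full matrix: rows = observations, columns = requested times
  p <- predictSurvProb(model, newdata = query, times = query[, "time"] + time)
  # Element [i, i] is observation i at its own shifted time
  diag(p)
}
```

Comparing with all.equal(surv_preds(cox_model, lung), surv_preds3(cox_model, lung)) should confirm the two agree.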
library(microbenchmark)
microbenchmark(surv_preds(cox_model, lung), surv_preds2(cox_model, lung), times = 5)
Results:
Unit: milliseconds
expr min lq mean median uq max neval cld
surv_preds(cox_model, lung) 1017.5587 1031.58422 1056.7026 1062.30476 1072.33865 1099.72672 5 b
surv_preds2(cox_model, lung) 30.3567 30.78582 35.7851 31.81206 33.00534 52.96559 5 a