
Calibrating a Cox PH model with R packages 'survival' and 'rms': time unit confusion

I built a Cox proportional hazards model with the R package "rms" and am trying to cross-validate it. I would like to split the data into training and test sets, but I'm new to survival analysis and the only tool I can find in the literature is rms::calibrate, which I can't get to work.

Here is the code:

library(survival)
library(rms)

# Survival object: follow-up time in years (2000 to 2020) and
# status (1 = event, 0 = survival/censoring)
surv2 <- Surv(grid3@data$def_mean, grid3@data$status)
d <- datadist(grid3@data)  # stores distribution summaries of the candidate predictors
options(datadist = "d")    # tells rms functions which datadist object to use
model <- cph(surv2 ~ cost_mean + elev_mean + popn_mean + cop99 + PAs_mean,
             data = grid3@data, x = TRUE, y = TRUE, surv = TRUE, time.inc = 1)
modrms <- rms::calibrate(model, B = 40, u = 1)

'time.inc' is the time increment (1 year). Looking at model$surv.summary, I can see survival and 'no. at risk' figures for each of the 20 years, so that makes sense. But when I call rms::calibrate, the first message I get is "Using Cox survival estimates at 1 Days", and looking at the calibration I get:

> summary(attr(modrms,"predicted"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

...so it looks like the model has been calibrated over a period of 1 day, and of course everyone survived (1 = 100%). The same thing happens with rms::calibrate(model, B = 40, u = 20).

I tried again starting with:

units(grid3@data$def_mean) <- "year"
surv3 <- Surv(grid3@data$def_mean, grid3@data$status)

...but that gives me an error!

Error in Ops.units(time, origin) : 
  both operands of the expression should be "units" objects

I don't know what to try next. Wouldn't it be great if I could just build a model with the data from 2000-10, use that to make predictions for 2010-20, and look at predicted vs actual? But I'm stuck with calibration, and the documentation assumes more statistical expertise than mine (college stats plus efforts to improve my math).

Here is the data structure (not sure how I can make this reproducible):

> str(grid3@data)
'data.frame':   36918 obs. of  7 variables:
 $ def_mean  : num  20 20 20 20 20 20 20 20 20 20 ...
 $ status    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ elev_mean : num  -0.664 -0.664 -0.664 -0.664 -0.664 ...
 $ popn_mean : num  -0.1658 0.0664 -0.1484 0.0601 -0.0381 ...
 $ cost_mean : num  1.53 1.48 1.43 1.66 1.6 ...
 $ PAs_mean  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ cop99     : Factor w/ 12 levels "10","20","30",..: 5 5 5 5 5 5 5 5 5 5 ...

> summary(grid3@data$def_mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.20   20.00   20.00   19.59   20.00   20.00

> table(grid3@data$status)

    0     1 
34696  2222

The rms package can have a steep learning curve in some respects. You've already picked up on one thing that is sometimes missed: the importance of using datadist() to summarize the predictors, and then setting the datadist option (with the character name of the datadist object) so that summary functions have reasonable default choices for display.

With respect to the second error, I wonder whether you forgot to re-run datadist() and reset the datadist option after you changed the time unit. The units() and label() functions in the rms-associated Hmisc package can be very useful, but if you don't re-run datadist() and reset the option after using them, I suspect that things get confusing for the software downstream. If you specify a unit in one place, it will probably expect the same unit in another place.

Those commands don't do any transformations, however. The default assumption is that the time unit is "day," so that's what gets printed by default in the outputs. If you change the "units" to "year," printouts will show "year" instead of "day" but the underlying calculations won't change.
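To see for yourself that the unit is only a label, here is a minimal sketch on simulated data. The data frame `toy`, its columns, and the simulation settings are all made up for illustration. I call `Hmisc::units<-` with an explicit namespace on the assumption (not confirmed by your post) that the `Ops.units` error came from a different attached package that also defines `units<-`.

```r
library(survival)
library(rms)   # also attaches Hmisc, which supplies units() and label()

set.seed(42)
n <- 400
toy <- data.frame(x1 = rnorm(n))                      # hypothetical predictor
t_event <- rexp(n, rate = 0.05 * exp(0.5 * toy$x1))   # simulated event times
toy$time   <- pmin(t_event, 20)                       # censor everyone at time 20
toy$status <- as.numeric(t_event <= 20)               # 1 = event, 0 = censored

# Fit with no unit attribute on the time variable
fit_day <- cph(Surv(time, status) ~ x1, data = toy,
               x = TRUE, y = TRUE, surv = TRUE, time.inc = 5)

# Label the very same numbers as years and fit again
toy_yr <- toy
Hmisc::units(toy_yr$time) <- "Year"
fit_year <- cph(Surv(time, status) ~ x1, data = toy_yr,
                x = TRUE, y = TRUE, surv = TRUE, time.inc = 5)

# The estimated coefficients should agree; only the printed unit differs
all.equal(unname(coef(fit_day)), unname(coef(fit_year)))
```

If you really wanted the time axis in different units you would have to rescale the time variable yourself (for example, multiply years by 365.25); setting the label will not do it.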

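So although calibrate() first claimed to be calculating at "1 Days", it wasn't really; that was just its default unit for printing. It still did the calibration at time = 1, which on your scale is 1 year. Your shortest follow-up time is 3.2, so nobody has had an event by time 1 and a predicted survival of 1 for everyone is what you would expect there. Calibration at such an early time is probably not what you want.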

I vaguely remember having some problems if the time.inc setting in the original cph() call didn't match the u setting in the calibrate() call. My usual practice is to know the time point at which I want to calibrate (e.g., 3-year survival for some types of cancer data) and use that for both of those settings. Play a bit with a toy data set to see how to make that work for you.

Finally, calibrate() is best used with plot() to display the calibration curves (ideal, modeled, optimism-corrected by bootstrap). There might be a glitch if you try to print() the calibrate object. The values displayed on the standard plot are correct.
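As a starting point for that kind of experiment, here is a minimal sketch on simulated data that uses one time point for both time.inc and u and then plots the result. The data frame `toy`, the predictors, and the choice of u = 10 are invented for illustration. I use cmethod = "KM" with m = 100 subjects per group only so that the example does not need the polspline package that the default method relies on. As I read the calibrate() help page, it asks for surv = TRUE and time.inc = u on cph fits, but check that against your installed version.

```r
library(survival)
library(rms)

set.seed(1)
n <- 600
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))       # hypothetical predictors
t_event <- rexp(n, rate = 0.04 * exp(0.6 * toy$x1 - 0.3 * toy$x2))
toy$time   <- pmin(t_event, 20)                       # follow-up ends at time 20
toy$status <- as.numeric(t_event <= 20)               # 1 = event, 0 = censored

u <- 10   # the single time point used for both time.inc and u

fit <- cph(Surv(time, status) ~ x1 + x2, data = toy,
           x = TRUE, y = TRUE, surv = TRUE, time.inc = u)

# Bootstrap calibration at time u, comparing predictions with
# Kaplan-Meier estimates in groups of about m subjects
cal <- calibrate(fit, u = u, cmethod = "KM", m = 100, B = 40)

# Read the calibration off the plot rather than from print(cal)
plot(cal)
```

With your own data you would replace `toy` with grid3@data and pick a u that lies well inside the observed follow-up, for example somewhere between 5 and 20 years.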
