Calibrating a Cox PH model with R packages 'survival' and 'rms': time unit confusion

I built a Cox proportional hazards model with the R package "rms" and am trying to cross-validate it. What I would like to do is split the data into training and test sets, but I'm new to survival analysis and can't find anything in the literature except rms::calibrate, and I can't get it to work.

Here is the code:

library(survival)
library(rms)

# using the package 'survival', I make a survival object with
# follow-up time (2000 to 2020) and status (event = 1, survival/censoring = 0)
surv2 <- Surv(grid3@data$def_mean, grid3@data$status)
d <- datadist(grid3@data)  # stores distribution summaries for potential variables??
options(datadist = "d")    # seems to help cph() refer to variables
model <- cph(surv2 ~ cost_mean + elev_mean + popn_mean + cop99 + PAs_mean,
             data = grid3@data, x = TRUE, y = TRUE, surv = TRUE, time.inc = 1)
modrms <- rms::calibrate(model, B = 40, u = 1)

'time.inc' is the time increment (1 year). Looking at model$surv.summary, I can see survival and 'no. at risk' figures for each of the 20 years, so that makes sense. But when I call rms::calibrate, the first message I get is "Using Cox survival estimates at 1 Days", and looking at the calibration I get:

> summary(attr(modrms,"predicted"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

...so it looks like the model has been calibrated over a period of 1 day? And of course everyone survived (1 = 100%). The same thing happens with rms::calibrate(model, B = 40, u = 20).

I tried again, starting with:

units(grid3@data$def_mean) <- "year"
surv3 <- Surv(grid3@data$def_mean, grid3@data$status)

...but that gives me an error:

Error in Ops.units(time, origin) : 
  both operands of the expression should be "units" objects

I don't know what to try next. Wouldn't it be great if I could just build a model with the data from 2000-10, use it to make predictions for 2010-20, and compare predicted with actual? But I'm stuck on calibration, and the documentation assumes more statistical expertise than I have (college statistics plus my own efforts to improve my math).

Here is the data structure (I'm not sure how to make this reproducible):

> str(grid3@data)
'data.frame':   36918 obs. of  7 variables:
 $ def_mean  : num  20 20 20 20 20 20 20 20 20 20 ...
 $ status    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ elev_mean : num  -0.664 -0.664 -0.664 -0.664 -0.664 ...
 $ popn_mean : num  -0.1658 0.0664 -0.1484 0.0601 -0.0381 ...
 $ cost_mean : num  1.53 1.48 1.43 1.66 1.6 ...
 $ PAs_mean  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ cop99     : Factor w/ 12 levels "10","20","30",..: 5 5 5 5 5 5 5 5 5 5 ...

> summary(grid3@data$def_mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.20   20.00   20.00   19.59   20.00   20.00

> table(grid3@data$status)

    0     1 
34696  2222

The rms package can have a steep learning curve in some respects. You've already picked up on one thing that is sometimes missed: the importance of using datadist() to summarize the predictors, and then setting the datadist option (with the character name of the datadist object) so that summary functions have reasonable default choices for display.

With respect to the second error, I wonder whether you forgot to re-run the datadist() command and reset the datadist option after you changed the time unit. The units() and label() functions in the rms-associated Hmisc package can be very useful, but if you don't re-run datadist() and reset the option after using them, I suspect that things get confusing for the software downstream. If you specify a unit in one place, it will probably expect the same unit in another place.

Those commands don't do any transformations, however. The default assumption is that the time unit is "day," so that's what gets printed by default in the outputs. If you change the units to "year," printouts will show "year" instead of "day," but the underlying calculations won't change.
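To illustrate the point that the unit is only a label, here is a minimal sketch. It assumes only the Hmisc package is attached (so that its units<- method is the one used), and the vectors follow_up and days are made-up values, not taken from the question's data.

```r
library(Hmisc)

# Hypothetical follow-up times, already measured in years
follow_up <- c(3.2, 12.5, 20, 20)
raw_values <- follow_up

# Attach a unit label; this sets a "units" attribute and nothing else
units(follow_up) <- "year"

attr(follow_up, "units")                  # the label that printouts pick up
all(as.numeric(follow_up) == raw_values)  # the numbers themselves are untouched

# A real change of scale has to be done by hand,
# e.g. for hypothetical times recorded in days:
days <- c(365.25, 730.5)
years <- days / 365.25
units(years) <- "year"
```

If another attached package also defines units<- (the "units" package does, for example), a different method may be dispatched, so it can be worth checking which packages are loaded before relying on this.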

So although calibrate() first claimed to be calculating at "1 Days," it wasn't really; that was just its default unit for printing. It still did the calibration at time = 1, which in your data means 1 year. Calibration at such an early time is probably not what you want.

I vaguely remember having some problems when the time.inc setting in the original cph() call didn't match the u setting in the calibrate() call. My usual practice is to decide on the time point at which I want to calibrate (e.g., 3-year survival for some types of cancer data) and use that value for both settings. Play a bit with a toy data set to see how to make that work for you.

Finally, calibrate() is best used with plot() to display the calibration curves (ideal, modeled, and optimism-corrected by bootstrap). There might be a glitch if you try to print() the calibrate object. The values displayed on the standard plot are correct.
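Putting the last two points together, here is a rough sketch of the "toy data set" approach: one time point is used for both time.inc and u, and the result is plotted rather than printed. Everything about the data is an assumption made for illustration: the data frame toy, its columns (x1, x2, fu_years, event), the simulated hazard, and the 10-year calibration point are all hypothetical. The choice of cmethod = "KM" with m = 100 is also mine, made so the sketch does not rely on the polspline package that the default "hare" method uses.

```r
library(survival)
library(rms)

set.seed(1)
n <- 500

# Simulated (hypothetical) data: two predictors, event times in years,
# administrative censoring at 20 years of follow-up
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
event_time   <- rexp(n, rate = 0.05 * exp(0.5 * toy$x1))
toy$fu_years <- pmin(event_time, 20)
toy$event    <- as.integer(event_time <= 20)

# Label the time unit first, then (re-)run datadist and set the option
units(toy$fu_years) <- "year"
dd <- datadist(toy)
options(datadist = "dd")

# One calibration time point, used for both time.inc and u
u_years <- 10

fit <- cph(Surv(fu_years, event) ~ x1 + x2, data = toy,
           x = TRUE, y = TRUE, surv = TRUE, time.inc = u_years)

# Bootstrap calibration at u_years, grouping about 100 observations per interval
cal <- calibrate(fit, u = u_years, B = 40, cmethod = "KM", m = 100)

# Inspect the calibration graphically instead of printing the object
plot(cal)
```

With the unit label set before fitting, the calibrate() message should refer to years rather than days, but the calibration itself is done at time = 10 on whatever scale fu_years is recorded in.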
