
Implausibly wide confidence intervals produced by survfit() for interval censored data

I have data generated by intermittent interviews in which an individual is asked whether they are experiencing a certain symptom. The last time each individual was known not to have the symptom is denoted tstart. If applicable, the time at which the individual was observed to have developed the symptom is tstop. Using the survival package in R, I create a survival object with the Surv function, specifying that the data are interval censored. I would like a non-parametric maximum likelihood estimate of the survival function. This can be obtained with the survfit function, which seems to pass the call to an internal function, survfitTurnbull. The resulting confidence intervals are implausibly wide, and I am unable to figure out why.

library(survival)

# A random sample of the data, produced with dput()
temp_df <- structure(list(tstart = c(0.01, 38, 0.01, 0.01, 23, 26, 0.01, 
19, 0.01, 0.01, 22, 6, 0.01, 14, 16, 0.01, 0.01, 0.01, 0.01, 
21, 15, 0.01, 0.01, 13, 10, 0.01, 0.01, 19, 0.01, 0.01, 0.01, 
0.01, 22, 17, 27, 14, 16, 0.01, 20, 27, 10, 0.01, 0.01, 16, 20, 
7, 6, 15, 0.01, 0.01), tstop = c(4.01, NA, 5.01, 8.01, NA, NA, 
5.01, NA, 3.01, 16.01, NA, 6.01, 8.01, NA, NA, 7.01, 16.01, 1.01, 
10.01, NA, NA, 5.01, 8.01, NA, NA, 2.01, 3.01, NA, 7.01, 5.01, 
2.01, 9.01, NA, NA, NA, NA, NA, 10.01, NA, NA, NA, 5.01, 10.01, 
NA, NA, NA, 7.01, NA, 14.01, 4.01)), row.names = c(NA, -50L), class = "data.frame")

# tstop = NA marks a right-censored observation under type = "interval2"
survObj <- with(temp_df, Surv(time = tstart, time2 = tstop, type = "interval2"))
survFit <- survfit(survObj ~ 1)
summary(survFit)

The confidence interval does not narrow over time. It is no narrower when I use the whole dataset (which contains approximately 10 times the number of events). I am unable to figure out what is going wrong.

For what it's worth, this does not look like a bug in the software, but rather a potential limitation of using something as flexible as the non-parametric maximum likelihood estimator (NPMLE, also known as the Turnbull estimator, which is what survfit fits when you give it interval censored data) to estimate a survival curve. The TL;DR version of this answer is that I suggest you use a parametric model such as the Weibull, using survival::survreg, icenReg::ic_par or icenReg::ic_bayes. Admission of bias: I'm the author of icenReg.
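As a minimal sketch of the parametric route, here is an intercept-only Weibull fit with survival::survreg on the sample posted in the question. The object names (weib_fit, shape, scale, surv_weib) and the time grid are my own choices for illustration. The conversion from survreg's parameterisation to the usual Weibull shape and scale follows the convention described in the survreg documentation.

```r
library(survival)

# Sample data from the question (tstop = NA means right-censored)
temp_df <- data.frame(
  tstart = c(0.01, 38, 0.01, 0.01, 23, 26, 0.01, 19, 0.01, 0.01, 22, 6,
             0.01, 14, 16, 0.01, 0.01, 0.01, 0.01, 21, 15, 0.01, 0.01, 13,
             10, 0.01, 0.01, 19, 0.01, 0.01, 0.01, 0.01, 22, 17, 27, 14, 16,
             0.01, 20, 27, 10, 0.01, 0.01, 16, 20, 7, 6, 15, 0.01, 0.01),
  tstop  = c(4.01, NA, 5.01, 8.01, NA, NA, 5.01, NA, 3.01, 16.01, NA, 6.01,
             8.01, NA, NA, 7.01, 16.01, 1.01, 10.01, NA, NA, 5.01, 8.01, NA,
             NA, 2.01, 3.01, NA, 7.01, 5.01, 2.01, 9.01, NA, NA, NA, NA, NA,
             10.01, NA, NA, NA, 5.01, 10.01, NA, NA, NA, 7.01, NA, 14.01, 4.01)
)

# Intercept-only Weibull model for interval censored data
weib_fit <- survreg(Surv(tstart, tstop, type = "interval2") ~ 1,
                    data = temp_df, dist = "weibull")

# survreg reports log(scale) as the intercept and 1/shape as its "scale"
shape <- 1 / weib_fit$scale
scale <- exp(unname(coef(weib_fit)))

# Fitted survival function S(t) on an illustrative grid of times
times     <- seq(0.01, 40, length.out = 200)
surv_weib <- pweibull(times, shape = shape, scale = scale, lower.tail = FALSE)
```

The fitted curve is smooth by construction, which is where the extra information comes from relative to the NPMLE.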

A somewhat technical but very relevant note about the NPMLE is that it only assigns positive probability mass to Turnbull intervals. A Turnbull interval has as its left side the left side of some observation interval, and as its right side the next closest right side of any of the observation intervals. To illustrate, I've plotted your observation intervals and the corresponding Turnbull intervals.
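The definition above can be sketched in a few lines of base R. The helper turnbull_intervals is hypothetical (it is not part of survival or icenReg), and it assumes closed intervals with NA right endpoints treated as Inf. The packages may handle open/closed endpoints and ties differently, so treat this as an illustration of the idea rather than a reproduction of what survfit does internally.

```r
# Hypothetical helper: find the Turnbull intervals for a set of observation
# intervals [left, right]; right = NA is treated as right-censored (Inf).
turnbull_intervals <- function(left, right) {
  right[is.na(right)] <- Inf
  lefts  <- sort(unique(left))
  rights <- sort(unique(right))
  out <- lapply(lefts, function(l) {
    # next closest right endpoint at or after this left endpoint
    r <- min(rights[rights >= l])
    # keep [l, r] only if no other left endpoint falls inside it
    if (any(lefts > l & lefts <= r)) NULL else c(left = l, right = r)
  })
  do.call(rbind, out)
}

# Sample data from the question
tstart <- c(0.01, 38, 0.01, 0.01, 23, 26, 0.01, 19, 0.01, 0.01, 22, 6,
            0.01, 14, 16, 0.01, 0.01, 0.01, 0.01, 21, 15, 0.01, 0.01, 13,
            10, 0.01, 0.01, 19, 0.01, 0.01, 0.01, 0.01, 22, 17, 27, 14, 16,
            0.01, 20, 27, 10, 0.01, 0.01, 16, 20, 7, 6, 15, 0.01, 0.01)
tstop  <- c(4.01, NA, 5.01, 8.01, NA, NA, 5.01, NA, 3.01, 16.01, NA, 6.01,
            8.01, NA, NA, 7.01, 16.01, 1.01, 10.01, NA, NA, 5.01, 8.01, NA,
            NA, 2.01, 3.01, NA, 7.01, 5.01, 2.01, 9.01, NA, NA, NA, NA, NA,
            10.01, NA, NA, NA, 5.01, 10.01, NA, NA, NA, 7.01, NA, 14.01, 4.01)

ti <- turnbull_intervals(tstart, tstop)
print(ti)
```

Under these assumptions the sample yields seven intervals, the last two being [16, 16.01] and [38, Inf].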

[Figure: the observation intervals and the corresponding Turnbull intervals]

Note that there is a huge gap between the last two Turnbull intervals. This leads to a very "jumpy" NPMLE, which also leads to quite a bit of error in between the jumps.

After having spent a long time thinking about this issue, my quick summary is that this is a consequence of having only mildly informative data and too much flexibility. In most survival analysis cases, it is reasonable to assume a smooth survival curve, such as a parametric distribution. As long as the distribution is not overly restrictive (read: the one-parameter exponential distribution), this mild assumption of smoothness allows you to get much more information out of your data without introducing too much bias.

To illustrate, I've attached a plot of a Weibull fit with confidence intervals and the fitted NPMLE next to it.
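A rough way to draw a comparison like this with the survival package alone is sketched below. It is not the code used for the original plot. The dashed band is an assumption on my part: it comes from predict() on the survreg fit with type = "quantile" and plus or minus 1.96 standard errors, so it is an approximate interval for the time at a given survival probability rather than for survival at a given time.

```r
library(survival)

# Sample data from the question (tstop = NA means right-censored)
temp_df <- data.frame(
  tstart = c(0.01, 38, 0.01, 0.01, 23, 26, 0.01, 19, 0.01, 0.01, 22, 6,
             0.01, 14, 16, 0.01, 0.01, 0.01, 0.01, 21, 15, 0.01, 0.01, 13,
             10, 0.01, 0.01, 19, 0.01, 0.01, 0.01, 0.01, 22, 17, 27, 14, 16,
             0.01, 20, 27, 10, 0.01, 0.01, 16, 20, 7, 6, 15, 0.01, 0.01),
  tstop  = c(4.01, NA, 5.01, 8.01, NA, NA, 5.01, NA, 3.01, 16.01, NA, 6.01,
             8.01, NA, NA, 7.01, 16.01, 1.01, 10.01, NA, NA, 5.01, 8.01, NA,
             NA, 2.01, 3.01, NA, 7.01, 5.01, 2.01, 9.01, NA, NA, NA, NA, NA,
             10.01, NA, NA, NA, 5.01, 10.01, NA, NA, NA, 7.01, NA, 14.01, 4.01)
)

# NPMLE (Turnbull) fit and intercept-only Weibull fit on the same data
np_fit   <- survfit(Surv(tstart, tstop, type = "interval2") ~ 1, data = temp_df)
weib_fit <- survreg(Surv(tstart, tstop, type = "interval2") ~ 1,
                    data = temp_df, dist = "weibull")

# Weibull quantiles with standard errors; p is the failure probability,
# so the matching survival probability is 1 - p. All rows are identical
# for an intercept-only model, hence the first row is taken.
pct   <- seq(0.01, 0.99, by = 0.01)
q     <- predict(weib_fit, type = "quantile", p = pct, se.fit = TRUE)
t_hat <- q$fit[1, ]
t_se  <- q$se.fit[1, ]

# Step function: NPMLE; blue curves: Weibull fit and approximate band
plot(np_fit, conf.int = FALSE, xlab = "Time", ylab = "Survival probability")
lines(t_hat, 1 - pct, col = "blue")
lines(t_hat - 1.96 * t_se, 1 - pct, col = "blue", lty = 2)
lines(t_hat + 1.96 * t_se, 1 - pct, col = "blue", lty = 2)
```

icenReg provides its own plotting for these fits, which is probably the more convenient option if you go with ic_par or ic_bayes.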

[Figure: Weibull fit with confidence intervals, next to the fitted NPMLE]

FYI, the box that you see with the NPMLE is not a confidence interval. Rather, the NPMLE is only unique up to the probability assigned to each Turnbull interval; how that probability is distributed within the interval does not affect the log-likelihood. So any survival curve that passes through that box maximizes the log-likelihood.
