How to fit a Cox PH model in R (more than one factor)?
I have a dataset for studying breast cancer patients. I want to fit a Cox proportional hazards model (without considering interaction terms).
The variables are age (<40 = 1, 40~60 = 2, >60 = 3), predominant site (not middle = 1, middle = 2, unknown = 9), maximum diameter (<2.5 = 1, 2.5~5.5 = 2), menopausal status (<2 years = 1, >2 years = 2, unknown = 9), estrogen level (neg = 0, pos = 1, unknown = 9), progesterone level (neg = 0, pos = 1, unknown = 9) and w.censored (0 = censored, 1 = not censored).
'data.frame': 572 obs. of 6 variables:
$ age : Factor w/ 3 levels "1","2","3": 1 1 2 1 2 1 1 2 2 1 ...
$ mepl.sts : Factor w/ 3 levels "1","2","9": 1 3 2 1 2 1 1 2 1 1 ...
$ pre.site : Factor w/ 3 levels "1","2","9": 1 3 1 3 3 1 3 3 1 3 ...
$ max.dia : Factor w/ 4 levels "1","2","3","9": 1 2 4 3 3 3 3 3 2 2 ...
$ es.level : Factor w/ 3 levels "0","1","9": 3 3 3 2 2 1 1 1 1 1 ...
$ prog.level: Factor w/ 3 levels "0","1","9": 3 3 3 1 1 1 1 1 1 1 ...
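For reference, a conversion that yields a structure like the one above can be sketched as follows. The toy data frame `bc` and its values are hypothetical stand-ins, since the real `bcnew` data are not shown; only `as.factor` itself comes from the question.

```r
# Hypothetical toy data standing in for bcnew (values are made up)
bc <- data.frame(
  age      = c(1, 1, 2, 3),
  es.level = c(9, 0, 1, 9)
)

# Convert every coded column to a factor in one step
cols <- c("age", "es.level")
bc[cols] <- lapply(bc[cols], as.factor)

str(bc)  # both columns are now reported as factors
```

Applying `lapply` over the selected columns avoids repeating `as.factor` once per variable.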
First I turned all these categorical variables into factors using as.factor. Then I did a Cox fit of all the variables in R and got the following results.
fit <- coxph(Surv(surv.day,w.cens) ~age + mepl.sts + pre.site+max.dia
+ es.level+prog.level ,data=bcnew)
summary(fit)
> summary(fit)
Call:
coxph(formula = Surv(surv.day, w.cens) ~ age + mepl.sts + pre.site +
max.dia + es.level + prog.level, data = bcnew)
n= 572, number of events= 74
coef exp(coef) se(coef) z Pr(>|z|)
age2 -0.6059 0.5456 0.4408 -1.374 0.169287
age3 -0.1771 0.8377 0.5457 -0.325 0.745463
mepl.sts2 0.1884 1.2073 0.4145 0.455 0.649412
mepl.sts9 -0.1904 0.8266 0.6041 -0.315 0.752580
pre.site2 0.9555 2.5999 0.3594 2.659 0.007846 **
pre.site9 0.6220 1.8627 0.3260 1.908 0.056347 .
max.dia2 1.0824 2.9518 0.3722 2.908 0.003632 **
max.dia3 1.9059 6.7256 0.4570 4.170 3.04e-05 ***
max.dia9 -0.8610 0.4227 0.6148 -1.400 0.161380
es.level1 -0.6653 0.5141 0.3525 -1.887 0.059152 .
es.level9 -1.1442 0.3185 0.3101 -3.690 0.000225 ***
prog.level1 -1.1256 0.3245 0.3921 -2.871 0.004095 **
prog.level9 NA NA 0.0000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Concordance= 0.826 (se = 0.021 )
Likelihood ratio test= 103.8 on 12 df, p=<2e-16
Wald test = 94.27 on 12 df, p=7e-15
Score (logrank) test = 140.6 on 12 df, p=<2e-16
Since the p-values for age and menopausal status were > 0.1, I took the model to be
h(t|x) = h0(t) exp(0.9555*pre.site(2) + 0.622*pre.site(9) + 1.0824*max.dia(2) + 1.9059*max.dia(3) - 0.6653*es.level(1) - 1.1442*es.level(9) - 1.1256*prog.level(1))
I don't know if this model is correct, but something about the results seems strange. It is common sense that a person's age affects survival time, yet each age group gets a p-value much greater than 0.1.
By the way, if I have 20 variables grouped in my dataset, can I use the step(fit) procedure to get the final model?
Thank you very much!
One source of your problem is trying to fit too many coefficients on too small a data set. Although 572 sounds like a lot of cases, the information in a survival model is essentially determined by the number of events. You only have 74 events.
The usual rule of thumb to avoid overfitting in survival analysis is to have 10-20 events per coefficient that you are estimating, unless you are using some form of penalization. With 74 events, you should only be trying to fit 4 to 7 coefficients. You are fitting 12. That carries a couple of types of risk.
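The arithmetic behind that rule of thumb can be checked directly. This is a minimal sketch using only the counts reported in the summary output (74 events, 12 estimated coefficients); the variable names are illustrative.

```r
n_events <- 74   # "number of events" from summary(fit)
n_coef   <- 12   # coefficients actually estimated (12 df in the tests)

# Coefficients supportable at 20 and at 10 events per coefficient
max_coef <- n_events / c(20, 10)
print(max_coef)          # 3.7 and 7.4, i.e. roughly 4 to 7 coefficients

# Events per coefficient in the model as fitted
events_per_coef <- n_events / n_coef
print(round(events_per_coef, 2))   # about 6.17, below the 10-20 guideline
```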
One is missing true associations with outcome, because small event numbers lead to high standard errors for the coefficients. That might be what's going on with age: adding more predictors to the model can diminish the apparent significance of other predictors that are associated with outcome.
The other is finding false associations with outcome that happen to occur in this data set but wouldn't replicate in another; you might just be fitting noise in these data. In this case, with so many predictors, I would guess that the reasonably high concordance of 0.826 for this model wouldn't be found on a new data set.
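One way to gauge how much of an apparent concordance is due to overfitting is a bootstrap estimate of optimism. The sketch below is an illustration, not the answer's own method: it assumes the `lung` data set that ships with the `survival` package as a stand-in for `bcnew` (which is not available here), an arbitrary set of four predictors, and an arbitrary 100 bootstrap replicates.

```r
library(survival)

set.seed(1)

# Stand-in data: complete cases from survival::lung (NOT the breast cancer data)
d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog", "wt.loss")])
form <- Surv(time, status) ~ age + sex + ph.ecog + wt.loss

# Apparent concordance: model evaluated on the data it was fitted to
fit <- coxph(form, data = d)
apparent <- concordance(fit)$concordance

# Optimism: refit on each bootstrap sample, then compare its concordance
# on that sample with its concordance on the original data
optimism <- replicate(100, {
  b  <- d[sample(nrow(d), replace = TRUE), ]
  fb <- coxph(form, data = b)
  concordance(fb)$concordance - concordance(fb, newdata = d)$concordance
})

corrected <- apparent - mean(optimism)
print(c(apparent = apparent, corrected = corrected))
```

The more coefficients are fitted relative to the number of events, the larger the gap between the apparent and corrected values tends to be.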
I'd recommend using Frank Harrell's course notes and book on Regression Modeling Strategies as a guide. You'll find there several other ways to improve your modeling, including avoiding automated model selection, as you suggest with step(fit).
Survival analysis is tricky. If you omit any predictor associated with outcome, you can bias results for the included predictors. But if you include more predictors than your number of events allows, you risk overfitting. The Harrell references should point you in the right direction.
If you know with fairly high certainty that a covariate has clinical relevance, include it in the model even if it happens to fit badly on your training data. Essentially, your prior is strong enough that you would need large amounts of negative evidence to reject it.
Now why doesn't age seem significant in your model? I'm not familiar with the data, so I can't tell. What I can tell you is that I would be much more methodical in developing the model.
A good first step is to study the individual covariates and their relationship to the outcome. Have you done that? Does age still look insignificant?
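A first pass of that kind can be sketched as one single-covariate Cox fit per variable plus Kaplan-Meier curves by age group. This is only an illustration: it assumes the `lung` data set from the `survival` package in place of `bcnew`, and the age cut points and the `age.grp` column are hypothetical.

```r
library(survival)

# Stand-in data (NOT the breast cancer data); age is grouped into three
# bands to mimic the coded age variable in the question
d <- lung
d$age.grp <- cut(d$age, breaks = c(-Inf, 60, 70, Inf), labels = c("1", "2", "3"))

covars <- c("age.grp", "sex", "ph.ecog")

# One Cox model per covariate
uni <- lapply(covars, function(v) {
  coxph(as.formula(paste("Surv(time, status) ~", v)), data = d)
})
names(uni) <- covars

# Likelihood-ratio p-value for each covariate considered on its own
lr_p <- sapply(uni, function(m) summary(m)$logtest[["pvalue"]])
print(lr_p)

# Kaplan-Meier estimate by age group, for a visual check
km <- survfit(Surv(time, status) ~ age.grp, data = d)
plot(km, lty = 1:3, xlab = "Days", ylab = "Survival probability")
```

The likelihood-ratio test is used here because it assesses a multi-level factor as a whole, rather than one level at a time as the per-coefficient Wald p-values do.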