How to fit a Cox PH model in R (more than one factor)?

Question

I have a dataset for studying breast cancer patients. I want to fit a Cox proportional hazards model (without interaction terms).

The variables are age (<40 = 1, 40~60 = 2, >60 = 3), predominant site (not middle = 1, middle = 2, unknown = 9), maximum diameter (<2.5 = 1, 2.5~5.5 = 2), menopausal status (<2 years = 1, >2 years = 2, unknown = 9), estrogen level (neg = 0, pos = 1, unknown = 9), progesterone level (neg = 0, pos = 1, unknown = 9) and w.censored (0 = censored, 1 = not censored).

    'data.frame':   572 obs. of  6 variables:
     $ age       : Factor w/ 3 levels "1","2","3": 1 1 2 1 2 1 1 2 2 1 ...
     $ mepl.sts  : Factor w/ 3 levels "1","2","9": 1 3 2 1 2 1 1 2 1 1 ...
     $ pre.site  : Factor w/ 3 levels "1","2","9": 1 3 1 3 3 1 3 3 1 3 ...
     $ max.dia   : Factor w/ 4 levels "1","2","3","9": 1 2 4 3 3 3 3 3 2 2 ...
     $ es.level  : Factor w/ 3 levels "0","1","9": 3 3 3 2 2 1 1 1 1 1 ...
     $ prog.level: Factor w/ 3 levels "0","1","9": 3 3 3 1 1 1 1 1 1 1 ...

First I turned all these categorical variables into factors using as.factor. Then I fit a Cox model with all the variables in R and got the following results.

library(survival)

fit <- coxph(Surv(surv.day, w.cens) ~ age + mepl.sts + pre.site + max.dia +
               es.level + prog.level, data = bcnew)
summary(fit)
Call:
coxph(formula = Surv(surv.day, w.cens) ~ age + mepl.sts + pre.site + 
    max.dia + es.level + prog.level, data = bcnew)

  n= 572, number of events= 74 

               coef exp(coef) se(coef)      z Pr(>|z|)    
age2        -0.6059    0.5456   0.4408 -1.374 0.169287    
age3        -0.1771    0.8377   0.5457 -0.325 0.745463    
mepl.sts2    0.1884    1.2073   0.4145  0.455 0.649412    
mepl.sts9   -0.1904    0.8266   0.6041 -0.315 0.752580    
pre.site2    0.9555    2.5999   0.3594  2.659 0.007846 ** 
pre.site9    0.6220    1.8627   0.3260  1.908 0.056347 .  
max.dia2     1.0824    2.9518   0.3722  2.908 0.003632 ** 
max.dia3     1.9059    6.7256   0.4570  4.170 3.04e-05 ***
max.dia9    -0.8610    0.4227   0.6148 -1.400 0.161380    
es.level1   -0.6653    0.5141   0.3525 -1.887 0.059152 .  
es.level9   -1.1442    0.3185   0.3101 -3.690 0.000225 ***
prog.level1 -1.1256    0.3245   0.3921 -2.871 0.004095 ** 
prog.level9      NA        NA   0.0000     NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 

Concordance= 0.826  (se = 0.021 )
Likelihood ratio test= 103.8  on 12 df,   p=<2e-16
Wald test            = 94.27  on 12 df,   p=7e-15
Score (logrank) test = 140.6  on 12 df,   p=<2e-16

Since the p-values for age and menopausal status were > 0.1, the model was

h(t|x) = h0(t) * exp(0.9555*pre.site(2) + 0.622*pre.site(9) + 1.0824*max.dia(2) + 1.9059*max.dia(3) - 0.6653*es.level(1) - 1.1442*es.level(9) - 1.1256*prog.level(1))

I don't know if this model is correct, but I think there is something strange about the results. It is common sense that a person's age would have an effect on survival time, but each age group gets a p-value much greater than 0.1.

By the way, if I have 20 grouped variables in my dataset, can I use the step(fit) procedure to get the final model?

Thank you very much!

Answer 1

One source of your problem is trying to fit too many coefficients on too small a data set. Although 572 sounds like a lot of cases, the information in a survival model is essentially determined by the number of events. You only have 74 events.

The usual rule of thumb to avoid overfitting in survival analysis is to have 10-20 events per coefficient that you are estimating, unless you are using some form of penalization. With 74 events, you should only be trying to fit 4 to 7 coefficients. You are fitting 12, which runs two kinds of risk.
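The arithmetic behind that range can be checked directly. This is only a back-of-the-envelope sketch using the numbers reported by summary(fit) in the question; the variable names are made up for illustration.

```r
# Events-per-coefficient rule of thumb applied to the fit in the question
n_events <- 74   # "number of events" reported by summary(fit)
n_coef   <- 12   # estimable coefficients (df of the likelihood ratio test)

# Number of coefficients supported at 20 and at 10 events per coefficient
supported <- n_events / c(20, 10)
round(supported, 1)            # 3.7 7.4

# Events actually available per coefficient in the fitted model
events_per_coef <- n_events / n_coef
round(events_per_coef, 1)      # 6.2
```

At about 6 events per coefficient, the fitted model falls below even the lenient end of the rule.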

One is missing true associations with outcome, because small event numbers give the coefficients high standard errors. That might be what's going on with age: adding more predictors to the model can diminish the apparent significance of other predictors that are associated with outcome.

The other is finding false associations with outcome that happen to occur in this data set but wouldn't replicate in another; you might just be fitting noise in these data. With so many predictors, I would guess that the reasonably high concordance of 0.826 from this model wouldn't be found on a new data set.

I'd recommend using Frank Harrell's course notes and book on Regression Modeling Strategies as a guide. You'll find there several other ways to improve your modeling, including:

Don't set up separate categories for "unknown." Use imputation to estimate values in a way that allows the modeling to take that extra uncertainty into account. That's a well respected procedure that will help prevent bias.
Don't bin continuous predictors like age into groups. Model them as continuous, and as flexibly as possible. For example, young people often have a more aggressive form of cancer (e.g., due to a genetic problem) than older individuals, so they die sooner after diagnosis. A U-shaped association of age with survival after diagnosis, as your data indicate, is quite possible. Your model should be able to handle that.
Don't fit a model and then arbitrarily throw out predictors whose coefficients have p-values > 0.1. Don't focus so much on p-values at all. Avoid automated model selection, as you propose with step(fit).
Use bootstrap resampling to validate and calibrate your model, checking how much overfitting might be involved.
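Two of these points, flexible modeling of a continuous predictor and bootstrap validation, can be sketched with the rms package that accompanies Harrell's book. The data below are simulated, since the question only has binned age; the data frame d, the marker er, and all effect sizes are assumptions made up for illustration, and imputation is not shown.

```r
library(survival)
library(rms)

# Simulated stand-in data: continuous age with a U-shaped effect on the
# log hazard, plus a hypothetical protective binary marker `er`
set.seed(1)
n   <- 600
age <- runif(n, 25, 85)
er  <- rbinom(n, 1, 0.6)
lp  <- 0.002 * (age - 55)^2 - 0.7 * er
t_event <- rexp(n, rate = 0.01 * exp(lp))
t_cens  <- runif(n, 0, 100)
d <- data.frame(
  time   = pmin(t_event, t_cens),
  status = as.integer(t_event <= t_cens),
  age    = age,
  er     = er
)

# A restricted cubic spline with 4 knots lets the age effect be non-linear.
# x = TRUE, y = TRUE store the design matrix and response for validate().
fit_rcs <- cph(Surv(time, status) ~ rcs(age, 4) + er,
               data = d, x = TRUE, y = TRUE)
anova(fit_rcs)   # includes a test of non-linearity for age

# Bootstrap estimate of optimism (overfitting) in the performance indexes
val <- validate(fit_rcs, B = 100)
val

# Convert optimism-corrected Dxy to a concordance index, C = Dxy / 2 + 0.5.
# abs() is a guard in case the sign convention for Dxy differs by rms version.
c_corrected <- abs(val["Dxy", "index.corrected"]) / 2 + 0.5
c_corrected
```

The gap between the index.orig and index.corrected columns of val is the estimated optimism; a large gap suggests that the apparent concordance would not hold up on new data.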

Survival analysis is tricky. If you omit any predictor associated with outcome you can bias results for included predictors. But if you include more predictors than your number of events allows, you risk overfitting. The Harrell references should point you in the right direction.

Answer 2

If you know with fairly high certainty that a covariate has clinical relevance, you include it in the model even if it happens to fit badly on your training data. Essentially, your prior is strong enough that you would need large amounts of negative evidence to reject it.

Now why doesn't age seem significant in your model? I'm not familiar with the data so I can't tell. What I can tell you is that I would be much more methodical in developing the model.

A good first step is to study the individual covariates and their relationship to the outcome. Do that. Does age still look insignificant?
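One way to take that first step is to fit a separate Cox model for each covariate and look at the overall likelihood ratio test, alongside Kaplan-Meier curves by group. The sketch below runs on simulated data that borrows a few column names from the question; bc_sim and its effect sizes are made up, and on the real data you would pass bcnew and the full list of covariates. It is meant as a descriptive look at each variable, not as a rule for deciding which ones to keep.

```r
library(survival)

# Simulated stand-in for the asker's bcnew (hypothetical effects)
set.seed(42)
n <- 500
bc_sim <- data.frame(
  age      = factor(sample(1:3, n, replace = TRUE)),
  max.dia  = factor(sample(1:3, n, replace = TRUE)),
  es.level = factor(sample(0:1, n, replace = TRUE))
)
lp <- 0.6 * (as.integer(bc_sim$max.dia) - 1) - 0.5 * (bc_sim$es.level == "1")
t_event <- rexp(n, rate = 0.002 * exp(lp))
t_cens  <- runif(n, 0, 1000)
bc_sim$surv.day <- pmin(t_event, t_cens)
bc_sim$w.cens   <- as.integer(t_event <= t_cens)

covariates <- c("age", "max.dia", "es.level")

# One univariable Cox model per covariate, summarised by its overall
# likelihood ratio test (all levels of the factor tested together)
univ <- lapply(covariates, function(v) {
  f   <- as.formula(paste("Surv(surv.day, w.cens) ~", v))
  fit <- coxph(f, data = bc_sim)
  lrt <- summary(fit)$logtest
  data.frame(covariate = v,
             df        = lrt[["df"]],
             lr_chisq  = lrt[["test"]],
             p_value   = lrt[["pvalue"]])
})
univ_table <- do.call(rbind, univ)
univ_table

# Kaplan-Meier curves by age group give a visual check of the same question
km <- survfit(Surv(surv.day, w.cens) ~ age, data = bc_sim)
plot(km, col = 1:3, xlab = "Days", ylab = "Survival probability")
```

Testing all levels of a factor together avoids reading too much into the p-value of any single dummy variable such as age2 or age3.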

Answer 1
4 2022-07-12 12:49:44

Answer 2
1 2022-07-12 08:14:06