Dealing with nested variables in R linear regression

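I have a dataset that contains some nested variables. For example, I have the following variables: the speed of a car, whether another car is following it (other_car), and, if there is another car, the distance between the two cars. Dummy dataset: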

speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed,other_car,distance)

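I want to include the variables other_car and distance in the model as nested variables, i.e. if another car is present, its distance should also be taken into account. Following the approach mentioned here: https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model , I tried the following: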

dft <- data.frame(speed,other_car,distance)
dft$other_car<-factor(dft$other_car)

lm_speed <- lm(speed ~ dft$other_car + dft$other_car:dft$distance)
summary(lm_speed)

This gives the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Any ideas?

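This happens because distance is NA whenever other_car == 0. By default lm() drops every row that contains a missing value, so only the other_car == 1 rows remain and the factor is left with a single level. See: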

dft$distance[dft$other_car==0]
[1] NA NA NA NA NA NA NA
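
The row-dropping step can be checked directly. The sketch below builds the model frame roughly the way lm() does internally (missing rows omitted, unused factor levels dropped); the names mf, n_rows and n_levels are only illustrative.

```r
# Dummy data from the question
speed     <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- factor(c(0,1,0,0,0,1,1,1,1,0,1,0,1,0))
distance  <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)
dft <- data.frame(speed, other_car, distance)

# Mimic what lm() does before fitting: omit rows with NA and
# drop factor levels that no longer occur in the remaining rows.
mf <- model.frame(speed ~ other_car + other_car:distance, data = dft,
                  na.action = na.omit, drop.unused.levels = TRUE)

n_rows   <- nrow(mf)               # rows that survive NA removal
n_levels <- nlevels(mf$other_car)  # levels left in other_car
print(c(rows = n_rows, levels = n_levels))
```

Only the 7 rows with a following car survive, and other_car is left with one level, which is what the contrasts error is complaining about.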

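You can replace the NA values with a constant distance for other_car == 0, so that the model keeps the other_car == 0 level and finds that distance has no effect within that subset: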

dft$distance[dft$other_car==0]<-0

dft$other_car<- factor(dft$other_car)

lm_speed <- lm(speed ~ other_car + other_car:distance, data = dft)
summary(lm_speed)

Call:
lm(formula = speed ~ other_car + other_car:distance, data = dft)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.015  -8.500  -3.876   8.894  21.000 

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          39.0000     5.0405   7.737 8.96e-06 ***
other_car1            4.6480    13.0670   0.356    0.729    
other_car0:distance       NA         NA      NA       NA    
other_car1:distance   0.3157     0.6133   0.515    0.617    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.34 on 11 degrees of freedom
Multiple R-squared:  0.1758,    Adjusted R-squared:  0.026 
F-statistic: 1.174 on 2 and 11 DF,  p-value: 0.3452
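
As a rough check that the placeholder is harmless, the sketch below refits the model with two different constants and compares the fitted values. It also fits an additive formula, speed ~ other_car + distance, which should span the same columns once distance is 0 for the no-car rows and so avoids the NA coefficient. The helper make_data and the object names are illustrative, not part of the original answer.

```r
# Dummy data from the question
speed     <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance  <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

# Hypothetical helper: fill the missing distances with a chosen constant
make_data <- function(fill) {
  d <- distance
  d[other_car == 0] <- fill   # placeholder where no car is following
  data.frame(speed, other_car = factor(other_car), distance = d)
}

fit_0   <- lm(speed ~ other_car + other_car:distance, data = make_data(0))
fit_100 <- lm(speed ~ other_car + other_car:distance, data = make_data(100))
fit_add <- lm(speed ~ other_car + distance,           data = make_data(0))

# The fitted values should agree in all three cases
same_fit_constant <- isTRUE(all.equal(fitted(fit_0), fitted(fit_100)))
same_fit_additive <- isTRUE(all.equal(fitted(fit_0), fitted(fit_add)))
print(c(constant = same_fit_constant, additive = same_fit_additive))
```

The choice of constant only affects the column that lm() reports as not defined because of singularities; the distance slope for the other_car == 1 rows is estimated from those rows alone.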

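Another workaround could be to convert the factor to numeric, but this is not the same model: the rows with NA are still dropped, so only the 7 observations with a following car are used, and other_car is constant within them (hence its NA coefficient):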

speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed,other_car,distance)



dft$other_car<- as.numeric(factor(dft$other_car))

lm_speed <- lm(speed ~ other_car + other_car:distance, data = dft)
summary(lm_speed)

Call:
lm(formula = speed ~ other_car + other_car:distance, data = dft)

Residuals:
        2         6         7         8         9        11        13 
  0.03776   3.72205  19.77341 -15.38369 -16.01511  10.61782  -2.75227 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)         43.6480    12.9010   3.383   0.0196 *
other_car                NA         NA      NA       NA  
other_car:distance   0.1579     0.3281   0.481   0.6508  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.27 on 5 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.04424,   Adjusted R-squared:  -0.1469 
F-statistic: 0.2314 on 1 and 5 DF,  p-value: 0.6508

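The positive coefficient suggests that speed increases with the distance to the other car (or, conversely, that drivers tend to slow down when the other car is too close), although with p = 0.65 on 7 observations the effect is far from significant in this dummy data.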
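
To see why the numeric coding is a different model, note that as.numeric(factor(...)) maps the levels 0/1 to the codes 1/2. After the NA rows are dropped, other_car equals 2 everywhere, so the interaction column is simply 2 * distance. The sketch below compares its slope with a plain regression on the rows that have a following car; the object names are illustrative.

```r
# Dummy data from the question
speed     <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance  <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

codes   <- as.numeric(factor(other_car))   # levels 0/1 become codes 1/2
dft_num <- data.frame(speed, other_car = codes, distance)

# Numeric coding, as in the workaround above
fit_num <- lm(speed ~ other_car + other_car:distance, data = dft_num)

# Plain regression of speed on distance for the rows with a following car
fit_sub <- lm(speed ~ distance, data = dft_num, subset = other_car == 2)

slope_num <- unname(coef(fit_num)["other_car:distance"])
slope_sub <- unname(coef(fit_sub)["distance"])
print(c(numeric_coding = slope_num, subset_fit = slope_sub))
```

The numeric-coding slope should be half the subset slope, consistent with the 0.1579 and 0.3157 reported in the two summaries above.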
