
Dealing with nested variables in R linear regression

I have a dataset that includes some nested variables. For example, I have the following variables: the speed of a car (speed), whether another car is following it (other_car) and, if there is another car, the distance between the two cars (distance). Dummy dataset:

speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed,other_car,distance)

I would like to include the variables other_car and distance in a model as nested variables, i.e. if another car is present, the distance should also be taken into account. Following an approach mentioned here: https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model , I tried the following:

dft <- data.frame(speed,other_car,distance)
dft$other_car<-factor(dft$other_car)

lm_speed <- lm(speed ~ dft$other_car + dft$other_car:dft$distance)
summary(lm_speed)

This gives the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

Any ideas?

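This happens because distance is NA whenever other_car==0. lm() drops every row that contains an NA, so only the rows with other_car==1 remain and the factor is left with a single level, which cannot be given contrasts. You can see the NA values with: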

dft$distance[dft$other_car==0]
[1] NA NA NA NA NA NA NA
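
To make the cause of the error concrete, here is a minimal sketch that builds the model frame roughly the way lm() does internally, with NA rows omitted and unused factor levels dropped. The name mf is only illustrative, and passing na.action and drop.unused.levels explicitly is an assumption made so that the behaviour does not depend on session options.

```r
speed     <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance  <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)
dft <- data.frame(speed, other_car = factor(other_car), distance)

# Rows containing NA are removed, then factor levels that no longer
# occur in the remaining rows are dropped.
mf <- model.frame(speed ~ other_car + other_car:distance, data = dft,
                  na.action = na.omit, drop.unused.levels = TRUE)

nrow(mf)              # rows that survive the NA removal
levels(mf$other_car)  # factor levels that are left
```

Only the seven rows with other_car==1 survive, so other_car ends up with the single level "1", and that is what the contrasts error is complaining about.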

You could replace the NA distances for other_car==0 with a constant, so that those rows are kept in the fit. The model then estimates the other_car effect from all rows, and the distance slope for the other_car==0 subset is reported as NA because distance does not vary within that subset:

dft$distance[dft$other_car==0]<-0

dft$other_car<- factor(dft$other_car)

lm_speed <- lm(speed ~ other_car + other_car:distance, data = dft)
summary(lm_speed)

Call:
lm(formula = speed ~ other_car + other_car:distance, data = dft)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.015  -8.500  -3.876   8.894  21.000 

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          39.0000     5.0405   7.737 8.96e-06 ***
other_car1            4.6480    13.0670   0.356    0.729    
other_car0:distance       NA         NA      NA       NA    
other_car1:distance   0.3157     0.6133   0.515    0.617    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.34 on 11 degrees of freedom
Multiple R-squared:  0.1758,    Adjusted R-squared:  0.026 
F-statistic: 1.174 on 2 and 11 DF,  p-value: 0.3452
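
The value used to fill the NA distances should not matter, because a constant within the other_car==0 group is collinear with the intercept and the other_car dummy. The following sketch checks this by comparing a fill of 0 with an arbitrary fill of 100; the helper fit_with_fill and the value 100 are hypothetical and only serve the illustration.

```r
speed     <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance  <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

# Hypothetical helper: fill the missing distances with `fill`, then fit
# the same nested model as in the answer.
fit_with_fill <- function(fill) {
  d <- data.frame(speed, other_car = factor(other_car), distance)
  d$distance[d$other_car == "0"] <- fill
  lm(speed ~ other_car + other_car:distance, data = d)
}

fit0   <- fit_with_fill(0)
fit100 <- fit_with_fill(100)

# Compare the fitted values and the coefficients of the two fits.
all.equal(fitted(fit0), fitted(fit100))
cbind(fill_0 = coef(fit0), fill_100 = coef(fit100))
```

Both fits give the same fitted values and the same estimable coefficients, and other_car0:distance is NA in both.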

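Another workaround could be to convert the factor to numeric, but this isn't the same model: as.numeric(factor(...)) codes the two groups as 1 and 2 rather than 0 and 1, and the rows with a missing distance are still dropped (see "7 observations deleted due to missingness" in the output):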

speed <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)

dft <- data.frame(speed,other_car,distance)



dft$other_car<- as.numeric(factor(dft$other_car))

lm_speed <- lm(speed ~ other_car + other_car:distance, data = dft)
summary(lm_speed)

Call:
lm(formula = speed ~ other_car + other_car:distance, data = dft)

Residuals:
        2         6         7         8         9        11        13 
  0.03776   3.72205  19.77341 -15.38369 -16.01511  10.61782  -2.75227 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)         43.6480    12.9010   3.383   0.0196 *
other_car                NA         NA      NA       NA  
other_car:distance   0.1579     0.3281   0.481   0.6508  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.27 on 5 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.04424,   Adjusted R-squared:  -0.1469 
F-statistic: 0.2314 on 1 and 5 DF,  p-value: 0.6508

The positive coefficient suggests that speed increases with the distance to the other car (or, put the other way round, that drivers tend to slow down when the other car is too close), although with this dummy data the estimate is far from significant (p = 0.65).
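
To see how the numeric workaround differs from the factor model, this sketch compares it with a plain regression of speed on distance restricted to the rows where another car is present. The names fit_num and fit_sub are illustrative only.

```r
speed     <- c(30,50,60,30,33,54,65,33,33,54,65,34,45,32)
other_car <- c(0,1,0,0,0,1,1,1,1,0,1,0,1,0)
distance  <- c(NA,20,NA,NA,NA,21,5,15,17,NA,34,NA,13,NA)
dft <- data.frame(speed, other_car, distance)

# as.numeric() on a factor returns the internal level codes, not 0/1.
as.numeric(factor(c(0, 1)))

# Numeric workaround: other_car is 2 in every row that is not dropped,
# so the interaction column is simply 2 * distance.
dft_num <- transform(dft, other_car = as.numeric(factor(other_car)))
fit_num <- lm(speed ~ other_car + other_car:distance, data = dft_num)

# Plain regression on the rows where another car is present.
fit_sub <- lm(speed ~ distance, data = subset(dft, other_car == 1))

coef(fit_num)
coef(fit_sub)
```

The slope from the numeric workaround is half the subset slope because its regressor is 2 * distance. The subset intercept and slope match (Intercept) + other_car1 and other_car1:distance from the factor model, which is the sense in which the nested model fits a mean for the cars without a follower and a separate line for the cars with one.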


 