
How to run linear regression on a dataset, taking each time a single variable as the dependent variable?

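I have a dataset called "dt" in which every variable is numeric. I want to take each variable in turn as the dependent variable and use stepwise regression to find the best combination of the remaining variables as predictors. If the resulting "best combination" gives an adjusted R^2 > 0.70, it should be printed to the console. Here is my naive attempt.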

for(i in ncol(dt)){
    nul<-lm(dt[,i]~1,data=dt)
    ful<-lm(dt[,i]~.,data=dt)
    model<-step(nul,scope = list(lower=nul,upper=ful),direction="forward",trace=FALSE)
    if((summary(lm(as.formula(model$call),data=dt)))$adj.r.squared>0.70){
        print(as.formula(model$call))
        cat(paste("\n"))
    }
}

This is the unwanted output I get:

dt[, i] ~ Y

Warning messages:
1: attempting model selection on an essentially perfect fit is nonsense 
2: In summary.lm(lm(as.formula(model$call), data = dt)) :
essentially perfect fit: summary may be unreliable

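As @42- correctly pointed out, what you will get is statistical "garbage".

However, if you still insist on "testing" this, it is fairly easy to obtain the R^2 of many linear models at once with leaps::regsubsets.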

library(leaps)
# One-predictor models for Fertility (column 1), fit without an intercept
a <- regsubsets(x = as.matrix(swiss[, -1]), y = swiss[, 1], nvmax = 1, nbest = 100,
                intercept = FALSE, method = "exhaustive", really.big = TRUE)
summary(a) 

Subset selection object
5 Variables 
                 Forced in Forced out
Agriculture          FALSE      FALSE
Examination          FALSE      FALSE
Education            FALSE      FALSE
Catholic             FALSE      FALSE
Infant.Mortality     FALSE      FALSE
100 subsets of each size up to 1
Selection Algorithm: exhaustive
         Agriculture Examination Education Catholic Infant.Mortality
1  ( 1 ) " "         " "         " "       " "      "*"             
1  ( 2 ) "*"         " "         " "       " "      " "             
1  ( 3 ) " "         "*"         " "       " "      " "             
1  ( 4 ) " "         " "         " "       "*"      " "             
1  ( 5 ) " "         " "         "*"       " "      " "     

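In the example above, five lm models are fit with Fertility as the dependent variable, each using one of the remaining variables as its single predictor, e.g. Fertility ~ Infant.Mortality, Fertility ~ Agriculture, and so on. Because intercept=F is set, these models are fit without an intercept.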

summary(a)$rsq # returns R^2 for each of the five models

[1] 0.9703145 0.8558076 0.7054873 0.5660736 0.4474043
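The size of these values is tied to the missing intercept. For a model without an intercept, summary.lm measures R^2 against the uncentered total sum of squares, which inflates it for all-positive data such as swiss. The sketch below uses base R only to put the two versions side by side for Fertility. It is an illustration, not part of the original answer; the helper name r2_pair is made up, and it is an assumption that regsubsets with intercept=F uses the same uncentered definition.

```r
# Illustrative check with base R only; `swiss` ships with R's datasets package.
predictors <- setdiff(names(swiss), "Fertility")

# Hypothetical helper: R^2 of Fertility ~ 0 + p (no intercept)
# next to R^2 of Fertility ~ p (with intercept).
r2_pair <- function(p) {
  no_int   <- lm(reformulate(c("0", p), response = "Fertility"), data = swiss)
  with_int <- lm(reformulate(p, response = "Fertility"), data = swiss)
  c(no_intercept   = summary(no_int)$r.squared,
    with_intercept = summary(with_int)$r.squared)
}

# One row per predictor, sorted by the no-intercept R^2.
r2_table <- t(sapply(predictors, r2_pair))
r2_table[order(-r2_table[, "no_intercept"]), ]
```

If the assumption holds, the no_intercept column should line up with summary(a)$rsq, while the with_intercept column shows how much weaker the same single predictors look once the mean of Fertility is accounted for.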

By turning the above into a function:

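# x is the column index of the dependent variable; all other columns are candidate predictors
nonsense_lm <- function(data, x) regsubsets(x = as.matrix(data[, -x]), y = data[, x], nvmax = 1, nbest = 100, intercept = FALSE, method = "exhaustive", really.big = TRUE)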

Then loop over the columns, taking each variable in turn as the dependent variable:

nonsense <- lapply(1:ncol(swiss), function(x) nonsense_lm(swiss, x))
lapply(nonsense, function(x)summary(x)$rsq)

 [[1]]
 [1] 0.9703145 0.8558076 0.7054873 0.5660736 0.4474043

 [[2]]
 [1] 0.8558076 0.8121654 0.5785572 0.4961365 0.2715248

 [[3]]
 [1] 0.7844437 0.7729180 0.7054873 0.4961365 0.2132834

 [[4]]
 [1] 0.7729180 0.5456765 0.4474043 0.2715248 0.2137402

 [[5]]
 [1] 0.5785572 0.5660736 0.5135628 0.2137402 0.2132834

 [[6]]
 [1] 0.9703145 0.8121654 0.7844437 0.5456765 0.5135628
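The lists above give the R^2 values but not which predictor each one belongs to, and they are not yet filtered by the 0.70 cut-off from the question. A minimal sketch of that last step follows, using base R lm instead of regsubsets so that no extra package is needed. The names pair_r2, combos and above are hypothetical. The sketch keeps the answer's no-intercept, plain R^2 setup rather than the adjusted R^2 the question asks for.

```r
# Hypothetical helper (not from the answer): R^2 of response ~ 0 + predictor.
pair_r2 <- function(data, response, predictor) {
  fit <- lm(reformulate(c("0", predictor), response = response), data = data)
  summary(fit)$r.squared
}

# Every ordered (response, predictor) pair of distinct columns.
combos <- expand.grid(response = names(swiss), predictor = names(swiss),
                      stringsAsFactors = FALSE)
combos <- combos[combos$response != combos$predictor, ]
combos$r.squared <- mapply(pair_r2, combos$response, combos$predictor,
                           MoreArgs = list(data = swiss), USE.NAMES = FALSE)

# Keep only the models above the question's 0.70 cut-off.
above <- combos[combos$r.squared > 0.70, ]
above[order(above$response, -above$r.squared), ]
```

Each value appears twice, once per direction, because the no-intercept R^2 of a single-predictor model is the same whichever variable is treated as the response. The same repetition is visible in the lists above.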

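Again, note that these R^2 values are effectively statistical "garbage". Formulating the problem to be tested properly is the most critical step of any analysis.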
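For completeness, two details of the loop in the question explain its output. for(i in ncol(dt)) visits only the last column, and dt[,i] ~ . with data=dt leaves the response column among the candidate predictors, which produces the "essentially perfect fit" warning. The sketch below shows one way to repair both while keeping the forward stepwise search and the adjusted R^2 threshold. It is an illustration under assumptions: swiss stands in for dt, and the function name stepwise_by_response is made up. The statistical caveats above still apply to whatever it prints.

```r
# Illustrative repair of the question's loop; `swiss` stands in for `dt`.
stepwise_by_response <- function(data, threshold = 0.70) {
  kept <- list()
  for (resp in names(data)) {  # every column, not just the last one
    # Intercept-only starting model for this response.
    null_fit <- lm(reformulate("1", response = resp), data = data)
    # Upper scope: all columns except the response itself.
    upper <- reformulate(setdiff(names(data), resp))
    fit <- step(null_fit, scope = list(lower = ~1, upper = upper),
                direction = "forward", trace = 0)
    adj <- summary(fit)$adj.r.squared
    if (adj > threshold) {
      kept[[resp]] <- list(formula = formula(fit), adj.r.squared = adj)
    }
  }
  kept
}

kept <- stepwise_by_response(swiss)

# Print each selected formula with its adjusted R^2.
for (resp in names(kept)) {
  cat(paste(deparse(kept[[resp]]$formula), collapse = " "),
      " adj. R^2 =", round(kept[[resp]]$adj.r.squared, 3), "\n")
}
```

Building the formulas by column name with reformulate keeps readable variable names in the printed output, instead of dt[, i] as in the question.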

