
How to add specific conditions to stepAIC

I am running a regression with 37 variables and using stepAIC to perform model selection. I do NOT want a predictive model; I just want to find out which variables have the best explanatory power.

My current code looks like:

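library(MASS)  # provides stepAIC
fitObject <- lm(DEP ~ ., data = mydata)
DEP.select <- stepAIC(fitObject, direction = 'both', scope = list(lower = ~AUC), trace = FALSE, k = log(nrow(mydata)))
# DEP is my dependent variable, and AUC is an independent variable I want to keep in the model.
# k = log(n), with n the number of observations, gives the BIC penalty.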

The problem is that many of my variables are highly correlated, and the model stepAIC returns contains several of those highly correlated variables. Note that I have forced AUC into the model, so multicollinearity is especially problematic when variables highly correlated with AUC are selected.

Is there a way to specify thresholds in the function for the correlation between variables or for the p-values of the coefficients?

Comments on any other approach that could solve my problem are also welcome.

Thank you!

Perhaps the Variance Inflation Factor will work better for you. This article explains some of the logic: http://en.wikipedia.org/wiki/Variance_inflation_factor

Example use:

v=ezvif(df,yvar ='columnNameOfWhichYouAreTryingToPredict')

Here is the function I wrote that combines VIF::vif with cross-validation.

library(VIF)
library(cvTools)

# Returns the variables selected most often by VIF::vif across k cross-validation folds.
ezvif <- function(df, yvar, folds = 5, trace = FALSE) {
  f <- cvFolds(nrow(df), K = folds)
  # One vote counter per candidate predictor.
  findings <- list()
  for (v in names(df)) {
    if (v == yvar) next
    findings[[v]] <- 0
  }
  for (i in 1:folds) {
    rows <- f$subsets[f$which != i]       # rows outside fold i
    y <- df[rows, yvar]
    xdf <- df[rows, names(df) != yvar]    # remove the output variable
    vifResult <- vif(y, xdf, trace = trace, subsize = min(200, floor(nrow(xdf))))
    for (v in names(xdf)[vifResult$select]) {
      findings[[v]] <- findings[[v]] + 1  # vote
    }
  }
  findings <- sort(unlist(findings), decreasing = TRUE)
  if (trace) print(findings[findings > 0])
  # Return the outcome name plus every variable tied for the most votes.
  return(c(yvar, names(findings[findings == findings[1]])))
}
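For comparison, the quantity described in the Wikipedia article linked above can be computed directly in base R. The following is only a minimal sketch on simulated data: the helper name classical_vif and the variables x1, x2 and x3 are made up for illustration and are not part of the answer's code or of any package. As far as I know, the VIF package implements a VIF-based variable selection procedure rather than returning these per-variable values, so this is a separate diagnostic.

```r
# Classical VIF for each predictor: 1 / (1 - R^2), where R^2 comes from
# regressing that predictor on all of the other predictors.
# classical_vif is a hypothetical helper, assumed here for illustration.
classical_vif <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

# Simulated predictors (an assumption, not the asker's data).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1
x3 <- rnorm(n)                 # unrelated to x1 and x2

vifs <- classical_vif(data.frame(x1, x2, x3))
print(round(vifs, 1))
```

A common rule of thumb treats values above roughly 5 or 10 as a sign of problematic collinearity, but the cutoff is a judgment call.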

I would recommend removing the variables with high correlations. The libraries caret and corrplot can help:

library(corrplot)
library(caret)
dm = data.matrix(mydata[, names(mydata) != 'DEP']) # without your outcome variable

Visualize your correlations, clustering highly correlated variables together:

corrplot(cor(dm), order = 'hclust')

And find the indices of the variables that you could remove due to high (>0.75) correlations:

findCorrelation(cor(dm), cutoff = 0.75)

Removing these variables can improve your model. After removing them, continue with stepAIC as you described in your question.
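Because AUC has to stay in the model, a plain correlation filter might drop AUC itself instead of the variables correlated with it. One possible workaround, sketched here on simulated data, is to admit candidates one at a time and skip any that correlate too strongly with a variable already kept, starting from AUC. The data and the column names A2, X1 and X2 are assumptions for illustration, and the result of this greedy pass depends on the order of the candidates.

```r
library(MASS)  # stepAIC

# Simulated stand-in for mydata (an assumption, not the asker's data).
set.seed(42)
n <- 300
mydata <- data.frame(AUC = rnorm(n), X1 = rnorm(n), X2 = rnorm(n))
mydata$A2  <- mydata$AUC + rnorm(n, sd = 0.2)  # highly correlated with AUC
mydata$DEP <- 2 * mydata$AUC + mydata$X1 + rnorm(n)

cutoff <- 0.75
keep <- "AUC"  # forced into the model, so it is kept first
candidates <- setdiff(names(mydata), c("DEP", "AUC"))
cm <- abs(cor(mydata[, c("AUC", candidates)]))

for (v in candidates) {
  # Admit v only if it is not too correlated with anything already kept.
  if (all(cm[v, keep] <= cutoff)) keep <- c(keep, v)
}

# Fit on the filtered predictors, then run stepAIC as in the question.
fit <- lm(reformulate(keep, response = "DEP"), data = mydata)
sel <- stepAIC(fit, direction = "both", scope = list(lower = ~AUC),
               trace = FALSE, k = log(nrow(mydata)))
print(keep)
print(formula(sel))
```

Putting the variables you care most about first in candidates gives them priority, since later ones are compared against everything already kept.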

To assess multicollinearity between predictors when running the dredge function (MuMIn package), pass the following max.r function as the "extra" argument:

max.r <- function(x){
  corm <- cov2cor(vcov(x))
  corm <- as.matrix(corm)
  if (length(corm)==1){
    corm <- 0
    max(abs(corm))
  } else if (length(corm)==4){
    # Intercept plus a single predictor: no pair of predictors to compare, so report 0.
    cormf <- corm[2:nrow(corm),2:ncol(corm)]
    cormf <- 0
    max(abs(cormf))
  } else {
    cormf <- corm[2:nrow(corm),2:ncol(corm)]
    diag(cormf) <- 0
    max(abs(cormf))
  }
}

Then simply run dredge, limiting the number of predictor variables with m.lim and passing the max.r function:

options(na.action = na.fail)
Allmodels <- dredge(Fullmodel, rank = "AIC", m.lim=c(0, 3), extra= max.r) 
Allmodels[Allmodels$max.r<=0.6, ] ##Subset models with max.r <=0.6 (not collinear)
NCM <- get.models(Allmodels, subset = max.r<=0.6) ##Retrieve models with max.r <=0.6 (not collinear)
model.sel(NCM) ##Final model selection table

This works for lme4 models. For nlme models see: https://github.com/rojaff/dredge_mc
