Is exhaustive model selection in R with high-order interaction terms and inclusion of main effects possible with regsubsets() or other functions?

Question

I would like to perform automated, exhaustive model selection in R on a dataset with 7 predictors (5 continuous and 2 categorical). All continuous predictors should be allowed to interact (at least up to 3-way interactions) and should also have non-interacting squared terms.
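To make the search space concrete, the candidate terms described above could be written as a single R formula. This is only a sketch: the names y, x1 to x5 (continuous) and c1, c2 (categorical) are placeholders, not the real variable names.

```r
# Placeholder names: y (response), x1..x5 (continuous), c1 and c2 (categorical)
full_formula <- y ~ (x1 + x2 + x3 + x4 + x5)^3 +        # main effects, 2-way and 3-way interactions
  I(x1^2) + I(x2^2) + I(x3^2) + I(x4^2) + I(x5^2) +     # non-interacting squared terms
  c1 + c2                                               # categorical main effects

# Inspect the candidate terms without needing any data
full_terms  <- terms(full_formula)
term_labels <- attr(full_terms, "term.labels")
length(term_labels)               # number of candidate terms before dummy coding of factors
table(attr(full_terms, "order"))  # number of terms at each interaction order
```

Each categorical predictor expands into one column per non-reference level once the model matrix is built, so the number of columns actually searched is larger than the number of formula terms.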

I have been using regsubsets() from the leaps package and have gotten good results. However, many of the models contain interaction terms without the corresponding main effects (e.g., g*h is an included model predictor but g is not). Since including the main effects will change the model score (Cp, BIC, etc.), models must contain them when they are compared with the other models, even if the main effects are not strong predictors.

I could manually weed through the results and cross off models that include interactions without main effects, but I'd prefer an automated way to exclude them. I'm fairly certain this isn't possible with regsubsets() or leaps(), and probably not with glmulti either. Does anyone know of another exhaustive model selection function that allows such a specification, or have a suggestion for a script that will sort the model output and keep only the models that fit my specs?

Below is simplified output from my model searches with regsubsets(). You can see that models 3 and 4 include interaction terms without including all the related main effects. If no other function is known for running a search with my specs, then suggestions on how to easily subset this output to exclude models lacking the necessary main effects would be helpful.

Model adjR2      BIC            CP          n_pred  X.Intercept.    x1      x2      x3      x1.x2   x1.x3   x2.x3   x1.x2.x3
1   0.470344346 -41.26794246    94.82406866 1       TRUE            FALSE   TRUE    FALSE   FALSE   FALSE   FALSE   FALSE
2   0.437034361 -36.5715963     105.3785057 1       TRUE            FALSE   FALSE   TRUE    FALSE   FALSE   FALSE   FALSE
3   0.366989617 -27.54194252    127.5725366 1       TRUE            FALSE   FALSE   FALSE   TRUE    FALSE   FALSE   FALSE
4   0.625478214 -64.64414719    46.08686422 2       TRUE            TRUE    FALSE   FALSE   FALSE   FALSE   FALSE   TRUE

Answer 1

You can use the dredge() function from the MuMIn package.

See also "Subsetting in dredge (MuMIn) - must include interaction if main effects are present".
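A minimal sketch of that suggestion, using simulated data with placeholder names (y, x1, x2, x3) rather than the asker's dataset. It assumes the MuMIn package is installed and is skipped otherwise.

```r
# Sketch only: simulated data with placeholder variable names
set.seed(1)
dat <- data.frame(x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60))
dat$y <- 1 + dat$x2 + 0.5 * dat$x1 * dat$x2 + rnorm(60)

if (requireNamespace("MuMIn", quietly = TRUE)) {
  # dredge() requires the global model to be fitted with na.action = "na.fail"
  old_opts <- options(na.action = "na.fail")
  global <- lm(y ~ x1 * x2 * x3, data = dat)
  # Enumerate sub-models ranked by BIC; an interaction is only
  # included together with the lower-order terms it is built from
  ms <- MuMIn::dredge(global, rank = "BIC")
  options(old_opts)
  print(head(ms))
}
```

Because dredge() refits every candidate model, run time grows quickly with the number of terms in the global model.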

Answer 2

After working with dredge I found that my models have too many predictors and interactions to run dredge in a reasonable period (I calculated that with 40+ potential predictors it might take 300k hours to complete the search on my computer). But it does exclude models where interactions don't match with main effects, so I imagine it might still be a good solution for many people.

For my needs I've moved back to regsubsets and have written some code to parse the search output and exclude models that contain interaction terms whose components are not also included as main effects. This code seems to work well, so I'll share it here. Warning: it was written with human expediency in mind, not computational efficiency, so it could probably be re-coded to be faster. If you've got hundreds of thousands of models to test, you might want to make it sleeker. (I've been working on searches with ~50,000 models and up to 40 factors, which take my 2.4 GHz i5 core a few hours to process.)

reg.output.search.with.test<- function (search_object) {  ## input an object from a regsubsets search
## First build a df listing model components and metrics of interest
  search_comp<-data.frame(R2=summary(search_object)$rsq,  
                          adjR2=summary(search_object)$adjr2,
                          BIC=summary(search_object)$bic,
                          CP=summary(search_object)$cp,
                          n_predictors=row.names(summary(search_object)$which),
                          summary(search_object)$which)
  ## Categorize different types of predictors based on whether '.' is present
  predictors<-colnames(search_comp)[(match("X.Intercept.",names(search_comp))+1):dim(search_comp)[2]]
  main_pred<-predictors[grep(pattern = ".", x = predictors, invert=T, fixed=T)]
  higher_pred<-predictors[grep(pattern = ".", x = predictors, fixed=T)]
  ##  Define a variable that indicates whether a model should be rejected; set to FALSE for all models initially.
  search_comp$reject_model<-FALSE  

  for(main_eff_n in 1:length(main_pred)){  ## iterate through main effects
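    ## Find column numbers of higher level terms containing the main effect
    ## (compare whole '.'-separated components so that e.g. x1 does not also match x10)
    search_cols<-which(vapply(strsplit(higher_pred,".",fixed=TRUE),function(parts) main_pred[main_eff_n]%in%parts,logical(1)))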
    ## Subset models that are not yet flagged for rejection, only test these
    valid_model_subs<-search_comp[search_comp$reject_model==FALSE,]  
    ## Subset dfs with only main or higher level predictor columns
    main_pred_df<-valid_model_subs[,colnames(valid_model_subs)%in%main_pred]
    higher_pred_df<-valid_model_subs[,colnames(valid_model_subs)%in%higher_pred]

    if(length(search_cols)>0){  ## If there are higher level pred, test each one
      for(high_eff_n in search_cols){  ## iterate through higher level pred. 
        ##  Test if the intxn effect is present without main effect (working with whole column of models)
        test_responses<-((main_pred_df[,main_eff_n]==FALSE)&(higher_pred_df[,high_eff_n]==TRUE)) 
        valid_model_subs[test_responses,"reject_model"]<-TRUE  ## Set reject to TRUE where appropriate
        } ## End high_eff for
      ## Transfer changes in reject to primary df:
      search_comp[row.names(valid_model_subs),"reject_model"]<-valid_model_subs[,"reject_model"]
      } ## End if
    }  ## End main_eff for

  ## Output resulting table of all models named for original search object and current time/date in folder "model_search_reg"
  current_time_date<-format(Sys.time(), "%m_%d_%y at %H_%M_%S")
  write.table(search_comp,file=paste("./model_search_reg/",paste(current_time_date,deparse(substitute(search_object)),
             "regSS_model_search.csv",sep="_"),sep=""),row.names=FALSE, col.names=TRUE, sep=",")
}  ## End reg.output.search.with.test fn
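The same check can also be sketched in a vectorised form that works directly on a logical matrix shaped like summary(search_object)$which. The helper name violates_marginality is hypothetical, and the sketch assumes interaction names join their components with ":" or "." and that no variable name itself contains those characters. Like the function above, it only requires main effects, not lower-order interactions.

```r
# Hypothetical helper: TRUE for models that contain a higher-order term
# without all of the main effects that the term is built from.
violates_marginality <- function(which_mat) {
  terms <- setdiff(colnames(which_mat), c("(Intercept)", "X.Intercept."))
  parts <- strsplit(terms, "[:.]")           # split term names into components
  names(parts) <- terms
  higher <- terms[lengths(parts) > 1]        # terms made of more than one component
  reject <- rep(FALSE, nrow(which_mat))
  for (h in higher) {
    needed <- intersect(parts[[h]], terms)   # components that exist as main-effect columns
    missing_main <- rowSums(!which_mat[, needed, drop = FALSE]) > 0
    reject <- reject | (which_mat[, h] & missing_main)
  }
  unname(reject)
}

# The four models from the simplified output in the question
which_mat <- matrix(
  c(TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE,
    TRUE, FALSE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE,
    TRUE, FALSE, FALSE, FALSE, TRUE,  FALSE, FALSE, FALSE,
    TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  nrow = 4, byrow = TRUE,
  dimnames = list(1:4, c("X.Intercept.", "x1", "x2", "x3",
                         "x1.x2", "x1.x3", "x2.x3", "x1.x2.x3"))
)
flagged <- violates_marginality(which_mat)
print(flagged)  # models 3 and 4 are flagged
```

Dropping the flagged rows (for example with which_mat[!flagged, ]) leaves only the models that respect the main-effect requirement.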


Answer 1
2 2015-09-26 11:22:37

Answer 2
2 Accepted 2015-10-08 15:01:56
