Is exhaustive model selection in R with higher-order interaction terms and inclusion of main effects possible with regsubsets() or another function?

Question

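I would like to perform automated, exhaustive model selection in R on a dataset with 7 predictors (5 continuous and 2 categorical). I would like all continuous predictors to be able to interact (up to at least 3-way interactions) and also to have non-interacting squared terms.

I have been using regsubsets() from the leaps package and have gotten good results. However, many of the models contain interaction terms without the corresponding main effects (e.g., g*h is an included model predictor but g is not). Since including the main effects also affects the model scores (Cp, BIC, etc.), they need to be in the models being compared even if they are not strong predictors.

I could manually weed through the results and cross off models that include interactions without main effects, but I would prefer an automated way to exclude them. I am fairly certain this is not possible with regsubsets() or leaps(), and probably not with glmulti either. Does anyone know of another exhaustive model selection function that allows this kind of specification, or have a suggestion for a script that will sort the model output and find only the models that fit my specs?

Below is simplified output from my model searches with regsubsets(). You can see that models 3 and 4 include interaction terms without including all of the related main effects. If no other function can run a search with my specs, then suggestions on easily subsetting this output to exclude models without the necessary main effects would be helpful.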

Model  adjR2        BIC           CP           n_pred  X.Intercept.  x1     x2     x3     x1.x2  x1.x3  x2.x3  x1.x2.x3
1      0.470344346  -41.26794246  94.82406866  1       TRUE          FALSE  TRUE   FALSE  FALSE  FALSE  FALSE  FALSE
2      0.437034361  -36.5715963   105.3785057  1       TRUE          FALSE  FALSE  TRUE   FALSE  FALSE  FALSE  FALSE
3      0.366989617  -27.54194252  127.5725366  1       TRUE          FALSE  FALSE  FALSE  TRUE   FALSE  FALSE  FALSE
4      0.625478214  -64.64414719  46.08686422  2       TRUE          TRUE   FALSE  FALSE  FALSE  FALSE  FALSE  TRUE
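
For readers who want to reproduce a table of this shape, here is a minimal sketch of how such a search might be set up. The simulated data, the variable names x1 to x3, and the nbest/nvmax values are assumptions for illustration only; they are not the asker's actual data or settings.

```r
library(leaps)  # assumes the leaps package is installed

## Simulated data (hypothetical, for illustration only)
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x2 + 1.5 * d$x3 + d$x1 * d$x2 + rnorm(n)

## Exhaustive search over all main effects and interactions up to 3-way.
## regsubsets() treats every model-matrix column as an independent
## candidate, which is why interactions can appear without main effects.
search <- regsubsets(y ~ x1 * x2 * x3, data = d,
                     nbest = 3, nvmax = 7, method = "exhaustive")
s <- summary(search)

## Collect the metrics and the inclusion matrix in one data frame.
## data.frame() rewrites "x1:x2" as "x1.x2" and "(Intercept)" as "X.Intercept.".
which_mat <- s$which
rownames(which_mat) <- NULL
results <- data.frame(adjR2 = s$adjr2, BIC = s$bic, CP = s$cp,
                      n_pred = as.integer(rownames(s$which)),
                      which_mat)
head(results)
```

Each row of `results` is one candidate model, and the logical columns show which terms it contains.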

Answer 1

You can use the dredge() function from the MuMIn package.

See also "Subsetting in dredge (MuMIn) - must include interaction if main effects are present".
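
A minimal sketch of what this might look like, assuming the MuMIn package is installed and using simulated data with hypothetical variable names. According to the MuMIn documentation, dredge() respects marginality by default, so an interaction is only considered together with its main effects and lower-order terms.

```r
library(MuMIn)  # assumes the MuMIn package is installed

## Simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + d$x2 * d$x3 + rnorm(n)

## dredge() requires the global model to fail on missing values,
## so that all sub-models are fitted to the same rows.
global <- lm(y ~ x1 * x2 * x3, data = d, na.action = na.fail)

## Fit all sub-models of the global model and rank them by BIC
ms <- dredge(global, rank = "BIC")
head(ms)
```

In the resulting table a term that is absent from a model shows up as NA in that model's row.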

Answer 2

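After trying dredge I found that my models have too many predictors and interactions to run it in a reasonable amount of time (I calculated that with 40+ potential predictors it might take 300,000 hours to complete the search on my computer). It does, however, exclude models in which interactions are not matched by their main effects, so I imagine it could still be a good solution for many people.

For my needs I have moved back to regsubsets and written some code to parse the search output and exclude models that contain terms in interactions that are not also included as main effects. The code seems to work well, so I will share it here. Warning: it was written with human expediency in mind, not computational efficiency, so it could probably be re-coded to run faster. If you have hundreds of thousands of models to test you might want to make it sleeker. (I have been running searches with ~50,000 models and up to 40 factors, which take my 2.4 GHz i5 core a few hours to process.)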
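
To give a feel for the scale involved, here is a rough back-of-the-envelope sketch. It assumes each of 40 candidate terms can be independently in or out of a model (ignoring marginality constraints, which would reduce the count) and a hypothetical rate of 1,000 model fits per second; neither figure comes from the calculation above, which is not shown.

```r
## Rough size of an all-subsets search (illustrative assumptions only)
n_terms <- 40                 # assumed number of candidate terms
fits_per_second <- 1000       # hypothetical fitting rate

n_models <- 2^n_terms         # every term either in or out
est_hours <- n_models / fits_per_second / 3600

format(n_models, big.mark = ",")
round(est_hours)
```

Under these assumptions the estimate lands in the hundreds of thousands of hours, and every extra candidate term doubles it.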

reg.output.search.with.test<- function (search_object) {  ## input an object from a regsubsets search
## First build a df listing model components and metrics of interest
  search_comp<-data.frame(R2=summary(search_object)$rsq,  
                          adjR2=summary(search_object)$adjr2,
                          BIC=summary(search_object)$bic,
                          CP=summary(search_object)$cp,
                          n_predictors=row.names(summary(search_object)$which),
                          summary(search_object)$which)
  ## Categorize different types of predictors based on whether '.' is present
  predictors<-colnames(search_comp)[(match("X.Intercept.",names(search_comp))+1):dim(search_comp)[2]]
  main_pred<-predictors[grep(pattern = ".", x = predictors, invert=T, fixed=T)]
  higher_pred<-predictors[grep(pattern = ".", x = predictors, fixed=T)]
  ##  Define a variable that indicates whether a model should be rejected; set to FALSE for all models initially.
  search_comp$reject_model<-FALSE  

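  for(main_eff_n in seq_along(main_pred)){  ## iterate through main effects
    ## Find column numbers of higher level terms containing the main effect.
    ## Whole name components are compared so that e.g. "x1" does not also match "x10.x2".
    search_cols<-which(vapply(strsplit(higher_pred, ".", fixed=TRUE),
                              function(parts) main_pred[main_eff_n] %in% parts, logical(1)))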
    ## Subset models that are not yet flagged for rejection, only test these
    valid_model_subs<-search_comp[search_comp$reject_model==FALSE,]  
    ## Subset dfs with only main or higher level predictor columns
    ## (drop=FALSE keeps these as data frames even when only one column matches)
    main_pred_df<-valid_model_subs[,colnames(valid_model_subs)%in%main_pred,drop=FALSE]
    higher_pred_df<-valid_model_subs[,colnames(valid_model_subs)%in%higher_pred,drop=FALSE]

    if(length(search_cols)>0){  ## If there are higher level pred, test each one
      for(high_eff_n in search_cols){  ## iterate through higher level pred. 
        ##  Test if the intxn effect is present without main effect (working with whole column of models)
        test_responses<-((main_pred_df[,main_eff_n]==FALSE)&(higher_pred_df[,high_eff_n]==TRUE)) 
        valid_model_subs[test_responses,"reject_model"]<-TRUE  ## Set reject to TRUE where appropriate
        } ## End high_eff for
      ## Transfer changes in reject to primary df:
      search_comp[row.names(valid_model_subs),"reject_model"]<-valid_model_subs[,"reject_model"]
      } ## End if
    }  ## End main_eff for

  ## Output resulting table of all models, named for the original search object and current time/date, in folder "model_search_reg"
  dir.create("./model_search_reg", showWarnings=FALSE)  ## make sure the output folder exists
  current_time_date<-format(Sys.time(), "%m_%d_%y at %H_%M_%S")
  write.table(search_comp,file=paste("./model_search_reg/",paste(current_time_date,deparse(substitute(search_object)),
             "regSS_model_search.csv",sep="_"),sep=""),row.names=FALSE, col.names=TRUE, sep=",")
}  ## End reg.output.search.with.test fn
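
As a possible faster variant of the same idea, the check can be done on the logical inclusion matrix directly, one interaction column at a time. This is only a sketch: `missing_main_effects` is a hypothetical helper, not part of leaps or of the code above, and like the code above it only checks for missing main effects, not for missing lower-order interactions. It is shown here on the four models from the question's table.

```r
## Hypothetical helper: flag models containing an interaction whose
## component main effects are not all present.
## which_mat: logical matrix, one row per model, one column per term.
## sep: separator used in interaction column names ("." after data.frame(),
##      ":" in the raw summary(regsubsets_object)$which matrix).
missing_main_effects <- function(which_mat, sep = ".") {
  terms <- setdiff(colnames(which_mat), c("(Intercept)", "X.Intercept."))
  parts <- strsplit(terms, sep, fixed = TRUE)
  is_main <- lengths(parts) == 1
  bad <- rep(FALSE, nrow(which_mat))
  for (j in which(!is_main)) {
    ## main effects that this higher-order term depends on
    needed <- intersect(parts[[j]], terms[is_main])
    any_missing <- rowSums(!which_mat[, needed, drop = FALSE]) > 0
    bad <- bad | (which_mat[, terms[j]] & any_missing)
  }
  unname(bad)
}

## The four models from the question's table
which_mat <- rbind(
  c(TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE),
  c(TRUE, FALSE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE),
  c(TRUE, FALSE, FALSE, FALSE, TRUE,  FALSE, FALSE, FALSE),
  c(TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
)
colnames(which_mat) <- c("X.Intercept.", "x1", "x2", "x3",
                         "x1.x2", "x1.x3", "x2.x3", "x1.x2.x3")

reject <- missing_main_effects(which_mat)
reject  # models 3 and 4 are flagged, models 1 and 2 are kept
```

Because each step works on whole columns, the loop runs once per interaction term rather than once per model, which should matter when there are tens of thousands of models.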

Answer 1
2 2015-09-26 11:22:37

Answer 2
2 Accepted 2015-10-08 15:01:56