
Automatically remove colinear variables using correlation matrix in R

I am trying to iteratively and automatically build pairs of correlated independent variables from a correlation matrix. I will then put each pair into a regression model and remove the less significant variable of the pair.

So far, my code looks like this:

#Correlation testing on only numeric variables
num.cols <- bind.data[,sapply(bind.data, is.numeric),with=FALSE]
cor.matrix <- cor(num.cols,use="complete.obs")

#Create a table of all pairings of potentially colinear variables
#start the process by hardcoding the first 2 iterations
cor.vars1 <- expand.grid(var1 = colnames(cor.matrix)[1], 
                     var2 = row.names(cor.matrix[which((abs(cor.matrix[,1]) > cor.cutoff) & (abs(cor.matrix[,1]) != 1)),]))
cor.vars1 <- as.data.table(cor.vars1)

cor.vars2 <- expand.grid(var1 = colnames(cor.matrix)[2], 
                     var2 = row.names(cor.matrix[which((abs(cor.matrix[,2]) > cor.cutoff) & (abs(cor.matrix[,2]) != 1)),]))
cor.vars2 <- as.data.table(cor.vars2)
cor.vars <- rbind(cor.vars1, cor.vars2)

#now create for-loop to automatically do the rest
for (i in 3:length(num.cols)) {
  cor.varsn <- expand.grid(var1 = colnames(cor.matrix)[i], 
                       var2 = row.names(cor.matrix[which((abs(cor.matrix[,i]) > cor.cutoff) & (abs(cor.matrix[,i]) != 1)),]))
  cor.vars <- rbind(cor.vars, cor.varsn)
}

The basic idea: for every column of the correlation matrix, I want an expand.grid made of the column name and the row name of every variable whose absolute correlation with the column variable exceeds some cutoff, cor.cutoff. I do this for each column and rbind the results. At the end I have a 2-column data.table in which each row is a pair of correlated predictor variables.

My problem is that the for-loop breaks when it reaches a column with no correlations to other variables. Rather than skipping to the next column that meets the requirement, it stops completely. Is there an elegant way to handle this other than an "if" statement? This matters particularly when one of the first columns is the problem (i.e. cor.vars1 or cor.vars2 has no correlations to other variables).

You can try the following function:

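rmcl = function(cor_mat, threshold) {
  # Compare absolute correlations; NA is reserved as the "dropped" marker,
  # so the input itself must not contain any NA.
  cor_mat = abs(cor_mat)
  stopifnot(!anyNA(cor_mat))
  n = nrow(cor_mat)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # Drop variable i if it is too strongly correlated with a later variable j.
      if (cor_mat[i, j] > threshold) {
        cor_mat[i, ] = NA
        break
      }
    }
  }
  # Rows that were not blanked out are the variables to keep.
  idx = which(!is.na(cor_mat[, 1]))
  # drop = FALSE keeps a matrix even when a single variable survives.
  cor_mat[idx, idx, drop = FALSE]
}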

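The trick is to preserve the shape of the correlation matrix: a dropped variable has its row set to NA rather than deleted, so the row and column indices stay valid and the iteration can continue. Note that the returned matrix holds absolute correlations.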
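To illustrate why marking beats deleting, here is a minimal sketch on a small hand-written matrix. The matrix `m` and the names `marked`, `deleted`, `keep` and `kept` are made up for this illustration and are not part of the answer above.

```r
# Hypothetical 3 x 3 correlation matrix, used only for illustration.
m <- matrix(c(1.00, 0.95, 0.10,
              0.95, 1.00, 0.20,
              0.10, 0.20, 1.00),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Marking: blank out row "a". The matrix is still 3 x 3, so a loop
# that walks over indices 1..3 can carry on unchanged.
marked <- m
marked["a", ] <- NA
dim(marked)

# Deleting: the matrix shrinks to 2 x 2 and every later index shifts,
# which is what trips up a loop that is already running.
deleted <- m[-1, -1]
dim(deleted)

# After the loop, the surviving rows are the ones that are not NA.
keep <- which(!is.na(marked[, 1]))
kept <- marked[keep, keep, drop = FALSE]
kept
```

Both routes end with the same 2 x 2 matrix for "b" and "c"; the difference is only that the marked version can be produced safely inside a loop.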

An example:

data = data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
# Make the last 3 columns perfectly correlated with the first 3.
data$x4 = 2 * data$x1
data$x5 = 2 * data$x2
data$x6 = 2 * data$x3
# The result is named cor_mat rather than c to avoid reusing the name of base::c().
cor_mat = cor(data)
cor_mat = rmcl(cor_mat, .9)
cor_mat


          x4         x5         x6
x4 1.0000000 0.08847400 0.03297110
x5 0.0884740 1.00000000 0.09915481
x6 0.0329711 0.09915481 1.00000000
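If the goal is still the two-column table of correlated pairs described in the question, one loop-free possibility is sketched below. The helper name `cor_pairs`, its arguments and the small matrix `m` are assumptions made for this illustration; they are not part of the original answer. It relies on `which(..., arr.ind = TRUE)` and `upper.tri()` from base R, and a variable with no strong correlations simply contributes no rows, so nothing needs to be skipped by hand.

```r
# Hypothetical helper: list every pair whose absolute correlation exceeds cutoff.
cor_pairs <- function(cor_mat, cutoff) {
  # Look only above the diagonal, so each pair appears once and
  # a variable is never paired with itself.
  hits <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(var1 = rownames(cor_mat)[hits[, "row"]],
             var2 = colnames(cor_mat)[hits[, "col"]],
             cor  = cor_mat[hits],
             stringsAsFactors = FALSE)
}

# Hand-written correlation matrix, used only for illustration.
m <- matrix(c(1.00,  0.95,  0.10,
              0.95,  1.00, -0.92,
              0.10, -0.92,  1.00),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

pairs_hi   <- cor_pairs(m, 0.9)   # pairs a-b and b-c
pairs_none <- cor_pairs(m, 0.99)  # no pair qualifies: a zero-row data frame
pairs_hi
```

The result is a plain data.frame; wrap it in as.data.table() if a data.table is needed for the later regression step.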
