
Check if any columns within a data frame are identical in R

I am iteratively fitting models to many different variables, and in a few rare cases two columns I am using as independent variables contain an identical set of values. This makes the model unidentifiable and throws an error. I would like a way to check whether any column is identical to any other column within a data frame, and then return the names of the offending columns. Here is an example data frame.

a <- rnorm(10)
b <- rnorm(10)
c <- a
d <- rnorm(10)
dat <- data.frame(a,b,c,d)

Others have already answered how to test whether two individual columns in a data frame are identical. However, I would like a way to check every column against every other column.

The caret package contains the function findLinearCombos, which you might want to try:

caret::findLinearCombos(dat)
#$linearCombos
#$linearCombos[[1]]
#[1] 3 1


#$remove
#[1] 3

Be aware, however, that the function also recommends removing any column that is a linear combination of other columns, for example a column equal to a multiplied by -1.

Second example:

dat2 <- data.frame(a,b,c,d, e = -a) 
caret::findLinearCombos(dat2)
#$linearCombos
#$linearCombos[[1]]
#[1] 3 1

#$linearCombos[[2]]
#[1] 5 1


#$remove
#[1] 3 5

You can use combn to get all pairs of column indices, then apply a function over the rows of the resulting matrix to check whether all elements in each pair of columns are equal.

pairs <- t(combn(seq_len(ncol(dat)), 2))

same <- apply(pairs, 1, function(x) all(Reduce(`==`, dat[,x])))

pairs[same,]
# [1] 1 3
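Since the question asks for column names rather than positions, the pairwise idea above can be sketched with identical(), which also copes with NA values and non-numeric columns. The helper name identical_col_pairs is hypothetical, and the sketch assumes that exact equality (same type and attributes) is what you want.

```r
# Hypothetical helper: returns a data frame of column-name pairs whose
# contents are exactly identical (via identical(), so types must match too).
identical_col_pairs <- function(df) {
  stopifnot(is.data.frame(df), ncol(df) >= 2)
  idx <- t(combn(seq_len(ncol(df)), 2))   # one row per pair of column indices
  same <- apply(idx, 1, function(p) identical(df[[p[1]]], df[[p[2]]]))
  data.frame(col1 = names(df)[idx[same, 1]],
             col2 = names(df)[idx[same, 2]],
             stringsAsFactors = FALSE)
}

# Same layout as the example data in the question
set.seed(1)
a <- rnorm(10)
dat <- data.frame(a = a, b = rnorm(10), c = a, d = rnorm(10))
identical_col_pairs(dat)
```

If no columns match, the helper returns a data frame with zero rows, which is easy to test for with nrow() inside a model-fitting loop.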

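Alternatively, check the correlations. Note that this also flags columns that are perfectly positively correlated without being identical, such as one column that is a positive linear rescaling of another.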

cor1 <- data.frame(which(cor(dat) == 1, arr.ind = TRUE))
cor1[cor1$row > cor1$col,]
#   row col
# c   3   1
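One caveat with the correlation approach is that a correlation of 1 does not imply identical values. A minimal sketch (the column names x, y and z are made up for illustration) shows a rescaled column being flagged even though it is not a duplicate. It uses a small tolerance instead of == 1, on the assumption that floating-point rounding may leave the computed correlation marginally different from 1.

```r
set.seed(42)
x <- rnorm(10)
dat3 <- data.frame(x = x, y = 2 * x + 1, z = rnorm(10))

cm <- cor(dat3)
cm[upper.tri(cm, diag = TRUE)] <- NA    # keep each pair once, drop the diagonal
flagged <- which(abs(cm - 1) < 1e-8, arr.ind = TRUE)
flagged                                  # the (y, x) pair is flagged

identical(dat3$x, dat3$y)                # FALSE: perfectly correlated, not identical
```

If only exact duplicates matter, it may be safer to confirm each flagged pair with identical() before dropping a column.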

You could use the dist function to compute the matrix of distances between your columns, and then find the pairs of columns for which the distance is 0.

m <- as.matrix(dist(t(dat)))
m[upper.tri(m, diag = TRUE)] <- NA
which(m < 1.5e-8, arr.ind = TRUE)
#   row col
# c   3   1

Note that this solution only works for numeric columns. If your data frame contains qualitative (categorical) variables, you won't be able to compare them this way.
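For data frames that mix numeric and non-numeric columns, one possible alternative (not part of the answer above) is base R's duplicated() applied to the columns as a list. This sketch flags every column that exactly repeats an earlier one; the example data and column names are made up.

```r
set.seed(1)
a <- rnorm(10)
g <- sample(c("u", "v"), 10, replace = TRUE)
mixed <- data.frame(a = a, g = g, a_copy = a, g_copy = g,
                    stringsAsFactors = FALSE)

# duplicated() on a list marks elements that are identical to an earlier element
dup <- duplicated(as.list(mixed))
names(mixed)[dup]
```

This reports only the later copy of each duplicated column, not the column it duplicates, so it suits the case where the goal is simply to decide which columns to drop.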
