Find columns with same data in one data.frame

I have a data.frame named A with 5000 columns. How can I find the columns in this data.frame that are equal to each other?

As @John mentioned, there are problems with using duplicated(). I would add that transposing the data.frame forces all of the data into a single data type before duplicated() even compares it. As an example, here is a data.frame:

df <- data.frame( a = LETTERS[1:3],
                  b = 1:3,
                  c = as.character(1:3),
                  d = LETTERS[1:3],
                  e = 1:3,
                  f = 1:3)
df
#   a b c d e f
# 1 A 1 1 A 1 1
# 2 B 2 2 B 2 2
# 3 C 3 3 C 3 3

Note that column c is very similar to columns b, e, and f, but not identical to them because the types differ (character versus numeric). The solution suggested by @Jubbles would disregard these differences.
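To make the coercion concrete, here is a small base R sketch that reuses the example data.frame above. The names m, same_after_t and same_before_t are introduced only for illustration.

```r
df <- data.frame(a = LETTERS[1:3],
                 b = 1:3,
                 c = as.character(1:3),
                 d = LETTERS[1:3],
                 e = 1:3,
                 f = 1:3)

# t() converts the data.frame to a matrix first, and a matrix holds one type,
# so the integer columns are turned into text alongside the letters
m <- t(df)
is.character(m)  # TRUE

# After transposing, the rows for b and c can no longer be told apart ...
same_after_t <- identical(m["b", ], m["c", ])

# ... although the original columns are not identical
same_before_t <- identical(df$b, df$c)
```

Here same_after_t is TRUE while same_before_t is FALSE, which is exactly the difference a transpose-based comparison loses.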

Instead, it seems more appropriate to use the identical() function on the columns of your data.frame. You can compare the columns pairwise using outer():

are.cols.identical <- function(col1, col2) identical(df[,col1], df[,col2])
identical.mat      <- outer(colnames(df), colnames(df),
                            FUN = Vectorize(are.cols.identical))
identical.mat
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
# [1,]  TRUE FALSE FALSE  TRUE FALSE FALSE
# [2,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
# [3,] FALSE FALSE  TRUE FALSE FALSE FALSE
# [4,]  TRUE FALSE FALSE  TRUE FALSE FALSE
# [5,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
# [6,] FALSE  TRUE FALSE FALSE  TRUE  TRUE

From here, you can use clustering to identify groups of identical columns (there may be better ways, so if you know one, feel free to comment or even edit my answer):

# hclust() and cutree() come from the stats package, which is attached by default
distances <- as.dist(!identical.mat)
tree      <- hclust(distances)
cut       <- cutree(tree, h = 0.5)
cut
# [1] 1 2 3 1 2 2

split(colnames(df), cut)
# $`1`
# [1] "a" "d"
# 
# $`2`
# [1] "b" "e" "f"
# 
# $`3`
# [1] "c"

Edit 1: to disregard differences in floating point values, one can use

are.cols.identical <- function(col1, col2) isTRUE(all.equal(df[,col1], df[,col2]))
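As a quick illustration of what this looser comparison changes, here is a base R sketch with made-up values (x, y and the result names are hypothetical):

```r
x <- 0.1 + 0.2
y <- 0.3

exact_match  <- identical(x, y)          # FALSE: the two doubles differ in the last bits
approx_match <- isTRUE(all.equal(x, y))  # TRUE: equal within the default tolerance

# all.equal() returns a character description instead of FALSE when the
# values differ, which is why it is wrapped in isTRUE()
int_vs_dbl  <- isTRUE(all.equal(1:3, c(1, 2, 3)))         # TRUE: integer versus double is ignored
int_vs_char <- isTRUE(all.equal(1:3, as.character(1:3)))  # FALSE: numeric versus character
```

Note that this version also treats an integer column and a double column with the same values as equal, which may or may not be what you want.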

Edit 2: a more efficient method than clustering for grouping the names of identical columns is

cut <- apply(identical.mat, 1, function(x)match(TRUE, x))
split(colnames(df), cut)
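For convenience, the pieces above could be wrapped into one function that does not rely on a global df. This is only a sketch: group_identical_columns is a hypothetical name, not an existing function, and it still makes one identical() call per pair of columns, so it may be slow for thousands of columns.

```r
# Hypothetical helper: returns a list of groups of identical column names
group_identical_columns <- function(data) {
  cols <- colnames(data)
  # Pairwise comparison of columns; Vectorize() lets outer() call the
  # scalar comparison once per pair of column names
  same <- outer(cols, cols,
                FUN = Vectorize(function(i, j) identical(data[[i]], data[[j]])))
  # Label each column with the index of the first column it is identical to
  first_match <- apply(same, 1, function(x) match(TRUE, x))
  unname(split(cols, first_match))
}

df <- data.frame(a = LETTERS[1:3],
                 b = 1:3,
                 c = as.character(1:3),
                 d = LETTERS[1:3],
                 e = 1:3,
                 f = 1:3)

groups <- group_identical_columns(df)
```

With the example data.frame, groups holds the same three groups as above: a and d, then b, e and f, then c on its own.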

This question is very similar to the one here, with subtle differences but the same caveats.

I would again suggest using digest(), as in the following (thanks to @flodel for the data.frame and for a very nice suggestion above):

df <- data.frame( a = LETTERS[1:3],
  b = 1:3,
  c = as.character(1:3),
  d = LETTERS[1:3],
  e = 1:3,
  f = 1:3)

library(digest)  # provides digest()
dfDig <- sapply(df, digest)

ansL <- lapply(seq_along(dfDig), function(x) names(which(dfDig == dfDig[x])))

unique(ansL)

# [[1]]
# [1] "a" "d"

# [[2]]
# [1] "b" "e" "f"

# [[3]]
# [1] "c"

This still won't distinguish between 1.0 and 1, though.
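One possible reading of this remark: in R the literals 1.0 and 1 are the same double value, so no comparison can separate them, whereas the integer 1L is a different type. A small base R check (the variable names are made up for illustration):

```r
dbl_a <- 1.0
dbl_b <- 1
int_c <- 1L

same_literals <- identical(dbl_a, dbl_b)  # TRUE: both are doubles
same_types    <- identical(dbl_b, int_c)  # FALSE: double versus integer

typeof(dbl_b)  # "double"
typeof(int_c)  # "integer"
```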

EDIT

As suggested in the comments by @flodel, the following can be used as an alternative after creating dfDig:

split(colnames(df), vapply(dfDig, match, 1L, dfDig))

How about transposing the data.frame and using duplicated()?

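# transpose so that the columns of A become rows
B <- as.data.frame(t(A))
# duplicated() flags every repeat except the first occurrence
dup1 <- duplicated(B)
# scanning from the end flags every repeat except the last occurrence
dup2 <- duplicated(B, fromLast = TRUE)
# a column belongs to a duplicate group if either scan flags it
dup_final <- dup1 | dup2
saved_colnames <- colnames(A)[dup_final]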
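A minimal sketch of this approach on a small, made-up numeric data.frame (A here is toy data, not the 5000-column one from the question). It combines the forward and backward scans with a logical OR so that every member of a duplicate group is flagged, including the first and last copies:

```r
A <- data.frame(x = c(1, 2, 3),
                y = c(4, 5, 6),
                z = c(1, 2, 3),
                w = c(1, 2, 3))

B <- as.data.frame(t(A))

dup_forward  <- duplicated(B)                   # flags z and w (repeats of x)
dup_backward <- duplicated(B, fromLast = TRUE)  # flags x and z (repeats of w)

# OR keeps every column that has at least one identical partner
in_dup_group <- dup_forward | dup_backward
dup_cols <- colnames(A)[in_dup_group]

# AND would keep only the copies that are neither first nor last
middle_only <- colnames(A)[dup_forward & dup_backward]
```

Here dup_cols contains x, z and w, while middle_only contains only z. Because all columns in this toy example are numeric, the type coercion caused by t() does not affect the result; with mixed column types it can.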
