Merge duplicate columns in R

Question

I have a data frame that looks like this:

   c1 c2 c3 c4
r1  1  0  1  1
r2  0  0  1  1
r3  0  1  0  0

In this case, c3 and c4 are exactly the same. I would like to remove duplicate columns but keep the column names of both c3 and c4, so that in the resulting data frame the third column's name joins the names of the identical columns.

I feel like there should be an elegant way to do this that I just can't think of. Any help would be greatly appreciated!

Edit: Just to clarify, my actual data frames are 1000 rows x 1000 columns, and I don't know which of the columns are identical. So I need an automatic way of testing whether columns are identical and, where they are, of combining their column names.

Answer 1

The extra information adds an interesting wrinkle! If you don't care about concatenating the names of the columns, you could do something like this:

df <- data.frame(c1 = c(1,0,0), c2 = c(0,0,1), c3 = c(1,1,0), c4 = c(1,1,0),
                 c5 = c(1,1,1), c6 = c(1,1,1), c7 = c(2,2,2))

library(digest)
df_clean <- df[!duplicated(lapply(df, digest))]

At this point df_clean contains the data frame without any duplicate columns.

If the column names are genuinely important, this is how I would go about it after looking at thepule's answer:

df_dups <- df[duplicated(lapply(df, digest))]  # extract the duplicate columns

# Append the name of each duplicate to the name of the kept column it matches.
# seq_along() avoids the 1:0 trap when there are no duplicates at all.
for (clean_col in seq_along(df_clean)) {
  for (dup_col in seq_along(df_dups)) {
    if (identical(df_clean[[clean_col]], df_dups[[dup_col]])) {
      colnames(df_clean)[clean_col] <- paste0(colnames(df_clean)[clean_col],
                                              colnames(df_dups)[dup_col])
    }
  }
}

The output, with additional duplicates added for testing, looks like this:

'data.frame':   3 obs. of  5 variables:
 $ c1  : num  1 0 0
 $ c2  : num  0 0 1
 $ c3c4: num  1 1 0
 $ c5c6: num  1 1 1
 $ c7  : num  2 2 2
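
For reference, the same idea can be wrapped in a single base-R function that needs no extra package. This is only a sketch: merge_duplicate_columns and its sep argument are hypothetical names, not part of any library, and it compares columns with identical() rather than with digest hashes. It makes on the order of ncol(df)^2 comparisons, which should still be manageable for a 1000 x 1000 data frame.

```r
# Hypothetical helper (not from any package): drop duplicate columns and
# join the names of identical columns into the name of the first one.
merge_duplicate_columns <- function(df, sep = "") {
  cols <- as.list(df)
  # For each column, the index of the first column it is identical to.
  first <- vapply(cols, function(x) {
    match(TRUE, vapply(cols, identical, logical(1), x))
  }, integer(1))
  keep <- unique(first)  # positions of the first occurrences
  merged <- vapply(keep, function(k) {
    paste(names(df)[first == k], collapse = sep)
  }, character(1))
  out <- df[keep]
  names(out) <- merged
  out
}

df <- data.frame(c1 = c(1, 0, 0), c2 = c(0, 0, 1), c3 = c(1, 1, 0),
                 c4 = c(1, 1, 0), c5 = c(1, 1, 1), c6 = c(1, 1, 1),
                 c7 = c(2, 2, 2))
result <- merge_duplicate_columns(df)
names(result)
# "c1" "c2" "c3c4" "c5c6" "c7"
```

Passing a separator such as sep = "_" would give names like c3_c4, which may be easier to split apart again later.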

Answer 2

It is maybe not a super elegant solution, but it gets the job done. If df is your data frame:

dups <- duplicated(as.list(df))  # TRUE for every column that repeats an earlier one
df_clean <- df[!dups]
df_dups <- df[dups]

for (z in seq_along(df_clean)) {
  i <- names(df_clean)[z]
  # which removed columns are identical to the current kept column?
  d <- vapply(df_dups, identical, logical(1), df_clean[[z]])
  # append their names (nothing is appended if there are none)
  names(df_clean)[z] <- paste0(i, paste(names(df_dups)[d], collapse = ""))
}

The output is:

df_clean
   c1 c2 c3c4
r1  1  0    1
r2  0  0    1
r3  0  1    0

This should also work if a column has multiple duplicates.
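
A quick way to check the multiple-duplicate case is to run a condensed variant of the loop above on made-up data in which three columns are identical. The data frame here is hypothetical test input, not taken from the question.

```r
# Hypothetical test data: c3, c4 and c5 are identical.
df <- data.frame(c1 = c(1, 0, 0), c2 = c(0, 0, 1),
                 c3 = c(1, 1, 0), c4 = c(1, 1, 0), c5 = c(1, 1, 0))

dups <- duplicated(as.list(df))
df_clean <- df[!dups]
df_dups <- df[dups]

for (z in seq_along(df_clean)) {
  # flag every removed column identical to the current kept column
  d <- vapply(df_dups, identical, logical(1), df_clean[[z]])
  names(df_clean)[z] <- paste0(names(df_clean)[z],
                               paste(names(df_dups)[d], collapse = ""))
}

names(df_clean)
# "c1" "c2" "c3c4c5"
```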

Answer 1: 2 votes, accepted, posted 2016-10-19 18:23:50
Answer 2: 1 vote, posted 2016-10-19 16:41:23
