Consolidate replicated columns in R

I have a data frame like this:

   c1 c2 c3 c4
 r1 1  0  1  1
 r2 0  0  1  1
 r3 0  1  0  0

In this case, c3 and c4 are exactly the same. I would like to remove the duplicate column but keep the names of both c3 and c4, to get this data frame:

   c1 c2 c3c4 
 r1 1  0  1
 r2 0  0  1
 r3 0  1  0

where the name of the third column joins the names of the identical columns.

I feel like there should be an elegant way to do this that I just can't think of. Any help would be greatly appreciated!

Edit: To clarify, my actual data frames are 1000 rows x 1000 columns, and I don't know in advance which columns are identical. So I need an automatic way to test whether columns are identical and, where they are, to combine their column names.

The extra information adds an interesting wrinkle! If you don't care about concatenating the column names, you could do something like this:

df <- data.frame(c1 = c(1,0,0), c2 = c(0,0,1), c3 = c(1,1,0), c4 = c(1,1,0), c5 = c(1,1,1), c6= c(1,1,1), c7 = c(2,2,2))

library(digest)
df_clean <- df[!duplicated(lapply(df, digest))]

At this point df_clean contains the data frame without any duplicate columns.

If the column names are genuinely important, this is how I would go about it after looking at thepule's answer:

df_dups <- df[duplicated(lapply(df, digest))]  # extract the duplicates

# Append the name of each removed column to the kept column it matches.
# seq_len() keeps the loops from running when there are no duplicates.
for (clean_col in seq_len(ncol(df_clean))) {
  for (dup_col in seq_len(ncol(df_dups))) {
    if (identical(df_clean[[clean_col]], df_dups[[dup_col]])) {
      colnames(df_clean)[clean_col] <- paste0(colnames(df_clean)[clean_col], colnames(df_dups)[dup_col])
    }
  }
}

With the extra duplicate columns added for testing, the output of str(df_clean) looks like this:

'data.frame':   3 obs. of  5 variables:
 $ c1  : num  1 0 0
 $ c2  : num  0 0 1
 $ c3c4: num  1 1 0
 $ c5c6: num  1 1 1
 $ c7  : num  2 2 2
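
For wide data frames (the question mentions 1000 columns), the nested loop can be avoided by grouping column names under a per-column key. The following is only a minimal sketch and not part of the original answer. The function name merge_duplicate_columns is made up for illustration, and the key is built with paste(), which assumes that distinct values also print distinctly (doubles that differ only beyond about 15 significant digits would be treated as equal). A hash such as digest() could be swapped in for the key if that assumption is too weak.

```r
# Hypothetical helper: drop duplicate columns and join their names.
merge_duplicate_columns <- function(df) {
  # One string per column; identical columns get identical keys.
  # Assumption: pasting the values is a faithful fingerprint of the column.
  key <- vapply(df, function(col) paste(col, collapse = "\r"), character(1))

  # Group column names by key, keeping groups in order of first appearance.
  groups <- split(names(df), factor(key, levels = unique(key)))

  # Keep the first column of each group and rename it to the joined names.
  out <- df[!duplicated(key)]
  names(out) <- unname(vapply(groups, paste, character(1), collapse = ""))
  out
}

df <- data.frame(c1 = c(1, 0, 0), c2 = c(0, 0, 1), c3 = c(1, 1, 0),
                 c4 = c(1, 1, 0), c5 = c(1, 1, 1), c6 = c(1, 1, 1),
                 c7 = c(2, 2, 2))
result <- merge_duplicate_columns(df)
names(result)
```

Each column is visited once to build its key, so the work grows with the number of columns rather than with the number of column pairs.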

This may not be the most elegant solution, but it gets the job done. If df is your data frame:

dups <- duplicated(as.list(df))  # TRUE for every column that repeats an earlier one
df_clean <- df[!dups]
df_dups <- df[dups]

for (z in seq_along(df_clean)) {
  i <- names(df_clean)[z]
  # positions of the removed columns that are identical to the kept column
  d <- which(vapply(df_dups, function(x) identical(x, df_clean[[z]]), logical(1)))
  # append their names; nothing is appended when there is no match
  names(df_clean)[z] <- paste0(i, paste(names(df_dups)[d], collapse = ""))
}

The output is:

df_clean
   c1 c2 c3c4
r1  1  0    1
r2  0  0    1
r3  0  1    0

This should also work when a column has more than one duplicate.
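
To check the multiple-duplicate case, here is a small self-contained sketch of the same idea. The data frame df3 and the names kept and dropped are made up for illustration; c2, c3 and c4 are deliberately identical.

```r
# Illustrative data: c2, c3 and c4 hold the same values.
df3 <- data.frame(c1 = c(1, 0, 0), c2 = c(1, 1, 0),
                  c3 = c(1, 1, 0), c4 = c(1, 1, 0))

dups <- duplicated(as.list(df3))
kept <- df3[!dups]     # first occurrence of each distinct column
dropped <- df3[dups]   # later copies

for (z in seq_along(kept)) {
  # which dropped columns are identical to the z-th kept column
  same <- vapply(dropped, identical, logical(1), kept[[z]])
  names(kept)[z] <- paste0(names(kept)[z],
                           paste(names(dropped)[same], collapse = ""))
}

names(kept)
```

Here the names of both removed copies are appended to c2, so the result has the columns c1 and c2c3c4.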
