
Find groups of duplicates in data frame by all columns except one

I have a large data frame. For some purposes I need to do the following:

  • Select one column in this data frame
  • Iterate over all rows of the data frame, ignoring the selected column
  • Select all rows that are equal in every element except the selected column
  • Group them so that the group name is a row index and the group values are the indexes of its duplicated rows.

I have written a function for this task, but it is slow because of a nested loop. I would like some ideas on how this code can be improved.

Say we have a data frame like this:

  V1 V2 V3 V4
1  1  2  1  2
2  1  2  2  1
3  1  1  1  2
4  1  1  2  1
5  2  2  1  2

And we want to get this list as output:

diff.dataframe("V2", conf.new, conf.new)

Output:

$`1`
[1] 1

$`2`
[1] 2

$`3`
[1] 1 3

$`4`
[1] 2 4

$`5`
[1] 5

The following code reaches the goal, but it is too slow. Is it possible to improve it somehow?

diff.dataframe <- function(param, df1, df2){
  excl.names <- c(param)
  df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names], as.character), stringsAsFactors=FALSE)
  df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names], as.character), stringsAsFactors=FALSE)
  list.out <- list()

  for (i in 1:nrow(df1.excl)){
     for (j in 1:nrow(df2.excl)){
        if (paste(df1.excl[i,],collapse='') == paste(df2.excl[j,], collapse='')){
          if (!as.character(i) %in% unlist(list.out)){
            list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
          }
        }
     }
  }
  return(list.out)
}
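
One way to avoid the nested loop, offered as a rough sketch rather than a drop-in replacement, is to build a single key per row from every column except the selected one and then split the row indices by that key. The helper name group_by_all_but and the "\r" separator are assumptions made for illustration, and conf.new is rebuilt here from the example table above. Note that, unlike diff.dataframe, this sketch lists every row together with all of its duplicates, so its output for rows 1 and 2 differs from the list shown earlier.

```r
# Hypothetical helper (not from the original post): for each row, return the
# indices of all rows that match it in every column except `param`.
group_by_all_but <- function(param, df) {
  keep <- setdiff(names(df), param)
  # One string key per row. A separator is used so that "1" + "12" and
  # "11" + "2" do not collapse into the same key.
  key <- do.call(paste, c(unname(lapply(df[keep], as.character)), sep = "\r"))
  # Group the row indices by key, then look the group up once per row.
  groups <- split(seq_len(nrow(df)), key)
  out <- unname(groups[key])
  names(out) <- as.character(seq_len(nrow(df)))
  out
}

# The example data frame from the question, written out explicitly.
conf.new <- data.frame(V1 = c(1, 1, 1, 1, 2),
                       V2 = c(2, 2, 1, 1, 2),
                       V3 = c(1, 2, 1, 2, 1),
                       V4 = c(2, 1, 2, 1, 2))
res <- group_by_all_but("V2", conf.new)
```

Because each row is pasted and looked up only once, the work grows roughly linearly with the number of rows instead of quadratically.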

Let's generate some data first:

df <- as.data.frame(matrix(sample(2, 20, TRUE), 5))

# Produces df like this
  V1 V2 V3 V4
1  2  1  1  1
2  2  1  2  2
3  1  1  2  2
4  1  2  1  1
5  1  2  1  1

We then loop through the rows with lapply. Each row i is compared to all rows of df (including itself) with apply. Rows with <= 1 differences return TRUE and the others return FALSE, producing a logical vector, which we convert to a numeric vector of row indices with which.

lapply(seq_len(nrow(df)), function(i)
    unname(which(apply(df, 1, function(x) sum(x != df[i,]) <= 1))))

# Produces output like this
[[1]]
[1] 1

[[2]]
[1] 2 3

[[3]]
[1] 2 3

[[4]]
[1] 4 5

[[5]]
[1] 4 5
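
The same "at most one difference" comparison can probably be done without the inner apply by working on a matrix. This is only a sketch under the assumption that all columns share one type, so that as.matrix() loses nothing; the names m and near_rows are made up for illustration, and the data frame is the example one printed above, written out explicitly.

```r
# The example data frame shown in the answer above, written out explicitly.
df <- data.frame(V1 = c(2, 2, 1, 1, 1),
                 V2 = c(1, 1, 1, 2, 2),
                 V3 = c(1, 2, 2, 1, 1),
                 V4 = c(1, 2, 2, 1, 1))

# Assumption: every column has the same type, so as.matrix() is lossless.
m <- as.matrix(df)

near_rows <- lapply(seq_len(nrow(m)), function(i) {
  # Repeat row i to the shape of m, then count mismatching cells per row.
  n_diff <- rowSums(m != m[rep(i, nrow(m)), , drop = FALSE])
  # Keep the indices of rows that differ from row i in at most one column.
  unname(which(n_diff <= 1))
})
```

For this data, near_rows holds the same five index vectors as the output listed above.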
