Find groups of duplicates in data frame by all columns except one
I have a large data frame. For some purposes I need to find groups of rows that are duplicates of each other in all columns except one.
I have written a function for this task, but it is slow because of a nested loop. I would like some ideas on how this code can be improved.
Say we have a data frame like this:
V1 V2 V3 V4
1 1 2 1 2
2 1 2 2 1
3 1 1 1 2
4 1 1 2 1
5 2 2 1 2
And we want to get this list as output:
diff.dataframe("V2", conf.new, conf.new)
Output:
$`1`
[1] 1
$`2`
[1] 2
$`3`
[1] 1 3
$`4`
[1] 2 4
$`5`
[1] 5
The following code reaches the goal, but it is too slow. Is it possible to improve it somehow?
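diff.dataframe <- function(param, df1, df2){
  # Drop the excluded column(s) and convert the remaining columns to character
  excl.names <- c(param)
  df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names], as.character), stringsAsFactors=FALSE)
  df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names], as.character), stringsAsFactors=FALSE)
  list.out <- list()
  # Compare every row of df1 with every row of df2 (this nested loop is the slow part)
  for (i in 1:nrow(df1.excl)){
    for (j in 1:nrow(df2.excl)){
      # Two rows match when their pasted values are identical
      if (paste(df1.excl[i,],collapse='') == paste(df2.excl[j,], collapse='')){
        # Skip row i if it is already recorded as a member of some group
        if (!as.character(i) %in% unlist(list.out)){
          list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
        }
      }
    }
  }
  return(list.out)
}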
Let's generate some data first:
df <- as.data.frame(matrix(sample(2, 20, TRUE), 5))
# Produces df like this
V1 V2 V3 V4
1 2 1 1 1
2 2 1 2 2
3 1 1 2 2
4 1 2 1 1
5 1 2 1 1
We then loop through the rows with `lapply`. Each row `i` is compared to all rows of `df` (including itself) with `apply`. Rows with <= 1 differences return TRUE, the others return FALSE, producing a logical vector, which we convert to a numeric vector of row indices with `which`.
lapply(1:nrow(df), function(i)
  which(apply(df, 1, function(x) sum(x != df[i, ]) <= 1)))
# Produces output like this
[[1]]
[1] 1
[[2]]
[1] 2 3
[[3]]
[1] 2 3
[[4]]
[1] 4 5
[[5]]
[1] 4 5
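The approach above counts differences across all columns and still compares every row with every other row. If the goal is exact duplicates in every column except one named column, as in the question, a possible alternative is to build one key per row and let `split` do the grouping. The following is only a sketch under assumptions: `group.duplicates` is a hypothetical helper name, the tab separator assumes no cell value contains a tab, and the result has one entry per group rather than one entry per row as in the question's output.

```r
# Hypothetical helper (not from the original post): group row indices by
# the values in every column except `param`.
group.duplicates <- function(param, df) {
  # Keep every column except the excluded one; drop = FALSE keeps a
  # data frame even when only a single column remains.
  keep <- df[, !names(df) %in% param, drop = FALSE]
  # Build one key string per row. Assumption: no value contains a tab,
  # so distinct rows cannot produce the same key.
  key <- do.call(paste, c(unname(as.list(keep)), sep = "\t"))
  # Collect the row indices that share a key, in order of first appearance.
  unname(split(seq_len(nrow(df)), factor(key, levels = unique(key))))
}

# The example data frame from the question
conf.new <- data.frame(V1 = c(1, 1, 1, 1, 2),
                       V2 = c(2, 2, 1, 1, 2),
                       V3 = c(1, 2, 1, 2, 1),
                       V4 = c(2, 1, 2, 1, 2))

groups <- group.duplicates("V2", conf.new)
print(groups)
```

On the question's example this puts rows 1 and 3 in one group, rows 2 and 4 in another, and row 5 on its own. Each row is visited once to build its key, so the nested loop is avoided.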