
Select all rows which are duplicates except for one column

I want to find rows in a dataset where the values in all columns, except for one, match. After much messing around trying unsuccessfully to get duplicated() to return all instances of the duplicate rows (not just the first instance), I figured out a way to do it (below).

For example, I want to identify all rows in the Iris dataset that are equal except for Petal.Width.

require(tidyverse)
x <- iris %>% select(-Petal.Width)   # drop the column to ignore
dups <- x[x %>% duplicated(), ]      # later copies of each duplicate set
answer <- iris %>% semi_join(dups)   # recover all matching rows from iris

> answer 
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1           5.1         3.5          1.4         0.2    setosa
2           4.9         3.1          1.5         0.1    setosa
3           4.8         3.0          1.4         0.1    setosa
4           5.1         3.5          1.4         0.3    setosa
5           4.9         3.1          1.5         0.2    setosa
6           4.8         3.0          1.4         0.3    setosa
7           5.8         2.7          5.1         1.9 virginica
8           6.7         3.3          5.7         2.1 virginica
9           6.4         2.8          5.6         2.1 virginica
10          6.4         2.8          5.6         2.2 virginica
11          5.8         2.7          5.1         1.9 virginica
12          6.7         3.3          5.7         2.5 virginica

As you can see, that works, but this is one of those times when I'm almost certain that lots of other folks need this functionality, and that I'm ignorant of a single function that does this in fewer steps or a generally tidier way. Any suggestions?

An alternate approach, from at least two other posts, applied to this case would be:

answer = iris[duplicated(iris[-4]) | duplicated(iris[-4], fromLast = TRUE),]

But that also seems like just a different workaround instead of a single function. Both approaches take about the same amount of time (0.08 sec on my system). Is there no neater/faster way of doing this?

e.g. something like iris %>% duplicates(all = TRUE, ignore = Petal.Width)
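There doesn't seem to be such a function in base R or the tidyverse, but a small wrapper gives the wished-for call; duplicates and its ignore argument below are hypothetical names for illustration, not an existing API:

```r
# Hypothetical helper (not an existing function): keep every row that
# duplicates another row on all columns except those named in `ignore`.
duplicates <- function(df, ignore) {
  keys <- df[, setdiff(names(df), ignore), drop = FALSE]
  df[duplicated(keys) | duplicated(keys, fromLast = TRUE), ]
}

answer <- duplicates(iris, ignore = "Petal.Width")  # the same 12 rows as above
```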

iris[duplicated(iris[,-4]) | duplicated(iris[,-4], fromLast = TRUE),]

Of the duplicate rows (disregarding column 4), duplicated(iris[,-4]) gives the second row of each duplicate set: rows 18, 35, 46, 133, 143 and 145. duplicated(iris[,-4], fromLast = TRUE) gives the first row of each set: rows 1, 10, 13, 102, 125 and 129. Combining the two with | yields 12 TRUEs, so the subscript returns the expected output.
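The two index sets can be checked directly, using the row numbers quoted above:

```r
# Second-and-later members of each duplicate set (ignoring Petal.Width)
which(duplicated(iris[, -4]))
#> [1]  18  35  46 133 143 145

# First member of each duplicate set
which(duplicated(iris[, -4], fromLast = TRUE))
#> [1]   1  10  13 102 125 129
```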

Or perhaps with dplyr: basically, you group on all variables except Petal.Width, count how often each combination occurs, and filter those that occur more than once.

library(dplyr)
iris %>% 
  group_by_at(vars(-Petal.Width)) %>% 
  filter(n() > 1)

   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
          <dbl>       <dbl>        <dbl>       <dbl>    <fctr>
 1          5.1         3.5          1.4         0.2    setosa
 2          4.9         3.1          1.5         0.1    setosa
 3          4.8         3.0          1.4         0.1    setosa
 4          5.1         3.5          1.4         0.3    setosa
 5          4.9         3.1          1.5         0.2    setosa
 6          4.8         3.0          1.4         0.3    setosa
 7          5.8         2.7          5.1         1.9 virginica
 8          6.7         3.3          5.7         2.1 virginica
 9          6.4         2.8          5.6         2.1 virginica
10          6.4         2.8          5.6         2.2 virginica
11          5.8         2.7          5.1         1.9 virginica
12          6.7         3.3          5.7         2.5 virginica
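As an aside, group_by_at(vars(...)) has since been superseded in dplyr; assuming dplyr >= 1.0 is available, the same grouping can be written with across():

```r
library(dplyr)

# Same result as above, in the newer across() spelling
iris %>%
  group_by(across(-Petal.Width)) %>%
  filter(n() > 1) %>%
  ungroup()
```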

I think janitor can do this somewhat directly.

library(janitor)

get_dupes(iris, !Petal.Width)

# get_dupes(iris, !Petal.Width)[,names(iris)] # alternative: no count column
   Sepal.Length Sepal.Width Petal.Length   Species dupe_count Petal.Width
1           4.8         3.0          1.4    setosa          2         0.1
2           4.8         3.0          1.4    setosa          2         0.3
3           4.9         3.1          1.5    setosa          2         0.1
4           4.9         3.1          1.5    setosa          2         0.2
5           5.1         3.5          1.4    setosa          2         0.2
6           5.1         3.5          1.4    setosa          2         0.3
7           5.8         2.7          5.1 virginica          2         1.9
8           5.8         2.7          5.1 virginica          2         1.9
9           6.4         2.8          5.6 virginica          2         2.1
10          6.4         2.8          5.6 virginica          2         2.2
11          6.7         3.3          5.7 virginica          2         2.1
12          6.7         3.3          5.7 virginica          2         2.5

I looked into the source of duplicated but would be interested to see if anyone can find anything faster, though it might involve going to Rcpp or something similar. On my machine the base method is the fastest, but your original method is actually better than the most readable dplyr method. I think that wrapping a function like this for your own purposes ought to be sufficient, since your run times don't seem excessively long anyway; you can simply do iris %>% opts("Petal.Width") for pipeability if that's the main concern.

library(tidyverse)
library(microbenchmark)

opt1 <- function(df, ignore) {
  ignore = enquo(ignore)
  x <- df %>% select(-!!ignore)
  dups <- x[x %>% duplicated(), ]
  answer <- df %>% semi_join(dups)   # df, not iris, so the function generalizes
}

opt2 <- function(df, ignore) {
  index <-  which(colnames(df) == ignore)
  df[duplicated(df[-index]) | duplicated(df[-index], fromLast = TRUE), ]
}

opt3 <- function(df, ignore){
  ignore <-  enquo(ignore)
  df %>%
    group_by_at(vars(-!!ignore)) %>%
    filter(n() > 1)
}


microbenchmark(
  opt1 = suppressMessages(opt1(iris, Petal.Width)),
  opt2 = opt2(iris, "Petal.Width"),
  opt3 = opt3(iris, Petal.Width)
)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq       max neval cld
#>  opt1 3.427753 4.024185 4.851445 4.464072 5.069216 12.800890   100  b 
#>  opt2 1.712975 1.908130 2.403859 2.133632 2.542871  7.557102   100 a  
#>  opt3 6.604614 7.334304 8.461424 7.920369 8.919128 24.255678   100   c

Created on 2018-07-12 by the reprex package (v0.2.0).
