Find duplicated rows with original

I can get duplicated rows in R on a data.table dt using

dt[duplicated(dt, by=someColumns)] 

However, I would like to get the duplicated rows together with the "originals" they duplicate. For example, consider dt:

col1, col2, col3 
   A     B    C1
   A     B    C2
   A    B1    C1

Now, dt[duplicated(dt, by=c("col1", "col2"))] would give me something along the lines of

col1, col2, col3
   A     B    C2

I would like to get this together with the row that it did not flag as a duplicate, that is

col1, col2, col3 
   A     B    C1
   A     B    C2

Speed comparison of answers:

> system.time(dt[duplicated(dt, by = t) | duplicated(dt, by = t, fromLast = TRUE)])
   user  system elapsed 
  0.008   0.000   0.009 
> system.time(dt[, .SD[.N > 1], by = t])
   user  system elapsed 
 77.555   0.100  77.703 
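
A comparison of this kind can be sketched on synthetic data. The table `big`, its size, and the name `key_cols` below are assumptions made for illustration, not the asker's actual data, so the absolute timings will differ from those above. The sketch also checks that both approaches select the same rows.

```r
library(data.table)

set.seed(1)
n <- 5000                                   # hypothetical size, kept small so .SD finishes quickly
big <- data.table(col1 = sample(letters, n, replace = TRUE),
                  col2 = sample(1:200, n, replace = TRUE),
                  col3 = seq_len(n))        # unique row id, used to compare results
key_cols <- c("col1", "col2")               # plays the role of `t` in the timings above

# Approach 1: two vectorised scans with duplicated()
t_dup <- system.time(
  res_dup <- big[duplicated(big, by = key_cols) |
                 duplicated(big, by = key_cols, fromLast = TRUE)]
)

# Approach 2: build .SD for every group and keep groups with more than one row
t_sd <- system.time(
  res_sd <- big[, .SD[.N > 1], by = key_cols]
)

# The two results may differ in row order, so compare the sorted row ids
same_rows <- identical(sort(res_dup$col3), sort(res_sd$col3))

print(c(duplicated = t_dup[["elapsed"]], SD = t_sd[["elapsed"]]))
print(same_rows)
```

The gap usually widens as the number of groups grows, because the `.SD` version does its work once per group while `duplicated` scans the table a fixed number of times.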

I believe this is essentially a duplicate of this question, though I can see how you may not have found it.

Here is an answer that builds on the logic outlined in the referenced question:

dt <- read.table(text = "col1 col2 col3 
   A     B    C1
   A     B    C2
   A    B1    C1", header = TRUE, stringsAsFactors = FALSE)


idx <- duplicated(dt[, 1:2]) | duplicated(dt[, 1:2], fromLast = TRUE)

dt[idx, ]
#---
  col1 col2 col3
1    A    B   C1
2    A    B   C2
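
To see why both scans are needed, the sketch below prints each logical vector separately. The helper name `all_dups` is hypothetical and introduced only for illustration. The forward scan leaves the first copy in each group unflagged, the backward scan leaves the last copy unflagged, and their union flags every row whose key occurs more than once.

```r
# Hypothetical helper: TRUE for every row whose key columns occur more than once
all_dups <- function(df, cols) {
  keys <- df[, cols, drop = FALSE]
  duplicated(keys) | duplicated(keys, fromLast = TRUE)
}

df <- data.frame(col1 = c("A", "A", "A"),
                 col2 = c("B", "B", "B1"),
                 col3 = c("C1", "C2", "C1"),
                 stringsAsFactors = FALSE)

fwd <- duplicated(df[, 1:2])                    # FALSE  TRUE FALSE
bwd <- duplicated(df[, 1:2], fromLast = TRUE)   #  TRUE FALSE FALSE
idx <- all_dups(df, c("col1", "col2"))          #  TRUE  TRUE FALSE

df[idx, ]
```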

Since you are using data.table, this is probably what you want:

library(data.table)
dt <- data.table(dt)
dt[duplicated(dt, by = c("col1", "col2")) | duplicated(dt, by = c("col1", "col2"), fromLast = TRUE)]
#---
   col1 col2 col3
1:    A    B   C1
2:    A    B   C2

You can easily achieve this just by using .N, the number of rows in each group:

dt[, .SD[.N > 1], by = list(col1, col2)]
##    col1 col2 col3
## 1:    A    B   C1
## 2:    A    B   C2
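
A commonly used variant of the same idea collects row numbers with `.I` instead of building `.SD` for every group, which tends to be cheaper when there are many groups. This is a sketch on the example data, not a benchmarked recommendation. The variable names `rows` and `res` are illustrative.

```r
library(data.table)

dt <- data.table(col1 = c("A", "A", "A"),
                 col2 = c("B", "B", "B1"),
                 col3 = c("C1", "C2", "C1"))

# .I holds the row numbers of the current group. Keep them only for groups
# with more than one row, then subset dt once with the collected indices.
rows <- dt[, .I[.N > 1], by = .(col1, col2)]$V1
res  <- dt[rows]
res
```

Rows come back in group order, which matches the original order here but may not in general.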

Edit:

You can also use a keyed binary search, which is very efficient in general, though the duplicated approach still seems to be faster here. Note that := adds the helper column indx to dt by reference:

setkey(dt[, indx := .N, by = list(col1, col2)], indx)[!J(1)]
##    col1 col2 col3 indx
## 1:    A    B   C1    2
## 2:    A    B   C2    2
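
The keyed subset above writes a helper column into dt. If that side effect is unwanted, a join-based sketch along the following lines leaves dt untouched. The intermediate names `counts` and `res` are assumptions for illustration. It counts rows per key in a separate table and joins the keys that occur more than once back onto dt.

```r
library(data.table)

dt <- data.table(col1 = c("A", "A", "A"),
                 col2 = c("B", "B", "B1"),
                 col3 = c("C1", "C2", "C1"))

# One row per (col1, col2) combination, with its row count in N
counts <- dt[, .N, by = .(col1, col2)]

# Join the repeated keys back onto dt; the result also carries N from counts
res <- dt[counts[N > 1], on = .(col1, col2)]
res[, N := NULL]          # drop the count column from the result only

res
names(dt)                 # dt itself still has only col1, col2, col3
```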
