Finding identical rows in subgroups with data.table

My table has two IDs. For each value of the first ID, I'd like to find out whether two rows with different values of the second ID are identical (ignoring the second ID column itself). A table very similar to mine (but much, much smaller) is:

library(data.table)

DT <- data.table(id   = rep(LETTERS, each=10),
                 var1 = rnorm(260),
                 var2 = rnorm(260))


DT[, id2 := sample(c("A","B"), 10, TRUE), by=id] # I need this to simulate a different
                                                 # distribution of the id2 values for
                                                 # each id value, like in my real table

setkey(DT, id, id2)

DT$var1[1] <- DT$var1[2] # this simulates redundancies
DT$var2[1] <- DT$var2[2] # within the same id and id2

DT$var1[8] <- DT$var1[2] # this simulates two rows with different id2
DT$var2[8] <- DT$var2[2] # and the same var1 and var2. I'm after such rows!

> head(DT, 10)
    id           var1           var2 id2
 1:  A  0.11641260243  0.52202152686   A
 2:  A  0.11641260243  0.52202152686   A
 3:  A -0.46631312530  1.16263285108   A
 4:  A -0.01301484819  0.44273945065   A
 5:  A  1.84623329221 -0.09284888054   B
 6:  A -1.29139503119 -1.90194818212   B
 7:  A  0.96073555968 -0.49326620160   B
 8:  A  0.11641260243  0.52202152686   B
 9:  A  0.86254993530 -0.21280899589   B
10:  A  1.41142798959  1.13666002123   B

I'm currently using this code:

res <- DT[, {a <- unique(.SD)[, -3, with=FALSE]  # Removes redundancies like in rows 1 and 2
                                                 # and then drops the id2 column.
             !identical(a, unique(a))},          # Looks for identical rows
          by=id]                                 # (in var1 and var2 only!)

> head(res, 3)
   id    V1
1:  A  TRUE
2:  B FALSE
3:  C FALSE

Everything seems to work, but with my real table (almost 80M rows and 4.5M values of unique(DT$id)) my code takes 2.1 hours.

Does anybody have tips to speed up the code above? Am I perhaps not following the best practices needed to benefit from data.table's capabilities? Thanks in advance!

EDIT:

Some timings to compare my code with @Arun's:

DT <- data.table(id   = rep(LETTERS,each=10000),
                 var1 = rnorm(260000),
                 var2 = rnorm(260000))

DT[, id2 := sample(c("A","B"), 10000, TRUE), by=id] # simulates a different id2 distribution for each id

setkey(DT)

> system.time(unique(DT)[, any(duplicated(.SD)), by = id, .SDcols = c("var1", "var2")])
   user  system elapsed 
   0.48    0.00    0.49 
> system.time(DT[, {a=unique(.SD)[,-3,with=F]   
+                   any(duplicated(a))}, 
+    by=id])
   user  system elapsed 
   1.09    0.00    1.10 
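Before trusting the timings, it seems worth checking that the two expressions return the same answer. The sketch below is only an illustration under a few assumptions of mine: the seed, the smaller group size n_per_id, and the use of integer values drawn with sample() (so that some cross-id2 matches can actually occur) are not part of the benchmark above.

```r
library(data.table)

set.seed(1)        # arbitrary seed, only for reproducibility
n_per_id <- 1000   # assumption: smaller than the 10000 rows per id used above

# Integer values instead of rnorm() so that identical (var1, var2) pairs can occur
DT <- data.table(id   = rep(LETTERS, each = n_per_id),
                 var1 = sample(1000, 26 * n_per_id, replace = TRUE),
                 var2 = sample(1000, 26 * n_per_id, replace = TRUE))
DT[, id2 := sample(c("A", "B"), n_per_id, TRUE), by = id]
setkey(DT)         # key on all columns: id, var1, var2, id2

# @Arun's version: de-duplicate once, then look for repeated (var1, var2) per id
res_arun <- unique(DT)[, any(duplicated(.SD)), by = id,
                       .SDcols = c("var1", "var2")]

# My version: de-duplicate inside each group, drop id2 (3rd column of .SD), then check
res_mine <- DT[, {a <- unique(.SD)[, -3, with = FALSE]
                  any(duplicated(a))},
               by = id]

# Both should flag exactly the same ids
identical(res_arun$V1, res_mine$V1)
```

If the last line returns TRUE, the speed difference comes from where the de-duplication happens (once on the whole table versus once per group), not from a difference in results.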

I think I got what I wanted!

How about this?

unique(setkey(DT))[, any(duplicated(.SD)), by=id, .SDcols = c("var1", "var2")]
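To make it easier to see what the one-liner does, here is a minimal sketch on a tiny hand-made table. The table toy and its values are made up purely for illustration: id "a" has the same (var1, var2) pair under both id2 values, while id "b" only has a redundancy within a single id2.

```r
library(data.table)

# Hypothetical toy data, not taken from the question
toy <- data.table(
  id   = c("a", "a", "a", "b", "b", "b"),
  id2  = c("A", "A", "B", "A", "A", "B"),
  var1 = c(1, 1, 1, 2, 2, 3),
  var2 = c(5, 5, 5, 6, 6, 7)
)

# Step 1: key on all columns and drop fully identical rows,
# i.e. redundancies within the same id and id2
dedup <- unique(setkey(toy))

# Step 2: within each id, any (var1, var2) pair that still repeats
# must come from rows with different id2 values
res <- dedup[, any(duplicated(.SD)), by = id, .SDcols = c("var1", "var2")]
res
#    id    V1
# 1:  a  TRUE
# 2:  b FALSE
```

The point of the design is that the de-duplication is done once on the whole table, so the per-group work is reduced to a single duplicated() call on two columns.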

It takes about 140 seconds to set the key on my "slow" machine. And the actual grouping is still going on... :)


This is the huge data I'm testing on:

set.seed(1234)
DT <- data.table(id = rep(1:4500000, each=10), 
                 var1 = sample(1000, 45000000, replace=TRUE), 
                 var2 = sample(1000, 45000000, replace=TRUE))
DT[, id2 := sample(c("A","B"), 10, TRUE), by=id]
