Finding identical rows in subgroups with data.table
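My table has two IDs. For each value of the first ID, I want to check whether any two rows with different values of the second ID are identical (ignoring the second ID column itself). A table very similar to mine (but much smaller) is: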
library(data.table)
DT <- data.table(id = rep(LETTERS, each=10),
var1 = rnorm(260),
var2 = rnorm(260))
DT[, id2 := sample(c("A","B"), 10, TRUE), by=id] # I need this to simulate a different
# distribution of the id2 values for
# each id value, like in my real table
setkey(DT, id, id2)
DT$var1[1] <- DT$var1[2] # this simulates redundancies
DT$var2[1] <- DT$var2[2] # inside the same id and id2
DT$var1[8] <- DT$var1[2] # this simulates two rows with different id2
DT$var2[8] <- DT$var2[2] # and same var1 and var2. I'm after such rows!
> head(DT, 10)
id var1 var2 id2
1: A 0.11641260243 0.52202152686 A
2: A 0.11641260243 0.52202152686 A
3: A -0.46631312530 1.16263285108 A
4: A -0.01301484819 0.44273945065 A
5: A 1.84623329221 -0.09284888054 B
6: A -1.29139503119 -1.90194818212 B
7: A 0.96073555968 -0.49326620160 B
8: A 0.11641260243 0.52202152686 B
9: A 0.86254993530 -0.21280899589 B
10: A 1.41142798959 1.13666002123 B
I'm currently using this code:
res <- DT[, {a=unique(.SD)[,-3,with=FALSE] # Removes redundancies like in rows 1 and 2
# and then removes the id2 column.
!identical(a, unique(a))}, # Looks for identical rows
by=id] # (in var1 and var2 only!)
> head(res, 3)
id V1
1: A TRUE
2: B FALSE
3: C FALSE
Everything seems to work, but on my real table (almost 80M rows and 4.5M values of unique(DT$id)), my code takes 2.1 hours.
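Does anyone have tips for speeding up the code above? Am I failing to follow some best practice needed to benefit from data.table's capabilities? Thanks in advance to anyone!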
EDIT:
Some timings comparing my code with @Arun's:
DT <- data.table(id = rep(LETTERS,each=10000),
var1 = rnorm(260000),
var2 = rnorm(260000))
DT[, id2 := sample(c("A","B"), 10000, TRUE), by=id] # simulates a different id2 distribution per id
setkey(DT)
> system.time(unique(DT)[, any(duplicated(.SD)), by = id, .SDcols = c("var1", "var2")])
user system elapsed
0.48 0.00 0.49
> system.time(DT[, {a=unique(.SD)[,-3,with=F]
+ any(duplicated(a))},
+ by=id])
user system elapsed
1.09 0.00 1.10
I think I got what I wanted!
How about this?
unique(setkey(DT))[, any(duplicated(.SD)), by=id, .SDcols = c("var1", "var2")]
On my "slow" machine, setting the key takes about 140 seconds. The actual grouping is still running... :)
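To see why the one-liner works, here is a minimal sketch on a tiny hand-made table. The table `toy` and its values are hypothetical and only serve as an illustration. Keying on all columns and calling unique() drops rows that are fully identical (same id, id2, var1 and var2), so any duplicate in var1/var2 that remains within an id has to come from rows with different id2 values.

```r
library(data.table)

# Hypothetical toy data, small enough to check by eye
toy <- data.table(
  id   = c("A", "A", "A", "A", "B", "B", "B"),
  id2  = c("A", "A", "B", "B", "A", "B", "B"),
  var1 = c(1, 1, 1, 2, 5, 6, 6),
  var2 = c(9, 9, 9, 8, 7, 4, 4)
)

# Step 1: key on every column, then drop fully identical rows.
# This removes the redundancies inside the same id and id2.
setkey(toy)
dedup <- unique(toy)

# Step 2: within each id, compare var1/var2 only. A duplicate that
# survived step 1 must involve two different id2 values.
res <- dedup[, any(duplicated(.SD)), by = id, .SDcols = c("var1", "var2")]
print(res)
```

In this toy table, id "A" has the pair (1, 9) under both id2 values, while the repeated (6, 4) rows of id "B" share the same id2 and are removed in step 1. Because the key covers every column, the result should not depend on whether unique() defaults to the key columns or to all columns in the installed data.table version.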
This is the large dataset I'm testing on:
set.seed(1234)
DT <- data.table(id = rep(1:4500000, each=10),
var1 = sample(1000, 45000000, replace=TRUE),
var2 = sample(1000, 45000000, replace=TRUE))
DT[, id2 := sample(c("A","B"), 10, TRUE), by=id]
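Since the question is ultimately after the matching rows themselves, one alternative formulation (not from the thread, and not timed on the large dataset above) is to count the distinct id2 values within each (id, var1, var2) combination. This is only a sketch: the table `toy` and the column names `n_id2` and `has_match` are made up for illustration, and uniqueN() is assumed to be available in the installed data.table version.

```r
library(data.table)

# Hypothetical toy data, small enough to check by eye
toy <- data.table(
  id   = c("A", "A", "A", "A", "B", "B", "B"),
  id2  = c("A", "A", "B", "B", "A", "B", "B"),
  var1 = c(1, 1, 1, 2, 5, 6, 6),
  var2 = c(9, 9, 9, 8, 7, 4, 4)
)

# Count how many distinct id2 values share each (id, var1, var2) combination
per_combo <- toy[, .(n_id2 = uniqueN(id2)), by = .(id, var1, var2)]

# Combinations that occur under more than one id2: the rows of interest
matches <- per_combo[n_id2 > 1L]
print(matches)

# Collapse to one flag per id, comparable to the V1 column in the thread
res2 <- per_combo[, .(has_match = any(n_id2 > 1L)), by = id]
print(res2)
```

The flag per id carries the same information as the unique()/duplicated() approach, and `matches` additionally lists which var1/var2 combinations are shared across id2 values. Whether it is faster on 45M rows would need to be measured.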