I am working with a large data.table and need to find the row numbers for each group. Unfortunately, sorting the data.table is not an option, since its rows are referenced by index in several places (by id, time, ...), so I don't think setkey can be used.
What is the most efficient way to approach this problem?
I have currently tried which(...), k[..., which = TRUE], and k[, .I[...]]. Are there any quicker ways?
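For reference, all three approaches return the same integer vector of row numbers. A minimal toy sketch (the data here is invented for illustration, not the benchmark data below):

```r
library(data.table)

# Toy data: find the row numbers where a == "b"
k <- data.table(a = c("a", "b", "a", "b", "b"))

which(k$a == "b")          # base R which() on the underlying vector
k[a == "b", which = TRUE]  # data.table's which argument in i
k[, .I[a == "b"]]          # .I holds the row numbers inside j

# All three return the integer vector 2, 4, 5
```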
By benchmarking, which(...) seems faster than k[..., which = TRUE] for smaller data.tables (fewer than 10,000 rows; full code below):
test replications elapsed relative user.self sys.self
1 k[a == x, which = TRUE] 10 2.63 1.789 2.52 0.10
2 which(k$a == x) 10 1.47 1.000 1.47 0.00
3 setindex(k, a) 10 2.71 1.844 2.64 0.06
4 k[, .I[a == x]] 10 2.03 1.381 2.00 0.00
But as the number of rows grows, k[..., which = TRUE] is considerably quicker:
> rbenchmark::benchmark(
+ "A" = {
+ k <- data.table(
+ a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+ )
+ u <- unique(k$a)
+ m <- lapply(u, function(x) k[a == x, which = TRUE])
+ },
+ "B" = {
+ k <- data.table(
+ a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+ )
+ u <- unique(k$a)
+ m <- lapply(u, function(x) which(k$a == x))
+ },
+ "C" = {
+ k <- data.table(
+ a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+ )
+ u <- unique(k$a)
+ setindex(k, a)
+ m <- lapply(u, function(x) k[a == x, which = TRUE])
+ },
+ "D" = {
+ k <- data.table(
+ a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+ )
+ u <- unique(k$a)
+ setindex(k, a)
+ m <- lapply(u, function(x) k[, .I[a == x]])
+ },
+ replications = 10,
+ columns = c("test", "replications", "elapsed",
+ "relative", "user.self", "sys.self"))
test replications elapsed relative user.self sys.self
1 A 10 3.64 1.000 3.61 0.08
2 B 10 43.22 11.874 42.73 0.02
3 C 10 3.70 1.016 3.72 0.04
4 D 10 46.71 12.832 46.33 0.03
Using .I, this option returns a data.table with two columns: the first holds the unique values of a, and the second is a list column of the row indices where each value appears in k. The form is different from the OP's m, but the information is all there and just as easily accessible.
k[, .(idx = .(.I)), a]
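A minimal illustration of the shape of the result (toy data, not the benchmark data below):

```r
library(data.table)

# Toy data with two groups, appearing in interleaved order
k <- data.table(a = c("x", "y", "x", "y"))

# One output row per group; idx is a list column of row numbers
m2 <- k[, .(idx = .(.I)), a]

m2$a           # group labels in order of first appearance: "x", "y"
m2$idx[[1]]    # row numbers for group "x": 1, 3
m2$idx[[2]]    # row numbers for group "y": 2, 4
```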
Benchmarking:
library(data.table)
k <- data.table(a = sample(factor(seq_len(200)), size = 1e6, replace = TRUE))
microbenchmark::microbenchmark(
A = {
u <- unique(k$a)
m <- lapply(u, function(x) k[a == x, which = TRUE])
},
B = {
m2 <- k[, .(idx = .(.I)), a]
},
times = 100
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> A 282.0331 309.2662 335.30146 325.3355 350.51080 525.7929 100
#> B 9.7870 10.3598 13.04379 10.8292 12.73785 65.4864 100
all.equal(m, m2$idx)
#> [1] TRUE
all.equal(u, m2$a)
#> [1] TRUE