简体   繁体   中英

What is the most efficient method for finding row indices by group in a data.table in R?

I am working with a large data.table and need to find the row numbers by group. Unfortunately, sorting the datatable is not an option as they are indexed several places (by id, time, ...) so I don't think setkey can be used.

What is the most efficient way to approach this problem?

I have currently tried which(...) , k[..., which = TRUE] , and k[, .I[...]] . Are there any quicker ways?

By benchmarking, which(...) seems more effective than k[..., which = TRUE] for smaller data.tables (less than 10 000 rows, full code below):

                        test replications elapsed relative user.self sys.self
1    k[a == x, which = TRUE]           10    2.63    1.789      2.52     0.10
2    which(k$a == x)                   10    1.47    1.000      1.47     0.00
3    setindex(k, a)                    10    2.71    1.844      2.64     0.06
4    k[, .I[a == x]]                   10    2.03    1.381      2.00     0.00

But as the row number grows, k[..., which = TRUE] is considerably quicker:

> rbenchmark::benchmark(
+   "A" = {
+     k <- data.table(
+       a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+     )
+     u <- unique(k$a)
+     m <- lapply(u, function(x) k[a == x, which = TRUE])
+     },
+   "B" = {
+     k <- data.table(
+       a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+     )
+     u <- unique(k$a)
+     m <- lapply(u, function(x) which(k$a == x))
+   },
+   "C" = {
+     k <- data.table(
+       a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+     )
+     u <- unique(k$a)
+     setindex(k, a)
+     m <- lapply(u, function(x) k[a == x, which = TRUE])
+   },
+   "D" = {
+     k <- data.table(
+       a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+     )
+     u <- unique(k$a)
+     setindex(k, a)
+     m <- lapply(u, function(x) k[, .I[a == x]])
+   },
+   replications = 10,
+   columns = c("test", "replications", "elapsed",
+               "relative", "user.self", "sys.self"))
  test replications elapsed relative user.self sys.self
1    A           10    3.64    1.000      3.61     0.08
2    B           10   43.22   11.874     42.73     0.02
3    C           10    3.70    1.016      3.72     0.04
4    D           10   46.71   12.832     46.33     0.03

Using .I , this option returns a data.table with two columns. The first column is the unique values in a , and the second is a list of indices where each value appears in k . The form is different than the OP's m , but the information is all there and just as easily accessible.

k[, .(idx = .(.I)), a]

Benchmarking:

library(data.table)

k <- data.table(a = sample(factor(seq_len(200)), size = 1e6, replace = TRUE))

microbenchmark::microbenchmark(
  A = {
    u <- unique(k$a)
    m <- lapply(u, function(x) k[a == x, which = TRUE])
  },
  B = {
    m2 <- k[, .(idx = .(.I)), a]
  },
  times = 100
)
#> Unit: milliseconds
#>  expr      min       lq      mean   median        uq      max neval
#>     A 282.0331 309.2662 335.30146 325.3355 350.51080 525.7929   100
#>     B   9.7870  10.3598  13.04379  10.8292  12.73785  65.4864   100

all.equal(m, m2$idx)
#> [1] TRUE

all.equal(u, m2$a)
#> [1] TRUE

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM