What is the FASTEST way in R to group by a data.table column and count unique values in another column?

Background: This runs in a swapping optimization algorithm. This particular line runs in the inner while loop so it is executed a very large number of times. Everything else in the loop runs quite fast.

Example data.table "Inventory_test" created below:

library(data.table)

NestCount2 <- c(
  "1","1","1","1","1","1","1","1","2","2","3","3","3","3","3","3",
  "3","3","3","4","4","4","5","5","5","5","5","5","5","5","5","6",
  "6","6","6","6","6","6","6","6","",""
)
Part2 <- c(
  "Shroud","Shroud","Shroud","Shroud","Shroud","Shroud","Shroud",
  "Shroud","S1Nozzle","S1Nozzle","Shroud","Shroud","Shroud","Shroud",
  "Shroud","Shroud","Shroud","Shroud","Shroud","S2Nozzle","S2Nozzle",
  "S2Nozzle","Shroud","Shroud","Shroud","Shroud","Shroud","Shroud",
  "Shroud","Shroud","Shroud","Shroud","Shroud","Shroud","Shroud",
  "Shroud","Shroud","Shroud","Shroud","Shroud","*","*"
)
Inventory_test <- data.table(data.frame(NestCount2, Part2))

# Methods already tried (the profiler shows essentially identical performance):
ptcts <- table(unique(Inventory_test[,c("Part2","NestCount2")])$Part2)
ptcts2 <- Inventory_test[, .(count = uniqueN(NestCount2)), by = Part2]$count

Using the RStudio profiler, I noticed that about half the time of the ptcts line is spent on the column indexing alone, Inventory_test[,c("Part2","NestCount2")] . I've looked for quicker methods but haven't found any :(. Any help would be much appreciated!

I ran some benchmarks. So far, the fastest approach seems to be to skip by entirely and use table() instead: Inventory_test[, rowSums(table(Part2, NestCount2) > 0L)] .

library(data.table)
library(microbenchmark)
library(ggplot2)

setkey(Inventory_test, Part2)

microbenchmark(
  unit = "relative",
  m1 = table(unique(Inventory_test[, c("Part2", "NestCount2")])$Part2),
  m2 = Inventory_test[, .(count = uniqueN(NestCount2)), by = Part2]$count,
  m3 = Inventory_test[, .N, by = .(Part2, NestCount2)][, .N, by = Part2],
  m4 = Inventory_test[, uniqueN(NestCount2), by = Part2]$V1,
  m5 = Inventory_test[, uniqueN(paste(Part2, NestCount2)), by = Part2],
  m6 = Inventory_test[, length(unique(NestCount2)), Part2],
  m7 = Inventory_test[, rowSums(table(Part2, NestCount2) > 0L)]
) -> mb

print(mb, digits = 3)
#> Unit: relative
#>  expr  min   lq mean median   uq  max neval cld
#>    m1 1.26 1.27 1.37   1.32 1.60 1.12   100  b 
#>    m2 1.28 1.18 1.29   1.16 1.20 5.93   100  b 
#>    m3 2.21 2.05 2.14   1.98 2.10 3.92   100   c
#>    m4 1.25 1.16 1.23   1.14 1.16 3.97   100 ab 
#>    m5 1.34 1.23 1.28   1.22 1.18 4.27   100 ab 
#>    m6 1.48 1.37 1.35   1.33 1.35 1.18   100  b 
#>    m7 1.00 1.00 1.00   1.00 1.00 1.00   100 a

autoplot(mb)

Created on 2018-07-27 by the reprex package (v0.2.0.9000).
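
To make the table() approach (m7 above) easier to follow, here is a minimal base-R sketch of what it computes. The vectors part and nest are hypothetical toy data, not the question's Inventory_test, and the tapply() cross-check is only there to illustrate that the result matches a plain per-group count.

```r
# Toy data (hypothetical, much smaller than the question's example)
part <- c("Shroud", "Shroud", "Shroud", "S1Nozzle", "S1Nozzle", "S2Nozzle")
nest <- c("1", "1", "3", "2", "2", "4")

# Contingency table: one row per part, one column per nest,
# each cell holds how many rows have that (part, nest) pair
tab <- table(part, nest)

# tab > 0L marks the pairs that occur at least once;
# rowSums() then counts the distinct nests for each part
counts <- rowSums(tab > 0L)
print(counts)  # Shroud has 2 distinct nests, each nozzle has 1

# Cross-check against a straightforward per-group count
check <- tapply(nest, part, function(x) length(unique(x)))
stopifnot(all(counts[names(check)] == check))
```

One thing to keep in mind: table() builds a dense parts-by-nests matrix, so this trick probably suits cases like this one, where both columns have few distinct values. With many distinct values the matrix grows with their product, and the relative timings above may not carry over.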

PS. Interestingly, data.table(data.frame(NestCount2, Part2)) is actually a little bit faster than data.table(NestCount2, Part2) . That's because data.frame() coerces the strings to factors (its default before R 4.0.0, where stringsAsFactors became FALSE), and these operations seem a bit faster on factors.

For once stringsAsFactors = TRUE did some good -- go figure!
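
If you want to check the factor coercion yourself, here is a small base-R sketch. The vectors are hypothetical toy data, and stringsAsFactors is passed explicitly so the result does not depend on which default your R version uses.

```r
# Toy vectors (hypothetical, modelled on the question's columns)
nest <- c("1", "1", "2", "2", "3")
part <- c("Shroud", "Shroud", "S1Nozzle", "S1Nozzle", "Shroud")

# Build the same data twice: once with character columns, once with factors
df_chr <- data.frame(nest, part, stringsAsFactors = FALSE)
df_fct <- data.frame(nest, part, stringsAsFactors = TRUE)

class(df_chr$part)  # "character"
class(df_fct$part)  # "factor"

# The distinct-nest counts per part are the same either way;
# only the column type (and hence the speed) differs
n_chr <- with(df_chr, rowSums(table(part, nest) > 0L))
n_fct <- with(df_fct, rowSums(table(part, nest) > 0L))
stopifnot(isTRUE(all.equal(n_chr, n_fct)))
```

Whether factors are actually faster on your data is something to benchmark rather than assume; the timings above come from a 42-row example.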
