Optimizing counting the number of unique values of one variable by another with data.table

I am trying to find the number of unique values of one variable, x, within each group defined by the variable/key y.

I have been using the following code:

 DT[,length(unique(x)),by=y] -> x_count_per_y

This works, but is somewhat slow. Is there a way to optimize this for data.table, or is this the fastest I should expect?
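To make the goal concrete, here is a minimal sketch on a toy table. The data values are made up for illustration; only the `DT[, length(unique(x)), by = y]` call comes from the question.

```r
library(data.table)

# Toy data (hypothetical): x is the value column, y the grouping column
DT <- data.table(
  x = c(1, 1, 2, 3, 3, 3),
  y = c("a", "a", "a", "b", "b", "c")
)

# Count the distinct x values within each group of y
x_count_per_y <- DT[, length(unique(x)), by = y]
print(x_count_per_y)
# Group "a" holds x = 1, 1, 2 (2 distinct values); "b" and "c" hold 1 each
```

The unnamed result column is called V1 by default.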

Use uniqueN from data.table version 1.9.5.
It should also be possible in 1.9.4 using:

uniqueN <- function(x) length(attr(data.table:::forderv(x, retGrp=TRUE),"starts",TRUE))
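Whichever definition is in use, it may be worth confirming that it agrees with `length(unique(x))` before swapping it in. The check below is only a sketch: it assumes a data.table version that exports `uniqueN`, and the test vectors and seed are arbitrary.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical test vectors: one integer, one character
v_int <- sample(50L, 1000L, TRUE)
v_chr <- sample(letters, 1000L, TRUE)

# uniqueN should return the same count as the base R idiom
stopifnot(
  uniqueN(v_int) == length(unique(v_int)),
  uniqueN(v_chr) == length(unique(v_chr))
)
```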

To use it programmatically:

byvar = "y"
countvar = "x"
DT[, uniqueN(.SD), by=byvar, .SDcols=countvar]
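Because `uniqueN(.SD)` receives a data.table, the same pattern extends to more than one counted column, in which case it counts unique rows, that is, unique combinations. A small sketch, where the column `z` and the data values are hypothetical:

```r
library(data.table)

# Toy data (hypothetical): count unique (x, z) combinations per y
DT <- data.table(
  x = c(1, 1, 2, 3, 3),
  z = c("p", "p", "p", "q", "q"),
  y = c("a", "a", "a", "b", "b")
)

byvar    <- "y"
countvar <- c("x", "z")  # column names supplied as strings

res <- DT[, uniqueN(.SD), by = byvar, .SDcols = countvar]
print(res)
# Group "a" has rows (1,p), (1,p), (2,p): 2 unique rows
# Group "b" has rows (3,q), (3,q): 1 unique row
```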

Timings are below:

library(data.table)
library(microbenchmark)
N <- 1e6
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 85.58602 85.58602 85.58602 85.58602 85.58602 85.58602     1
#                          DT[, uniqueN(x), y] 92.71877 92.71877 92.71877 92.71877 92.71877 92.71877     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 97.51024 97.51024 97.51024 97.51024 97.51024 97.51024     1
N <- 1e7
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr       min        lq      mean    median        uq       max neval
#                   DT[, length(unique(x)), y] 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212     1
#                          DT[, uniqueN(x), y]  843.0670  843.0670  843.0670  843.0670  843.0670  843.0670     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  804.7881  804.7881  804.7881  804.7881  804.7881  804.7881     1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: seconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 3.025365 3.025365 3.025365 3.025365 3.025365 3.025365     1
#                          DT[, uniqueN(x), y] 4.734323 4.734323 4.734323 4.734323 4.734323 4.734323     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 5.905721 5.905721 5.905721 5.905721 5.905721 5.905721     1
N <- 1e7
DT <- data.table(x = sample(1e3,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: seconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 2.906589 2.906589 2.906589 2.906589 2.906589 2.906589     1
#                          DT[, uniqueN(x), y] 4.731925 4.731925 4.731925 4.731925 4.731925 4.731925     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 7.084020 7.084020 7.084020 7.084020 7.084020 7.084020     1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 1331.244 1331.244 1331.244 1331.244 1331.244 1331.244     1
#                          DT[, uniqueN(x), y]  998.040  998.040  998.040  998.040  998.040  998.040     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 1096.867 1096.867 1096.867 1096.867 1096.867 1096.867     1

A lot depends on the data: in the N = 1e7 runs above, uniqueN is faster when there are few groups (1e2) and slower when there are many (1e5). I've filed an issue to take a look at those timings. One more for characters:

N <- 1e7
DT <- data.table(x = sample(letters,N,TRUE), y = sample(letters[1:10],N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr       min        lq      mean    median        uq       max neval
#                   DT[, length(unique(x)), y] 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865     1
#                          DT[, uniqueN(x), y]  573.8628  573.8628  573.8628  573.8628  573.8628  573.8628     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  528.3269  528.3269  528.3269  528.3269  528.3269  528.3269     1
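Since each benchmark above is a single run (`times=1L`), the numbers can be noisy. A hedged sketch of a more cautious comparison: first confirm that the three expressions return the same counts, then repeat the timing several times on a smaller table. The smaller N, the seed, and `times = 10L` are arbitrary choices, not taken from the answer.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
N  <- 1e5    # smaller than in the answer so the repeated runs stay quick
DT <- data.table(x = sample(1e3, N, TRUE), y = sample(1e2, N, TRUE))

r1 <- DT[, length(unique(x)), y]
r2 <- DT[, uniqueN(x), y]
r3 <- DT[, uniqueN(.SD), by = "y", .SDcols = "x"]

# All three approaches should agree on groups and counts
stopifnot(
  identical(r1$y, r2$y), identical(r1$y, r3$y),
  identical(r1$V1, r2$V1), identical(r1$V1, r3$V1)
)

# Repeat the timing if microbenchmark is installed
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    times   = 10L,
    base    = DT[, length(unique(x)), y],
    uniqueN = DT[, uniqueN(x), y],
    SD      = DT[, uniqueN(.SD), by = "y", .SDcols = "x"]
  ))
}
```

With repeated runs, the median column is usually a steadier basis for comparison than a single timing.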
