
How to order data within subgroups in data.table R

Consider the following:

DT = data.table(a=sample(1:2), b=sample(1:1000,20))

How can I display b, say the n highest values, for each a?

I am stuck at DT[,b,by=a][order(a,-b)], which sorts b within each a but does not limit the output to the top n.

Thanks!

The most elegant would be:

DT[order(-b),head(b,5),by=a]

In terms of pure performance:

DT[order(-b), indx := seq_len(.N), "a"][indx <= 5][,indx:=NULL][]

Or the one suggested by @Frank:

DT[DT[order(-b),.I[1:.N<=5],"a"]$V1]
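As a quick sanity check, the three approaches can be run side by side on a small table. This is only a minimal sketch: the seed, the toy data, and the names n, res1, res2 and res3 are assumptions made for illustration and are not part of the original answer.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only for reproducibility
# toy data: 2 groups of 10 rows, 20 distinct values of b
DT <- data.table(a = rep(1:2, each = 10), b = sample(1:1000, 20))
n <- 3L  # number of top values to keep per group

# 1. order by b descending, then take the first n values per group
res1 <- DT[order(-b), head(b, n), by = a]

# 2. add a within-group counter by reference, filter on it, drop it from the result
res2 <- DT[order(-b), indx := seq_len(.N), by = a][indx <= n][, indx := NULL][]
# the counter was added to DT itself by reference, so remove it there as well
DT[, indx := NULL]

# 3. collect the row numbers of the top n per group with .I, then subset once
res3 <- DT[DT[order(-b), .I[seq_len(.N) <= n], by = a]$V1]

print(res1)
print(res2)
print(res3)
```

All three return the same n values per group, but in different shapes: the first returns only a and an unnamed V1 column, the second keeps DT's original row order, and the third returns full rows with b descending within each group. Note that in the second approach the temporary indx column stays in DT unless it is removed explicitly, as done above.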

Below is a benchmark of all three approaches above:

# devtools::install_github("jangorecki/dwtools")
library(dwtools) # to populate complex dataset
N <- 5e6
DT <- dw.populate(N, scenario="fact")
str(DT)
#Classes ‘data.table’ and 'data.frame': 5000000 obs. of  8 variables:
# $ cust_code: chr  "id010" "id076" "id024" "id081" ...
# $ prod_code: int  8234 5689 31198 35479 39140 37589 8184 39489 35266 3596 ...
# $ geog_code: chr  "OH" "NH" "TN" "MI" ...
# $ time_code: Date, format: "2012-03-11" "2014-02-10" "2012-11-05" "2013-01-30" ...
# $ curr_code: chr  "XRP" "HRK" "CAD" "BRL" ...
# $ amount   : num  486 382 695 470 749 ...
# $ value    : num  193454 33694 351418 84888 20673 ...

Grouping by the cust_code column (uniqueN equal to 100):

system.time(DT[order(-time_code),head(.SD,5),"cust_code"])
#   user  system elapsed 
#  1.804   0.084   1.890 
system.time(DT[order(-time_code), indx := seq_len(.N),"cust_code"][indx <= 5][,indx:=NULL][])
#   user  system elapsed 
#  1.414   0.092   1.508 
system.time(DT[DT[order(-time_code),.I[1:.N<=5],"cust_code"]$V1])
#   user  system elapsed 
#  1.405   0.096   1.502 

If there are many more groups (prod_code column, uniqueN equal to 50000), the impact on performance becomes visible:

system.time(DT[order(time_code),head(.SD,5),"prod_code"])
#   user  system elapsed 
# 10.177   0.109  10.322
system.time(DT[order(time_code), indx := seq_len(.N),"prod_code"][indx <= 5][,indx:=NULL][])
#   user  system elapsed 
#  1.555   0.099   1.665 
system.time(DT[DT[order(time_code),.I[1:.N<=5],"prod_code"]$V1])
#   user  system elapsed 
#  1.697   0.064   1.764
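For readers who cannot install dwtools, a comparable timing can be sketched on synthetic data. Everything here is an assumption for illustration: the column names few, many and t, the table size, and the seed are made up, and the timings will differ from the ones reported above depending on machine and data.table version.

```r
library(data.table)

set.seed(1)
N <- 1e6  # smaller than the 5e6 rows above, to keep the run short
DT <- data.table(
  few  = sample(100L, N, replace = TRUE),    # about 100 groups
  many = sample(50000L, N, replace = TRUE),  # about 50000 groups
  t    = sample(1000L, N, replace = TRUE)    # stand-in for time_code
)

# few groups: head(.SD, 5) per group versus the .I-based subset
time_few_head <- system.time(top_few_head <- DT[order(-t), head(.SD, 5), by = "few"])
time_few_I    <- system.time(top_few_I <- DT[DT[order(-t), .I[seq_len(.N) <= 5], by = "few"]$V1])

# many groups: same two expressions, grouped by the high-cardinality column
time_many_head <- system.time(top_many_head <- DT[order(-t), head(.SD, 5), by = "many"])
time_many_I    <- system.time(top_many_I <- DT[DT[order(-t), .I[seq_len(.N) <= 5], by = "many"]$V1])

# elapsed seconds, side by side
print(rbind(few_head  = time_few_head,  few_I  = time_few_I,
            many_head = time_many_head, many_I = time_many_I)[, "elapsed"])
```

Comparing the elapsed times for few and many shows how each expression scales with the number of groups on the installed version.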

Update on 2015-11-09:

With today's commit e615532 by Arun, head and tail should be optimized under the hood.
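One way to see what the installed version does with a grouped head call is to pass verbose = TRUE to the query and read the messages it prints. This is a hedged sketch on made-up toy data: the exact wording of the verbose output, and whether it reports any optimization of j, depends on the data.table version in use.

```r
library(data.table)

# which data.table version is installed
print(packageVersion("data.table"))

# toy data: 2 groups of 5 rows
DT <- data.table(a = rep(1:2, each = 5), b = 1:10)

# verbose = TRUE prints details about how the grouped j expression is evaluated
res <- DT[order(-b), head(.SD, 2), by = a, verbose = TRUE]
print(res)
```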
