
Efficient way to group data.table results by year

I am looking for advice as to whether I am using data.table efficiently.

I have a data set which describes incidents, with one row per incident. On each row I have the date of the incident. Right now I just want to count how many incidents there are per year. I have done this using the code below, but it feels inefficient. I would appreciate any advice on how to improve this. (The data set is far bigger than shown below, and I also have to do other similar but more complex counts.)

Create a list of dates from 2000 until the end of 2012:

dates <- seq(as.Date("1/1/2000", format="%d/%m/%Y"), 
  as.Date("31/12/2012", format="%d/%m/%Y"), 
  "day")

# Choose one million occurrences on various dates:    

sampleDate <- sample(dates, 1000000, replace=TRUE)

# Create `data.table`, one line per incident:   

library(data.table)
library(plyr)  # provides count(), used below
DT.dt <- data.table(Date=sampleDate, incident=1)

# Time how long it takes to count the number of incidents in each year:

system.time(result <- DT.dt[,count(format(Date,"%Y"))])

user  system elapsed 
11.83    0.10   11.95 

result[1:3,]
x    freq
2000 76930
2001 77101
2002 76666

So it works (I think), but I suspect there is a more efficient solution...

When you are doing aggregate operations (grouping) with data.tables, especially on large data sets, you should set the field you are grouping by as a key (using setkeyv(DT, "your_key_field"), etc.). Also, I can't speak definitively on the topic, but in general I think you will get better performance from native data.table functions and operations inside your data.table object than from other packages' functions, such as plyr::count.

Below, I made a few data.table objects: the first is identical to your example; the second adds a column Year (instead of calculating format(Date,"%Y") at the time of function execution) but sets Date as the key; and the third is the same as the second, except that it uses Year as the key. I also made a few functions (for benchmarking convenience) that do the grouping in different ways.

library(data.table)
library(plyr) # for 'count' function
library(microbenchmark)
##
dates <- seq.Date(
  from=as.Date("2000-01-01"),
  to=as.Date("2012-12-31"),
  by="day")
##
set.seed(123)
sampleDate <- sample(
  dates,
  1e06,
  replace=TRUE)
##
DT.dt <- data.table(
  Date=sampleDate,
  incident=1)
##
DT.dt2 <- copy(DT.dt)
DT.dt2[,Year:=format(Date,"%Y")]
setkeyv(DT.dt2,"Date")
##
DT.dt3 <- copy(DT.dt2)
setkeyv(DT.dt3,"Year")
##
> head(DT.dt,3)
         Date incident
1: 2003-09-27        1
2: 2010-04-01        1
3: 2005-04-26        1
> head(DT.dt2,3)
         Date incident Year
1: 2000-01-01        1 2000
2: 2000-01-01        1 2000
3: 2000-01-01        1 2000
> head(DT.dt3,3)
         Date incident Year
1: 2000-01-01        1 2000
2: 2000-01-01        1 2000
3: 2000-01-01        1 2000

## your original method
f1 <- function(dt)
{
  dt[,count(format(Date,"%Y"))]
}
## your method - using 'Year' column
f1.2 <- function(dt)
{
  dt[,count(Year)]
}
## use 'Date' column; '.N' and 
## 'by=' instead of 'count'
f2 <- function(dt)
{
  dt[,.N,by=format(Date,"%Y")]
}
## use 'Year' and '.N','by='
f3 <- function(dt)
{
  dt[,.N,by=Year]
}
##
Res <- microbenchmark(
  f1(DT.dt),
  f1.2(DT.dt2),
  f1.2(DT.dt3),
  f2(DT.dt2),
  f3(DT.dt3))
##
> Res
Unit: milliseconds
         expr        min         lq     median         uq      max neval
    f1(DT.dt) 478.941767 515.144253 557.428159 585.579862 706.8724   100
 f1.2(DT.dt2)  98.722062 115.588034 126.332104 137.792116 223.4967   100
 f1.2(DT.dt3)  97.475673 118.134788 125.836817 136.136156 238.2697   100
   f2(DT.dt2) 352.767219 373.337958 387.759996 429.301164 542.1674   100
   f3(DT.dt3)   7.912803   8.441159   8.736887   9.685267  76.9629   100

Observations:

  1. Grouping by the precalculated field Year instead of calculating format(Date,"%Y") at execution time was a definite improvement for both the count and the .N approaches. You can see this by comparing f1() with f1.2() for count, and f2() with f3() for .N.

  2. The count approach seemed to be somewhat slower than the .N and by= approach (f1() compared to f2(): a median of about 557 ms versus about 388 ms).

  3. The best approach by far was to use the precalculated field Year and the native data.table grouping with .N and by=; f3() was considerably faster than the other four timings, with a median under 9 ms.
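
Before trusting a faster variant, it can help to confirm that it returns the same counts. The following is only a sketch on a small hypothetical sample (the names DT, by_year and check are mine, not from the code above); it cross-checks the .N and by= result against base R's table():

```r
library(data.table)

# Small hypothetical sample; the date range is illustrative only.
set.seed(1)
dates <- seq(as.Date("2000-01-01"), as.Date("2002-12-31"), by = "day")
DT <- data.table(Date = sample(dates, 1000, replace = TRUE), incident = 1)

# Precompute the grouping column once, then count rows per group.
DT[, Year := format(Date, "%Y")]
by_year <- DT[, .N, keyby = Year]  # keyby returns the groups sorted by Year

# Cross-check against base R: table() counts the same year labels.
check <- table(format(DT$Date, "%Y"))
stopifnot(all(by_year$Year == names(check)),
          all(by_year$N == as.vector(check)))
```

If stopifnot() passes silently, the two methods agree on this sample.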

There are some pretty experienced data.table users on SO, certainly more so than myself, so there may be an even faster way to do this. All else aside, though, it's definitely a good idea to set a key on your data.table, and it certainly seems like you would be much better off precalculating a field like Year than computing it "on the fly"; you can always delete it afterwards, if you don't need it, with DT.dt[,Year:=NULL].
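
As one possible variation (not benchmarked here, so treat any speed benefit as an assumption), the helper column can be built with data.table's own year() function, which returns an integer rather than a character string, and dropped again once the counts are in hand. The names DT and counts are illustrative:

```r
library(data.table)

set.seed(1)
dates <- seq(as.Date("2000-01-01"), as.Date("2012-12-31"), by = "day")
DT <- data.table(Date = sample(dates, 1e5, replace = TRUE), incident = 1)

# data.table::year() extracts the year as an integer; the namespace prefix
# avoids picking up a same-named function from another loaded package.
DT[, Year := data.table::year(Date)]

# Count rows per year; keyby also sorts the result by Year.
counts <- DT[, .N, keyby = Year]

# Remove the helper column by reference once it is no longer needed.
DT[, Year := NULL]
```

Whether this beats a precomputed format(Date,"%Y") column on your data is something to measure with microbenchmark rather than assume.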

Also, you said you are trying to count the number of incidents per year, and since your example data had incident = 1 for all rows, counting was the same as summing. But assuming your real data has different values of incident, you could do something like this:

> DT.dt3[,list(Incidents=sum(incident)),by=Year]
    Year Incidents
 1: 2000     77214
 2: 2001     77385
 3: 2002     77080
 4: 2003     76609
 5: 2004     77197
 6: 2005     76994
 7: 2006     76560
 8: 2007     76904
 9: 2008     76786
10: 2009     76765
11: 2010     76675
12: 2011     76868
13: 2012     76963

(where I called setkeyv(DT.dt3,cols="Year") above).
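
To make the difference between counting rows and summing concrete, here is a minimal sketch on made-up data in which incident is assumed to hold the number of incidents each row represents (the table and the column names Rows and Incidents are hypothetical):

```r
library(data.table)

# Hypothetical data: each row may stand for more than one incident.
DT <- data.table(
  Year     = c("2000", "2000", "2001", "2001", "2001"),
  incident = c(1, 3, 2, 1, 1))

# .N counts rows per group; sum(incident) totals the incident column.
res <- DT[, list(Rows = .N, Incidents = sum(incident)), by = Year]

# 2000: 2 rows but 4 incidents; 2001: 3 rows and 4 incidents.
stopifnot(identical(res$Rows, c(2L, 3L)),
          identical(res$Incidents, c(4, 4)))
```

With incident = 1 on every row, as in the example data, the two columns would be identical.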
