R - Using data.table to efficiently test rolling conditions across multiple rows and columns
I am trying to test a variety of conditions in a data.table that looks like this reproducible example:
library(data.table)

set.seed(17)
year <- 1980 + rnbinom(10000,3,0.35)
event <- rep(LETTERS, length.out=10000)
z <- as.integer(runif(10000,min = 0, max = 10))
dt <- data.table(event,year,z)
setkey(dt, event,year)
# sum z per event-year; the unnamed result column is called V1
dt <- dt[,sum(z), by=c("event","year")]
V1 (which emerges from the last command) represents a count of event occurrences.
So the data table is an ordered array and I need to execute a variety of functions on it. Here are some examples:
1. How do I calculate a rolling sum (or rolling mean) of the occurrences in the 10 prior years for each event? So for A 1990 the desired output is 1,452 (between 1980 and 1989). For H 2012, the output is 11 because between 2002 and 2011 there are only 11 occurrences (3 in 2002, 3 in 2007, and 5 in 2010). For A 1983 the output is NA.
2. How can I check whether an event occurs in at least 12 out of 15 prior years? So for A 1997 we can see that the event occurred in more than 12 of the 15 prior years (1982 - 1996; it happened in every year besides 1996), thus the criterion is met. However, for A 2001 we see that the event only occurs in 11 of the 15 prior years (1986 - 2000; it does not happen in 1996, 1998, 1999, and 2000), so the criterion is not met. The desired output here would be a discrete 1 (criterion met) or 0 (criterion not met).
Ideally the code would enable the calculation of both 1 and 2 not only for years that occur in the data.table but also for those between 1980 and 2013 that are missing. So for K 2005, we can calculate the outcome for Q1 as 25 (13 + 5 + 3 + 3 + 2) (thanks @Arun for pointing the former error out). For Q2, we see the event does not occur in 1999, 2000, 2001, 2003, and 2004, hence the criterion "at least 12 out of 15 years" is not met. Also, it is possible that the event-year combination exists in the data.table but that V1 has value 0 (see row 18, A 2001). Ideally, such zero occurrences would be treated as non-occurrences (e.g. by deleting all rows for which V1 is zero).
I know it's uncommon to post two questions but I feel they belong together and really relate to similar problems. Hope someone can make some suggestions.
Thanks a lot,
Simon
This'll get the running sum for years that are not necessarily in the dataset as well (as you requested just underneath the two points). The idea is to first generate all combinations of event and year, even the ones which don't exist in the dataset. This can be accomplished with the function CJ (for cross join). For each event, this creates every year.
setkey(dt, event, year)
d1 = CJ(event=unique(dt$event), year=min(dt$year):max(dt$year))
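To see what CJ produces, here is a minimal sketch on a toy table. The table toy and its values are made up for illustration and are not part of the original data.

```r
library(data.table)

# Toy data (hypothetical values): event "B" has no row for 1981
toy <- data.table(event = c("A", "A", "B"),
                  year  = c(1980L, 1981L, 1980L),
                  V1    = c(4L, 7L, 2L))

# CJ() builds every event/year pair and returns a data.table
# that is sorted and keyed by both columns
grid <- CJ(event = unique(toy$event), year = min(toy$year):max(toy$year))
print(grid)
```

The grid has one row per event and year, including B 1981, which does not exist in toy.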
Now, we join back with dt to fill the missing values for V1 with NA.
d1 = dt[d1]
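A small sketch of the same join on toy data may help. The names toy, grid and filled and all values are assumptions made for illustration; only the x[i] keyed-join pattern comes from the answer.

```r
library(data.table)

# Toy data (hypothetical values): event "B" has no row for 1981
toy <- data.table(event = c("A", "A", "B"),
                  year  = c(1980L, 1981L, 1980L),
                  V1    = c(4L, 7L, 2L))
setkey(toy, event, year)

# Full event/year grid, keyed by event and year
grid <- CJ(event = unique(toy$event), year = 1980:1981)

# x[i] keeps every row of i; pairs with no match in x get NA for V1
filled <- toy[grid]
print(filled)
```

Here the row for B 1981 appears with V1 set to NA, while the three existing rows keep their values.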
Now we have a dataset with all combinations of event and year. From here, we have to find a way to perform the rolling sum. For this, we create yet another dataset, which contains all the previous 10 years for each year, as follows:
window_size = 10L
d2 = d1[, list(window = seq(year-window_size, year-1L, by=1L)), by="event,year"]
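The expansion step can be checked on a tiny input. In this sketch the window size of 3 and the table yrs are hypothetical, chosen only to keep the output short.

```r
library(data.table)

window_size <- 3L   # hypothetical; the answer uses 10L
yrs <- data.table(event = c("A", "A"), year = c(1983L, 1984L))

# One output row per (event, year, prior year inside the window)
win <- yrs[, list(window = seq(year - window_size, year - 1L, by = 1L)),
           by = "event,year"]
print(win)
```

Each event/year pair is repeated once per prior year, so A 1983 expands to the window years 1980, 1981 and 1982.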
For each "event,year", we create a new column window that holds the previous 10 years.
Now, all we have to do is set the key columns appropriately and perform a join to get the corresponding "V1" values.
setkey(d2, event, window) ## note the join here is on "event, window"
setkey(d1, event, year)
ans = d1[d2]
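As a variation on the keyed join, the sketch below uses the on= argument and names the target-year column target up front. Both choices are assumptions of this sketch, not part of the answer. As far as I know, newer data.table releases label the clashing column i.year rather than year.1, so a distinct name avoids depending on either label.

```r
library(data.table)

# Toy values (hypothetical): V1 per year for one event, NA where nothing happened
vals <- data.table(event = "A", year = 1980:1983, V1 = c(4L, NA, 7L, 1L))

# Window rows for the target year 1983 (its three prior years)
win <- data.table(event = "A",
                  target = c(1983L, 1983L, 1983L),
                  window = 1980:1982)

# Match each window year against vals$year; target is carried along from win
joined <- vals[win, on = c(event = "event", year = "window")]
print(joined)
```

Each window row picks up the V1 of the year it points at, so summing V1 per target year gives the rolling sum.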
Now we have the values of "V1" for each "event,window" combination. All we have to do is aggregate by "event,year.1" ("year.1" was previously "year", and "year" in ans was previously "window"). Here, we take care of the condition that if any of the years is < 1980, then the sum should be NA. This is done with a small hack: TRUE | NA = TRUE and FALSE | NA = NA, so multiplying the sum by (!any(year < 1980) | NA) leaves it unchanged when every window year is 1980 or later and turns it into NA otherwise.
q1 = ans[, sum(V1, na.rm=TRUE) * (!any(year < 1980) | NA), by="event,year.1"]
q1[event == "K" & year.1 == 2005]
# event year.1 V1
# 1: K 2005 25
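The NA trick is easy to verify in isolation. The following sketch uses made-up window rows with hypothetical column names (target, window); only the expression pattern is taken from the answer.

```r
library(data.table)

# R's three-valued logic: TRUE | NA is TRUE, FALSE | NA is NA
stopifnot(isTRUE(TRUE | NA), is.na(FALSE | NA))

first_year <- 1980L   # first year covered by the data, as in the question

# Toy window rows (hypothetical values) for two target years
win <- data.table(
  target = c(1983L, 1983L, 1983L, 1981L, 1981L, 1981L),
  window = c(1980L, 1981L, 1982L, 1978L, 1979L, 1980L),
  V1     = c(4L, NA, 7L, NA, NA, 4L)
)

# The multiplier is TRUE (i.e. 1) when the whole window lies inside the data
# range and NA when any window year falls before first_year
res <- win[, list(roll_sum = sum(V1, na.rm = TRUE) *
                    (!any(window < first_year) | NA)),
           by = target]
print(res)
```

The target year 1983 keeps its sum of 11, while 1981 becomes NA because part of its window lies before 1980.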
Repeat the same as above with window_size = 15L instead of 10L and proceed up to ans. Then, we can do:
q2 = ans[!is.na(V1)][, .N, by="event,year.1"]
q2[event == "A" & year.1 == 1997]
# event year.1 N
# 1: A 1997 14
This is correct because dt has every year from 1982 to 1995 for A, and 1996 is missing and therefore not counted => N=14, as it should be.
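The question asks for a 1/0 flag and for zero counts to be treated as non-occurrences, which the count above does not yet do. One possible way to get there is sketched below on toy window rows. The threshold of 2, the column names and the values are hypothetical; the question's own threshold is 12 out of 15.

```r
library(data.table)

min_years <- 2L   # hypothetical threshold for the toy data

# Toy window rows (hypothetical values): V1 is NA for missing years
# and may be 0 for years that exist but had no occurrences
win <- data.table(
  target = c(1990L, 1990L, 1990L, 1991L, 1991L, 1991L),
  V1     = c(3L, 0L, NA, 5L, 2L, NA)
)

# Count window years with a strictly positive V1, then compare to the threshold
flag <- win[, list(met = as.integer(sum(!is.na(V1) & V1 > 0) >= min_years)),
            by = target]
print(flag)
```

With these values, 1990 has only one year with a positive count and gets 0, while 1991 has two and gets 1.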