R - Using data.table to efficiently test rolling conditions across multiple rows and columns

I am trying to test a variety of conditions in a data.table that looks like this reproducible example:

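 library(data.table)

 set.seed(17)
 year  <- 1980 + rnbinom(10000, 3, 0.35)       # event years, starting at 1980
 event <- rep(LETTERS, length.out = 10000)     # 26 event types, A to Z
 z     <- as.integer(runif(10000, min = 0, max = 10))
 dt <- data.table(event, year, z)
 setkey(dt, event, year)
 dt <- dt[, sum(z), by = c("event", "year")]   # the unnamed sum becomes column V1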

V1 (which emerges from the last command) represents a count of event occurrences.

So the data table is an ordered array and I need to execute a variety of functions on it. Here are some examples:

  1. How do I calculate a rolling sum (or rolling mean) of the occurrences in the 10 prior years for each event? So for A 1990 the desired output is 1,452 (the sum over 1980 to 1989). For H 2012, the output is 11 because between 2002 and 2011 there are only 11 occurrences (3 in 2002, 3 in 2007, and 5 in 2010). For A 1983 the output is NA.

  2. How can I check whether an event occurs in at least 12 out of the 15 prior years? So for A 1997 we can see that the event occurred in more than 12 of the 15 prior years (1982 to 1996; it happened in every year except 1996), so the criterion is met. However, for A 2001 the event occurs in only 11 of the 15 prior years (1986 to 2000; it does not happen in 1996, 1998, 1999, or 2000), so the criterion is not met. The desired output here would be a discrete 1 (criterion met) or 0 (criterion not met).

Ideally the code would enable the calculation of both 1 and 2 not only for years that occur in the data.table but also for those between 1980 and 2013 that are missing. So for K 2005, we can calculate the outcome for Q1 as 25 (13 + 5 + 3 + 3 + 2) (thanks @Arun for pointing out the earlier error). For Q2, we see the event does not occur in 1999, 2000, 2001, 2003, and 2004, hence the criterion "in at least 12 out of 15 years" is not met. Also, it is possible that the event-year combination exists in the data.table but that V1 has the value 0 (see row 18, A 2001). Ideally, such zero occurrences would be treated as non-occurrences (e.g. by deleting all rows for which V1 is zero).

I know it's uncommon to post two questions but I feel they belong together and really relate to similar problems. Hope someone can make some suggestions.

Thanks a lot,

Simon

For your first question:

This will also get the running sum for years that are not necessarily in the dataset (as you requested just underneath the two points). The idea is to first generate all combinations of event and year, even the ones that don't exist in the dataset. This can be accomplished with the function CJ (for cross join). For each event, it creates every year.

setkey(dt, event, year)
d1 = CJ(event=unique(dt$event), year=min(dt$year):max(dt$year))

Now, we join back with dt, so that the event-year combinations missing from dt get NA for V1.

d1 = dt[d1]
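
To make the effect of the cross join and the join back concrete, here is a minimal sketch on a tiny made-up table. The names `toy`, `grid`, and `full` and the values in them are assumptions for illustration only; they are not part of the question's data.

```r
library(data.table)

# Hypothetical toy data: event "A" is missing 1981, event "B" is missing 1980.
toy <- data.table(event = c("A", "A", "B", "B"),
                  year  = c(1980L, 1982L, 1981L, 1982L),
                  V1    = c(5L, 2L, 7L, 1L))
setkey(toy, event, year)

# CJ() builds every event-year combination, sorted by both columns.
grid <- CJ(event = unique(toy$event), year = min(toy$year):max(toy$year))

# Joining toy against the grid keeps all grid rows; unmatched rows get NA in V1.
full <- toy[grid]
print(full)
```

The result should have one row per event and year, with NA in V1 wherever the original table had no row.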

Now we have a dataset with all combinations of event and year. From here, we need a way to perform the rolling sum. For this, we create yet another dataset, which contains, for each year, all of the previous 10 years, as follows:

window_size = 10L
d2 = d1[, list(window = seq(year-window_size, year-1L, by=1L)), by="event,year"]

For each "event,year", we create a new column, window, that holds the previous 10 years.
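
As a rough illustration of what this expansion produces, the sketch below uses a window of 3 years instead of 10 and a made-up two-row grid, purely to keep the output short. The names `grid` and `win` are hypothetical.

```r
library(data.table)

window_size <- 3L  # smaller than the 10L used above, only to keep the output short
grid <- CJ(event = "A", year = 1990:1991)

# One row per (event, year, prior year inside the window).
win <- grid[, list(window = seq(year - window_size, year - 1L, by = 1L)),
            by = "event,year"]
print(win)
```

Each event-year pair is repeated once per year in its window, so the expanded table has `window_size` times as many rows as the grid.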

Now, all we have to do is set the key columns appropriately and perform a join to get the corresponding "V1" values.

setkey(d2, event, window) ## note the join here is on "event, window"
setkey(d1, event, year)

ans = d1[d2]
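
Both tables in this join have a column called "year", so the name given to the clashing column in the result can differ between data.table versions. One way to sidestep that, sketched here under the assumption that your data.table version supports the `on=` argument, is to rename the target-year column before joining. The names `grid`, `win`, `target`, `res`, and `q` are hypothetical and the data is made up.

```r
library(data.table)

# Hypothetical full grid for one event, with NA where the event did not occur.
grid <- data.table(event = "A", year = 2000:2003, V1 = c(1L, NA, 3L, 4L))

# Expand each year into its 2 prior years (a short window for illustration).
win <- grid[, list(window = seq(year - 2L, year - 1L, by = 1L)),
            by = "event,year"]

# Rename first, so the join result has no clashing column names.
setnames(win, "year", "target")

# Match grid's (event, year) against win's (event, window).
res <- grid[win, on = c(event = "event", year = "window")]

# Sum V1 over each target year's window, ignoring missing years.
q <- res[, list(total = sum(V1, na.rm = TRUE)), by = c("event", "target")]
print(q)
```

This sketch leaves out the "NA if the window starts before the first year" rule, which the answer handles separately.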

Now we have the value of "V1" for each "event,window" combination. All that remains is to aggregate by "event,year.1" (after the join, "year.1" holds what was previously "year", and "year" in ans holds what was previously "window"; newer data.table releases may name the clashing column "i.year" instead, so check names(ans)). Here, we also take care of the condition that if any of the years in the window is < 1980, then the sum should be NA. This is done with a small hack: TRUE | NA is TRUE, while FALSE | NA is NA, so multiplying the sum by that value either keeps it or turns it into NA.

q1 = ans[, sum(V1, na.rm=TRUE) * (!any(year < 1980) | NA), by="event,year.1"]

q1[event == "K" & year.1 == 2005]
#    event year.1 V1
# 1:     K   2005 25

For your second question:

Repeat the same steps as above with window_size = 15L instead of 10L, up to and including ans. Then, we can do:

q2 = ans[!is.na(V1)][, .N, by="event,year.1"]

q2[event == "A" & year.1 == 1997]
#    event year.1  N
# 1:     A   1997 14

This is correct because dt has every year from 1982 to 1995, while 1996 is missing and therefore not counted, so N = 14, as it should be.
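
The question asks for a discrete 1 or 0 and for zero counts to be treated as non-occurrences. One possible way to get there from a table shaped like ans is sketched below. The table `ans_toy`, the column names `n_years` and `met`, and the threshold variable `min_years` are assumptions made up for this illustration.

```r
library(data.table)

# Hypothetical stand-in for `ans`: one row per (event, target year, window year).
ans_toy <- data.table(
  event  = "A",
  year.1 = rep(c(1997L, 2001L), each = 15L),
  V1     = c(rep(4L, 14L), NA,                   # 14 of 15 prior years occur
             rep(4L, 10L), 0L, 4L, NA, NA, NA)   # 11 positive, 1 zero, 3 missing
)

min_years <- 12L  # "at least 12 out of 15 prior years"

# Count years with a positive V1; NA and 0 both count as non-occurrences.
flag <- ans_toy[, list(n_years = sum(!is.na(V1) & V1 > 0L)),
                by = "event,year.1"]

# Turn the count into the discrete 1 / 0 indicator.
flag[, met := as.integer(n_years >= min_years)]
print(flag)
```

Counting inside the grouped expression, instead of filtering rows first, also keeps event-years whose window contains no occurrences at all, so they receive a 0 and do not drop out of the result.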
