
Cumulative Sum in R based on Date and other conditions using data.table

I have some football data (2020/2021 Serie A), and I would like to compute the number of games each team played over the last n days (to keep it simple, say the last 30 days). The conditions are thus the team, the date of the game (strictly earlier than the current game's date), and that same date minus 30 days (greater than or equal to).

I would like to know the best way to do that using data.table alone and, more importantly, the logic behind the code. I would go for a loop over the teams and dates, but I think that is cumbersome, and I am sure there is a way to do it in a single line.

A sample is given below, with the result I would expect (Day and Date might seem inconsistent because some games were postponed, but that is not important; the data is sorted by date). Thank you very much.

Code    Team      Date        Day  Date - 30d  Games played over the last 30 days
TORATA  Atalanta  2020-09-26  2    2020-08-27  NA
LAZATA  Atalanta  2020-09-30  1    2020-08-31  1
ATACAG  Atalanta  2020-10-04  3    2020-09-04  2
NAPATA  Atalanta  2020-10-17  4    2020-09-17  3
ATASAM  Atalanta  2020-10-24  5    2020-09-24  4
CROATA  Atalanta  2020-10-31  6    2020-10-01  3
ATAINT  Atalanta  2020-11-08  7    2020-10-09  3

Here's one implementation, using just data.table and base R:

# for each game, count the same team's games played 1 to 30 days earlier
dat[, z := sapply(Date, function(d) sum(between(d - Date, 0.1, 30))), by = Team]
dat
#      Code     Team       Date   Day Date...30d Games.played.over.the.last.30.days     z
#    <char>   <char>     <Date> <int>     <Date>                              <int> <int>
# 1: TORATA Atalanta 2020-09-26     2 2020-08-27                                 NA     0
# 2: LAZATA Atalanta 2020-09-30     1 2020-08-31                                  1     1
# 3: ATACAG Atalanta 2020-10-04     3 2020-09-04                                  2     2
# 4: NAPATA Atalanta 2020-10-17     4 2020-09-17                                  3     3
# 5: ATASAM Atalanta 2020-10-24     5 2020-09-24                                  4     4
# 6: CROATA Atalanta 2020-10-31     6 2020-10-01                                  3     3
# 7: ATAINT Atalanta 2020-11-08     7 2020-10-09                                  3     3

In this case, for each Date value, we count how many of the other dates fall within the 30 days before it (the lower bound of 0.1 excludes the game itself).

If you need the NA in place of a 0, then you can add on dat[z < 1, z := NA] or similar.
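To make the logic behind the window explicit, here is a minimal sketch that spells out both conditions as plain comparisons instead of between(). The table toy and the column games_30d are hypothetical names used only for this illustration; the dates are the seven Atalanta games from the question.

```r
library(data.table)

# Hypothetical toy table holding the seven dates from the question's sample.
toy <- data.table(
  Team = "Atalanta",
  Date = as.Date(c("2020-09-26", "2020-09-30", "2020-10-04", "2020-10-17",
                   "2020-10-24", "2020-10-31", "2020-11-08"))
)

# Within each team, count the games whose date lies in [d - 30, d),
# i.e. strictly before the current game and at most 30 days earlier.
toy[, games_30d := sapply(Date, function(d) sum(Date < d & Date >= d - 30)),
    by = Team]

toy$games_30d
# 0 1 2 3 4 3 3
```

Inside j, Date is the vector of all dates for the current team, so each call to the anonymous function compares one game's date d against every game of that team. This is quadratic in the number of games per team, which is fine for a football season but may be slow on large tables.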


Data:

library(data.table)
dat <- structure(list(Code = c("TORATA", "LAZATA", "ATACAG", "NAPATA", "ATASAM", "CROATA", "ATAINT"), Team = c("Atalanta", "Atalanta", "Atalanta", "Atalanta", "Atalanta", "Atalanta", "Atalanta"), Date = structure(c(18531, 18535, 18539, 18552, 18559, 18566, 18574), class = "Date"), Day = c(2L, 1L, 3L, 4L, 5L, 6L, 7L), Date...30d = structure(c(18501, 18505, 18509, 18522, 18529, 18536, 18544), class = "Date"), Games.played.over.the.last.30.days = c(NA, 1L, 2L, 3L, 4L, 3L, 3L)), class = c("data.table", "data.frame"), row.names = c(NA, -7L))
setDT(dat)

You could use runner in combination with data.table to calculate a running count over a Date window:

library(data.table)
library(runner)

setDT(data)

data[,Date:=as.Date(Date,'%Y-%m-%d')]

data[,N:=runner::runner(
                        x = Date, 
                        k = 30, # 30-days window
                        lag = 1,
                        idx = Date,
                        f = length)
    ,by=Team][]

     Code     Team       Date Day    Date30d Games30days N
1: TORATA Atalanta 2020-09-26   2 2020-08-27          NA 0
2: LAZATA Atalanta 2020-09-30   1 2020-08-31           1 1
3: ATACAG Atalanta 2020-10-04   3 2020-09-04           2 2
4: NAPATA Atalanta 2020-10-17   4 2020-09-17           3 3
5: ATASAM Atalanta 2020-10-24   5 2020-09-24           4 4
6: CROATA Atalanta 2020-10-31   6 2020-10-01           3 3
7: ATAINT Atalanta 2020-11-08   7 2020-10-09           3 3

Data:

data <- read.table(text='
Code    Team    Date    Day     Date30d     Games30days
TORATA  Atalanta    2020-09-26  2   2020-08-27  NA
LAZATA  Atalanta    2020-09-30  1   2020-08-31  1
ATACAG  Atalanta    2020-10-04  3   2020-09-04  2
NAPATA  Atalanta    2020-10-17  4   2020-09-17  3
ATASAM  Atalanta    2020-10-24  5   2020-09-24  4
CROATA  Atalanta    2020-10-31  6   2020-10-01  3
ATAINT  Atalanta    2020-11-08  7   2020-10-09  3',header=T)

You can get this with one line of code, using a non-equi join of the table onto itself.

Let's say fb is your input data (without the Games30days column). Like this:

     Code     Team       Date Day Date - 30d
1: TORATA Atalanta 2020-09-26   2 2020-08-27
2: LAZATA Atalanta 2020-09-30   1 2020-08-31
3: ATACAG Atalanta 2020-10-04   3 2020-09-04
4: NAPATA Atalanta 2020-10-17   4 2020-09-17
5: ATASAM Atalanta 2020-10-24   5 2020-09-24
6: CROATA Atalanta 2020-10-31   6 2020-10-01
7: ATAINT Atalanta 2020-11-08   7 2020-10-09

Then, just do a join on Team = Team, Date < Date, and Date >= Date - 30d (the lower bound is inclusive, as the question asks), like this:

games_played = fb[fb, on = .(Team = Team, Date < Date, Date >= `Date - 30d`), nomatch = 0][, .("Games30" = .N), .(Date, Team)]

which returns

         Date     Team Games30
1: 2020-09-30 Atalanta       1
2: 2020-10-04 Atalanta       2
3: 2020-10-17 Atalanta       3
4: 2020-10-24 Atalanta       4
5: 2020-10-31 Atalanta       3
6: 2020-11-08 Atalanta       3

That result can easily be joined back to the original to get all the columns, like this:

games_played[fb, on=.(Team, Date)]
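As a possible variant of the self-join above, the count and the join back can be folded into one step with by = .EACHI, which evaluates j once per row of the inner table. This is only a sketch under assumptions: the helper column start and the result column games_30d are hypothetical names, and the table is rebuilt here from the question's dates so that the block runs on its own.

```r
library(data.table)

# Rebuild a small version of fb from the question's sample dates.
fb <- data.table(
  Team = "Atalanta",
  Date = as.Date(c("2020-09-26", "2020-09-30", "2020-10-04", "2020-10-17",
                   "2020-10-24", "2020-10-31", "2020-11-08"))
)
fb[, start := Date - 30]  # hypothetical helper: lower bound of the window

# Non-equi self-join: for every row of the inner fb, match the rows of the
# outer fb with the same Team and a Date in [start, Date).
# by = .EACHI returns one result row per inner row, so .N is the match count,
# and it is 0 for games with no earlier match.
counts <- fb[fb, on = .(Team, Date < Date, Date >= start), .N, by = .EACHI]

# The result follows the row order of the inner table, so the counts can be
# assigned straight back.
fb[, games_30d := counts$N]
```

Compared with nomatch = 0 followed by a second join, this keeps the first game of each team in the result with a count of 0 rather than dropping it.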
