Calculate number of occurrences within a specific time period
I have the following data, where ID stands for an individual, Date for the date, and Purchased for whether somebody made a purchase (I added this last column so that I can count the occurrences):
ID Date Purchased
1 1 2017-01-01 1
2 1 2017-08-03 1
3 1 2017-09-02 1
4 2 2017-09-04 1
5 2 2018-07-12 1
6 2 2018-11-03 1
7 2 2018-12-05 1
8 2 2019-01-01 1
9 3 2018-02-03 1
10 3 2020-02-03 1
11 3 2020-03-01 1
I would like to create a variable called "Frequency" that holds the number of purchases an individual has made in the past year, obtained by summing all the "Purchased" values that fall before the Date of the given row.
So, for example, row 3 would get a "Frequency" of 2, since 2017-01-01 and 2017-08-03 both fall within the one-year period before 2017-09-02 (that is, within the interval from 2016-09-02 to 2017-09-01).
See desired output:
ID Date Purchased Frequency
1 1 2017-01-01 1 0
2 1 2017-08-03 1 1
3 1 2017-09-02 1 2
4 2 2017-09-04 1 0
5 2 2018-07-12 1 1
6 2 2018-11-03 1 1
7 2 2018-12-05 1 2
8 2 2019-01-01 1 3
9 3 2018-02-03 1 0
10 3 2020-02-03 1 0
11 3 2020-03-01 1 1
To reproduce the dataframe:
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  Date = as.Date(c('2017-01-01', '2017-08-03', '2017-09-02', '2017-09-04',
                   '2018-07-12', '2018-11-03', '2018-12-05', '2019-01-01',
                   '2018-02-03', '2020-02-03', '2020-03-01')),
  Purchased = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)
I've searched on Stack Overflow but haven't yet found an answer that I can apply to my situation to obtain the desired results. One of the things that I found and tried was this:
df$frequency <-
  sapply(df$Date, function(x) {
    sum(df$Date < x & df$Date >= x - 365)
  })
I believe this might give me the results I want if I can find a way to make it group by ID (so that it sums per ID and not overall). I can't say for sure, of course, since I haven't been able to test it. Any help is much appreciated.
Here's a tidyverse solution:
library(dplyr)
library(purrr)
library(lubridate)
df %>%
  group_by(ID) %>%
  mutate(Frequency = map_dbl(Date,
    ~sum(Purchased[between(Date, .x - years(1), .x - 1)]))) %>%
  ungroup()
# ID Date Purchased Frequency
# <dbl> <date> <dbl> <dbl>
# 1 1 2017-01-01 1 0
# 2 1 2017-08-03 1 1
# 3 1 2017-09-02 1 2
# 4 2 2017-09-04 1 0
# 5 2 2018-07-12 1 1
# 6 2 2018-11-03 1 1
# 7 2 2018-12-05 1 2
# 8 2 2019-01-01 1 3
# 9 3 2018-02-03 1 0
#10 3 2020-02-03 1 0
#11 3 2020-03-01 1 1
The logic of the code is that, for every Date within each ID, it sums the Purchased values that fall between the current date minus 1 year and the current date minus 1 day.
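For readers who prefer to avoid extra packages, the same per-ID window logic can be sketched in base R. This is only an illustration, not part of the answer above: the helper name `count_prior_window` and its `window_days` argument are made up for this sketch, and it assumes a fixed 365-day window (as in the question's own attempt) rather than the calendar year that `years(1)` gives.

```r
# Sample data from the question
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  Date = as.Date(c('2017-01-01', '2017-08-03', '2017-09-02', '2017-09-04',
                   '2018-07-12', '2018-11-03', '2018-12-05', '2019-01-01',
                   '2018-02-03', '2020-02-03', '2020-03-01')),
  Purchased = rep(1, 11)
)

# Hypothetical helper: for each date, sum the purchases made strictly before it
# and no more than `window_days` days earlier (assumed fixed-length window).
count_prior_window <- function(dates, purchased, window_days = 365) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates < dates[i] & dates >= dates[i] - window_days
    sum(purchased[in_window])
  }, numeric(1))
}

# Apply the helper separately to the rows of each ID
df$Frequency <- 0
for (id in unique(df$ID)) {
  rows <- df$ID == id
  df$Frequency[rows] <- count_prior_window(df$Date[rows], df$Purchased[rows])
}

print(df)
```

Restricting the comparison to the rows of one ID at a time is what the question's `sapply` attempt was missing; the window test itself is unchanged.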
You could use non-equi joins with data.table:
library(data.table)
setDT(df)
df[,c("Date","Before"):=.(as.Date(Date),as.Date(Date)-365)]
df[df,.(ID, Date),on=.(ID=ID, Date>=Before, Date<=Date)][,.N-1,by=.(ID,Date)]
ID Date V1
1: 1 2017-01-01 0
2: 1 2017-08-03 1
3: 1 2017-09-02 2
4: 2 2017-09-04 0
5: 2 2018-07-12 1
6: 2 2018-11-03 1
7: 2 2018-12-05 2
8: 2 2019-01-01 3
9: 3 2018-02-03 0
10: 3 2020-02-03 0
11: 3 2020-03-01 1
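One detail worth keeping in mind when choosing between the two answers: the tidyverse version steps back one calendar year with `years(1)`, while the data.table version (like the question's attempt) steps back a fixed 365 days. Both give the same result on the sample data, but the window start can differ by a day when a leap day lies in between. A small base R check, using `seq()` as a stand-in for the calendar-year step (an assumption made here to keep the sketch free of extra packages):

```r
d <- as.Date("2020-03-01")

# Fixed-length window start: 365 days earlier
start_365 <- d - 365

# Calendar-year window start: same month and day, one year earlier
start_calendar <- seq(d, by = "-1 year", length.out = 2)[2]

print(start_365)       # 2019-03-02
print(start_calendar)  # 2019-03-01
```

A purchase made exactly on 2019-03-01 would therefore be counted by the calendar-year window but not by the 365-day one.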