Sum shares/number of rows according to date by groups in R data.table/frame
I would like to calculate the sum of squared occurrence counts (i.e. the number of rows) of each unique value of group A (industry) within group B (country) over the previous year.
Calculation example for row 5: 2x A + 1x B + 1x C = 2^2 + 1^2 + 1^2 = 6
(this does not include the A from row 1 because it is older than a year, nor the A from row 6 because it is in another country).
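The arithmetic for row 5 can be checked directly in R. This is only a minimal sketch: the vector in_window is a hypothetical name, typed by hand from the UK rows that fall inside the one-year window ending 2017-01-04.

```r
# Industries of the UK rows dated within one year up to 2017-01-04
# (rows 2, 3, 4 and 5 of the example data; typed by hand for illustration)
in_window <- c("A", "B", "C", "A")

# Count the occurrences of each industry: A = 2, B = 1, C = 1
counts <- table(in_window)

# Sum of the squared counts: 2^2 + 1^2 + 1^2
result <- sum(counts^2)
result
```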
I managed to calculate the counts by row, but I am failing to move this to the aggregated date level:
dt[, count_by_industry:= sapply(date, function(x) length(industry[between(date, x - lubridate::years(1), x)])),
by = c("country", "industry")]
The solution should ideally scale to real data with ~2 million rows and around 10k dates and group elements (hence the data.table tag).
Example Data
ID <- c("1","2","3","4","5","6")
Date <- c("2016-01-02","2017-01-01", "2017-01-03", "2017-01-03", "2017-01-04","2017-01-03")
Industry <- c("A","A","B","C","A","A")
Country <- c("UK","UK","UK","UK","UK","US")
Desired <- c(1,4,3,3,6,1)
library(data.table)
dt <- data.frame(id=ID, date=Date, industry=Industry, country=Country, desired_output=Desired)
setDT(dt)[, date := as.Date(date)]
Adapting from your start:
dt[, output:= sapply(date, function(x) sum(table(industry[between(date, x - lubridate::years(1), x)]) ^ 2)),
by = c("country")]
dt
   id       date industry country desired_output output
1:  1 2016-01-02        A      UK              1      1
2:  2 2017-01-01        A      UK              4      4
3:  3 2017-01-03        B      UK              3      3
4:  4 2017-01-03        C      UK              3      3
5:  5 2017-01-04        A      UK              6      6
6:  6 2017-01-03        A      US              1      1
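The sapply call rescans each country's rows once per row, which may become slow at the ~2 million row scale mentioned in the question. One possible alternative, sketched here under assumptions and not benchmarked, is a data.table non-equi join that computes each (country, date) window once and then aggregates. The names windows, win_start, win_end, pairs, counts and sq are hypothetical, and the window is a fixed 365 days instead of lubridate::years(1); the two agree on the example data but can differ by a day around leap years.

```r
library(data.table)

# Example data from the question, rebuilt so the block is self-contained
dt <- data.table(
  id       = as.character(1:6),
  date     = as.Date(c("2016-01-02", "2017-01-01", "2017-01-03",
                       "2017-01-03", "2017-01-04", "2017-01-03")),
  industry = c("A", "A", "B", "C", "A", "A"),
  country  = c("UK", "UK", "UK", "UK", "UK", "US")
)

# One look-back window per unique (country, date) pair.
# Assumption: a "year" is a fixed 365 days, not a calendar year.
windows <- unique(dt[, .(country, win_end = date)])
windows[, win_start := win_end - 365L]

# Non-equi join: attach to every window the rows of the same country
# whose date lies in [win_start, win_end]
pairs <- dt[windows,
            on = .(country, date >= win_start, date <= win_end),
            .(country, win_end = i.win_end, industry = x.industry),
            allow.cartesian = TRUE]

# Rows per industry within each window, then the sum of squared counts
counts <- pairs[, .N, by = .(country, win_end, industry)]
sq     <- counts[, .(output = sum(N^2)), by = .(country, win_end)]

# Write the result back onto the original rows by reference
dt[sq, on = .(country, date = win_end), output := i.output]
dt
```

Note that the intermediate table pairs holds one row per (window, matching row) combination, so with dense data it can grow much larger than dt; whether this is faster than the sapply version on real data would need to be measured.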