Calculate average of variable between two dates based on a different date column
How do I calculate the average of a variable between two date ranges using a loop or apply function?
I am trying to calculate the average of a device count over a date range (for example, from July 21 to July 28).
Here is what my data looks like:
# A tibble: 580,742 x 14
country_region_~ country_region sub_region_1 sub_region_2 census_fips_code date
<chr> <chr> <chr> <chr> <chr> <date>
1 US United States NA NA NA 2020-02-15
2 US United States NA NA NA 2020-02-16
3 US United States NA NA NA 2020-02-17
4 US United States NA NA NA 2020-02-18
5 US United States NA NA NA 2020-02-19
6 US United States NA NA NA 2020-02-20
7 US United States NA NA NA 2020-02-21
8 US United States NA NA NA 2020-02-22
9 US United States NA NA NA 2020-02-23
10 US United States NA NA NA 2020-02-24
# ... with 580,732 more rows, and 8 more variables:
# retail_and_recreation_percent_change_from_baseline <dbl>,
# grocery_and_pharmacy_percent_change_from_baseline <dbl>,
# parks_percent_change_from_baseline <dbl>,
# transit_stations_percent_change_from_baseline <dbl>,
# workplaces_percent_change_from_baseline <dbl>,
# residential_percent_change_from_baseline <dbl>, date2 <date>, date3 <date>
I was able to compute the average over one date range manually with the following code:
library(dplyr)
retailavg <- google.mobility %>%
  mutate(weekrange = date >= "2020-02-15" & date <= "2020-02-21") %>%
  filter(weekrange) %>%
  group_by(sub_region_2) %>%
  summarise(avgretail = mean(retail_and_recreation_percent_change_from_baseline))
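Loops are my worst nightmare, but if there is any way to write a loop or apply call so that I don't have to do each date range by hand, that would definitely help! I am an absolute beginner, so any advice is appreciated!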
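I don't know whether all of your summary periods will align on the same day of the week ("week-aligned"), so I've answered both cases. Frankly, the non-week-aligned answer works even if everything is perfectly week-aligned, so use it if you think you need the flexibility.
If it is always "by week" (regardless of which day of the week the periods align on), then you can simply compute a week number and group by that variable.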
library(dplyr)
dat %>%
  mutate(week = as.integer(date - as.Date("2020-02-15")) %/% 7) %>%
  group_by(week) %>%
  summarize(
    startdate = min(date), enddate = max(date),
    avgval = mean(val)
  )
# # A tibble: 53 x 4
# week startdate enddate avgval
# <dbl> <date> <date> <dbl>
# 1 -7 2020-01-01 2020-01-03 0.525
# 2 -6 2020-01-04 2020-01-10 0.568
# 3 -5 2020-01-11 2020-01-17 0.460
# 4 -4 2020-01-18 2020-01-24 0.657
# 5 -3 2020-01-25 2020-01-31 0.468
# 6 -2 2020-02-01 2020-02-07 0.494
# 7 -1 2020-02-08 2020-02-14 0.444
# 8 0 2020-02-15 2020-02-20 0.391
# 9 1 2020-02-22 2020-02-28 0.472
# 10 2 2020-02-29 2020-03-06 0.502
# # ... with 43 more rows
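The trick is that we anchor the weekly rollover to an arbitrary date (here, your "2020-02-15"), so that day of the week, and every 7-day repetition of it through the year, marks the start of a window. Here is a sample of that: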
dat %>%
  mutate(week = as.integer(date - as.Date("2020-02-15")) %/% 7) %>%
  group_by(week) %>%
  filter(week == 0 | (week == -1 & row_number() == n()) | (week == 1 & row_number() == 1))
# # A tibble: 17 x 3
# # Groups: week [3]
# date val week
# <date> <dbl> <dbl>
# 1 2020-02-14 0.814 -1
# 2 2020-02-15 0.130 0
# 3 2020-02-15 0.811 0
# 4 2020-02-15 0.0691 0
# 5 2020-02-16 0.476 0
# 6 2020-02-16 0.537 0
# 7 2020-02-16 0.207 0
# 8 2020-02-18 0.210 0
# 9 2020-02-18 0.521 0
# 10 2020-02-18 0.998 0
# 11 2020-02-18 0.946 0
# 12 2020-02-18 0.309 0
# 13 2020-02-18 0.440 0
# 14 2020-02-18 0.0271 0
# 15 2020-02-20 0.148 0
# 16 2020-02-20 0.0295 0
# 17 2020-02-22 0.972 1
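Here you can see that group 0 covers "2020-02-15" through "2020-02-21" (even though this random data contains no 02-21). The actual numbers -1, 0, 1 are completely arbitrary; we only use them as grouping keys.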
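If the periods are not week-aligned, this can be done without loops using a "non-equi" or "range" join. Unfortunately, dplyr did not natively support this when this answer was written (though it supported it indirectly via dbplyr::sql_on; dplyr 1.1.0 and later accept inequality conditions through join_by()), but here are some alternatives: data.table, sqldf, and fuzzyjoin (with dplyr):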
library(data.table)
datDT <- as.data.table(dat)
ranges <- data.table(
  date = as.Date(c("2020-02-15", "2020-03-01", "2020-09-14")),
  enddate = as.Date(c("2020-02-21", "2020-03-05", "2020-09-30"))
)
ranges
# date enddate
# 1: 2020-02-15 2020-02-21
# 2: 2020-03-01 2020-03-05
# 3: 2020-09-14 2020-09-30
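# Carry both range bounds through the join explicitly (i.date, i.enddate):
# after a non-equi join the `date` column holds the range start, so
# max(date) would just repeat the start date instead of the end date.
datDT[ranges, on = .(date >= date, date <= enddate),
      .(date = i.date, enddate = i.enddate, val)] %>%
  .[, .(avgval = mean(val)), by = .(date, enddate)]
# date enddate avgval
# 1: 2020-02-15 2020-02-21 0.390515534
# 2: 2020-03-01 2020-03-05 0.533702911
# 3: 2020-09-14 2020-09-30 0.479576581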
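(The first row of ranges is deliberately the same week as above, and it shows the same average of 0.391.) This effects a left join on ranges; if dplyr supported non-equi joins, it would be left_join(ranges, dat, ...). (In fact, see the fuzzyjoin option at the bottom of this answer.)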
Similarly,
# library(sqldf)
sqldf::sqldf("
  select r.date, r.enddate, avg(val) as avgval
  from ranges r
    left join dat d on r.date <= d.date and r.enddate >= d.date
  group by r.date")
# date enddate avgval
# 1 2020-02-15 2020-02-21 0.390515534
# 2 2020-03-01 2020-03-05 0.533702911
# 3 2020-09-14 2020-09-30 0.479576581
Finally, you can use fuzzyjoin:
fuzzyjoin::fuzzy_left_join(
  ranges, dat, by = c("date" = "date", "enddate" = "date"),
  match_fun = list(`<=`, `>=`)) %>%
  group_by(date = date.x) %>%
  summarize(enddate = max(enddate), dateavgval = mean(val))
# # A tibble: 3 x 3
# date enddate dateavgval
# <date> <date> <dbl>
# 1 2020-02-15 2020-02-21 0.391
# 2 2020-03-01 2020-03-05 0.534
# 3 2020-09-14 2020-09-30 0.480
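Since the question asked specifically about a loop or apply function, the same range-based averaging can also be sketched in base R with `vapply`, with no extra packages. This is only an illustration on small made-up data: the `dat_small` and `ranges_small` objects below are hypothetical stand-ins shaped like `dat` and `ranges`, and this is a plain row-by-row filter, not a join.

```r
# Hypothetical stand-in for `dat`: ten consecutive days with known values.
dat_small <- data.frame(
  date = as.Date("2020-02-14") + 0:9,
  val  = c(10, 1, 2, 3, 4, 5, 6, 7, 100, 200)
)

# Hypothetical stand-in for `ranges`: one row per (start, end) window.
ranges_small <- data.frame(
  date    = as.Date(c("2020-02-15", "2020-02-22")),
  enddate = as.Date(c("2020-02-21", "2020-02-23"))
)

# For each range, keep the rows whose date falls inside it (inclusive on
# both ends) and average their values. vapply enforces one number per range.
ranges_small$avgval <- vapply(
  seq_len(nrow(ranges_small)),
  function(i) {
    in_range <- dat_small$date >= ranges_small$date[i] &
      dat_small$date <= ranges_small$enddate[i]
    mean(dat_small$val[in_range])
  },
  numeric(1)
)
ranges_small
```

This rescans the data once per range, so it is likely to be slower than the join-based options when there are many ranges, but it may be easier to follow for a beginner. A range that matches no rows yields NaN, because `mean()` of an empty vector is NaN.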
Data:
set.seed(42)
dat <- data.frame(
  date = as.Date("2020-01-01") + sample(365, size = 1000, replace = TRUE) - 1,
  val = runif(1000)
)
dat <- dat[order(dat$date),]
str(dat)
# 'data.frame': 1000 obs. of 2 variables:
# $ date: Date, format: "2020-01-01" "2020-01-01" "2020-01-02" "2020-01-02" ...
# $ val : num 0.517 0.184 0.255 0.845 0.839 ...