Group_by and between date summary in R
Fellow R users!
I've spent the last two hours banging my head against this problem without finding a solution.
Intro:
I'm working on a COVID dataset and I have to compute the incidence and prevalence on a weekly basis for a given place.
The incidence is easy. I created a column "week" that assigns each date to the Monday of its week with this function:
floor_date_by_week <- function(the_date) {
  # subtract the weekday offset so that every date maps to the Monday of its week
  return(lubridate::date(the_date) - lubridate::wday(the_date - 1) + 1)
}
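As a quick sanity check of the helper above, here is a minimal base-R sketch that computes the same Monday-based week start without lubridate. The name `week_start` is hypothetical and not part of the original post. Note that the original helper relies on lubridate's default week start for `wday()` (Sunday = 1); this sketch instead uses the ISO weekday from `format(d, "%u")`, where Monday = 1.

```r
# Base-R sketch of a Monday-based week floor ("%u" = ISO weekday, Monday = 1 ... Sunday = 7)
week_start <- function(d) {
  d <- as.Date(d)
  d - (as.integer(format(d, "%u")) - 1L)
}

# Dates taken from the toy dataset in the question, plus one Monday
dates <- as.Date(c("2020-02-21", "2020-02-23", "2020-02-24", "2020-02-26"))
weeks <- week_start(dates)

# Show each date next to the week it is assigned to
print(data.frame(date = dates, week = weeks))
```

A Friday (2020-02-21) and a Sunday (2020-02-23) both fall in the week starting Monday 2020-02-17, which matches the Week column of the toy dataset.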
Then I computed the incidence:
output <- data %>%
  group_by(week, Location) %>%
  summarise(cases = n()) %>%
  left_join(., pop, by = c("Location" = "Location")) %>%
  # use the column created by summarise(); the original referred to a non-existent `n`
  mutate(inc = round(100000 * cases / pop, 2)) %>%
  select(-pop)
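To make the incidence step concrete, here is a small worked sketch. The line list mirrors the toy dataset from the question; the population figures in `pop` are invented purely for illustration and do not come from the original post.

```r
library(dplyr)

# One row per case, already assigned to a week (values from the toy dataset)
linelist <- data.frame(
  week     = as.Date(c("2020-02-17", "2020-02-17", "2020-02-24")),
  Location = c("A", "A", "B")
)

# Hypothetical population table: these numbers are assumptions
pop <- data.frame(Location = c("A", "B"), pop = c(50000, 20000))

incidence <- linelist %>%
  count(week, Location, name = "cases") %>%      # new cases per week and location
  left_join(pop, by = "Location") %>%
  mutate(inc = round(100000 * cases / pop, 2)) %>%  # cases per 100,000 inhabitants
  select(-pop)

print(incidence)
```

With these assumed populations, location A has 2 cases in 50,000 people (4 per 100,000) and location B has 1 case in 20,000 people (5 per 100,000).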
Now I have to compute the number of currently positive patients per week and place, and this is driving me crazy.
The problem:
Every row in my dataset is a person. I have one variable for the date of infection and one for the date of recovery/death. Between the two dates the patient is positive, so I have to count them in every week of that interval in the group_by, but I don't know how.
Example toy dataset:
Patientid | date_of_infection | date_of_recovery_death | Location | Week |
---|---|---|---|---|
1 | 2020-02-21 | 2020-03-02 | A | 2020-02-17 |
2 | 2020-02-23 | 2020-04-15 | A | 2020-02-17 |
3 | 2020-02-26 | 2020-03-12 | B | 2020-02-24 |
... | ... | ... | ... | ... |
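This might help. Each infection counts as +1 and each recovery/death as -1 in its week, and the cumulative sum per location gives the net number of active cases: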
df <- read.table(text = "Patientid date_of_infection date_of_recovery_death Location Week
1 2020-02-21 2020-03-02 A 2020-02-17
2 2020-02-23 2020-04-15 A 2020-02-17
3 2020-02-26 2020-03-12 B 2020-02-24", header = T)
# library() attaches one package per call, so load the two packages separately
suppressMessages({
  library(tidyverse)
  library(lubridate)
})
df %>%
  pivot_longer(c(date_of_infection, date_of_recovery_death),
               names_prefix = 'date_of_',
               names_to = 'event', values_to = 'date') %>%
  mutate(date = as.Date(date),
         Week = date - lubridate::wday(date - 1) + 1,   # Monday of the event's week
         dummy = ifelse(event == 'infection', 1, -1)) %>%  # +1 infection, -1 recovery/death
  group_by(Week, Location) %>%
  summarise(active_cases_addition_or_recovered = sum(dummy), .groups = 'drop') %>%
  arrange(Location, Week) %>%
  group_by(Location) %>%
  mutate(net_active_cases = cumsum(active_cases_addition_or_recovered))
#> # A tibble: 5 x 4
#>   Week       Location active_cases_addition_or_recovered net_active_cases
#>   <date>     <chr>                                 <dbl>            <dbl>
#> 1 2020-02-17 A                                         2                2
#> 2 2020-03-02 A                                        -1                1
#> 3 2020-04-13 A                                        -1                0
#> 4 2020-02-24 B                                         1                1
#> 5 2020-03-09 B                                        -1                0
Created on 2021-05-05 by the reprex package (v2.0.0)
If some weeks have no events, they are missing from the result above. Filling them in with complete() gives one row per week and location, which makes for a better presentation:
df %>%
  pivot_longer(c(date_of_infection, date_of_recovery_death),
               names_prefix = 'date_of_',
               names_to = 'event', values_to = 'date') %>%
  mutate(date = as.Date(date),
         Week = date - lubridate::wday(date - 1) + 1,
         dummy = ifelse(event == 'infection', 1, -1)) %>%
  group_by(Week, Location) %>%
  summarise(active_cases_addition_or_recovered = sum(dummy), .groups = 'drop') %>%
  # add a zero-change row for every week without events, for each location
  complete(Week = seq.Date(min(Week), max(Week), by = '7 days'),
           nesting(Location),
           fill = list(active_cases_addition_or_recovered = 0)) %>%
  arrange(Location, Week) %>%
  group_by(Location) %>%
  mutate(net_active_cases = cumsum(active_cases_addition_or_recovered))
#> # A tibble: 18 x 4
#>    Week       Location active_cases_addition_or_recovered net_active_cases
#>    <date>     <chr>                                 <dbl>            <dbl>
#>  1 2020-02-17 A                                         2                2
#>  2 2020-02-24 A                                         0                2
#>  3 2020-03-02 A                                        -1                1
#>  4 2020-03-09 A                                         0                1
#>  5 2020-03-16 A                                         0                1
#>  6 2020-03-23 A                                         0                1
#>  7 2020-03-30 A                                         0                1
#>  8 2020-04-06 A                                         0                1
#>  9 2020-04-13 A                                        -1                0
#> 10 2020-02-17 B                                         0                0
#> 11 2020-02-24 B                                         1                1
#> 12 2020-03-02 B                                         0                1
#> 13 2020-03-09 B                                        -1                0
#> 14 2020-03-16 B                                         0                0
#> 15 2020-03-23 B                                         0                0
#> 16 2020-03-30 B                                         0                0
#> 17 2020-04-06 B                                         0                0
#> 18 2020-04-13 B                                         0                0
Created on 2021-05-05 by the reprex package (v2.0.0)
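One detail of the +1/-1 approach above is worth spelling out: because the -1 is booked in the week of recovery/death, the cumulative sum reports the number of patients still active at the end of each week. If a patient should instead count as active during their recovery week, one option is to book the -1 one week later. The sketch below is an assumption about the desired definition, not part of the original answer; `week_start` and `change` are hypothetical names.

```r
library(dplyr)

df <- data.frame(
  Patientid              = 1:3,
  date_of_infection      = as.Date(c("2020-02-21", "2020-02-23", "2020-02-26")),
  date_of_recovery_death = as.Date(c("2020-03-02", "2020-04-15", "2020-03-12")),
  Location               = c("A", "A", "B")
)

# Monday of the week containing d ("%u" = ISO weekday, Monday = 1)
week_start <- function(d) d - (as.integer(format(d, "%u")) - 1L)

# +1 in the infection week, -1 in the week AFTER the recovery week
events <- bind_rows(
  transmute(df, Location, Week = week_start(date_of_infection), change = 1L),
  transmute(df, Location, Week = week_start(date_of_recovery_death) + 7, change = -1L)
)

active <- events %>%
  group_by(Location, Week) %>%
  summarise(change = sum(change), .groups = "drop") %>%
  arrange(Location, Week) %>%
  group_by(Location) %>%
  mutate(net_active_cases = cumsum(change)) %>%
  ungroup()

print(active)
```

With this shift, location A still shows 2 active cases through the week of 2020-03-02 and drops to 1 only from 2020-03-09, so a patient is counted in every week that overlaps their infection interval.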
library(dplyr)
library(tidyr)
library(lubridate)
df <- read.table(text = "Patientid date_of_infection date_of_recovery_death Location
1 2020-02-21 2020-03-02 A
2 2020-02-23 2020-04-15 A
3 2020-02-26 2020-03-12 B", header = T)
Floor the dates to the beginning of the week (Monday):
# data preparation
df <- df %>%
  mutate(across(c(date_of_infection, date_of_recovery_death), as_date)) %>%
  # a lambda keeps this working on dplyr >= 1.1.0, where passing extra arguments via across(...) is deprecated
  mutate(across(c(date_of_infection, date_of_recovery_death),
                ~ floor_date(.x, unit = "week", week_start = 1)))
Expand each patient's infection interval into one row per week, then count the patients that are infected in each week and Location.
# number of infected by week
# note: dplyr >= 1.1.0 deprecates summarise() calls that return several rows per group; reframe() is the replacement
df %>%
  rowwise() %>%
  summarise(week = seq.Date(date_of_infection, date_of_recovery_death, by = "7 days"),
            Patientid, Location) %>%
  count(Location, week)
#> # A tibble: 12 x 3
#>    Location week           n
#>    <chr>    <date>     <int>
#>  1 A        2020-02-17     2
#>  2 A        2020-02-24     2
#>  3 A        2020-03-02     2
#>  4 A        2020-03-09     1
#>  5 A        2020-03-16     1
#>  6 A        2020-03-23     1
#>  7 A        2020-03-30     1
#>  8 A        2020-04-06     1
#>  9 A        2020-04-13     1
#> 10 B        2020-02-24     1
#> 11 B        2020-03-02     1
#> 12 B        2020-03-09     1
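For readers on a recent dplyr, the same expansion can be sketched with reframe(), which is designed for results with several rows per group. This is a self-contained variant under the assumption that dplyr 1.1.0 or later is installed; the name `active_by_week` is hypothetical.

```r
library(dplyr)
library(lubridate)

df <- data.frame(
  Patientid              = 1:3,
  date_of_infection      = as.Date(c("2020-02-21", "2020-02-23", "2020-02-26")),
  date_of_recovery_death = as.Date(c("2020-03-02", "2020-04-15", "2020-03-12")),
  Location               = c("A", "A", "B")
)

active_by_week <- df %>%
  # floor both dates to the Monday of their week
  mutate(across(c(date_of_infection, date_of_recovery_death),
                ~ floor_date(.x, unit = "week", week_start = 1))) %>%
  rowwise() %>%
  # one row per patient and week between infection and recovery/death (inclusive)
  reframe(week = seq(date_of_infection, date_of_recovery_death, by = "7 days"),
          Patientid, Location) %>%
  count(Location, week)

print(active_by_week)
```

Unlike the +1/-1 cumulative sum, this counts a patient in every week their interval touches, including the week of recovery/death, which is why location A shows 2 in the week of 2020-03-02.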