
Group_by and between date summary in R

Fellow R users!

I've spent the last 2 hours banging my head against this problem and couldn't find a solution.

Intro:

I'm working on a COVID dataset and have to compute the weekly incidence and prevalence for each location.

The incidence is easy; my code is something like this:

I created a column "week" that assigns each date to a week with this function:

floor_date_by_week <- function(the_date) {
  return(lubridate::date(the_date) - lubridate::wday(the_date-1) +1 )
}
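As a quick sanity check, the helper above appears to floor each date to the Monday of its week, which is what lubridate's floor_date() does with week_start = 1. A minimal sketch comparing the two, assuming lubridate's default setting where wday() numbers Sunday as 1; `check_dates`, `manual`, and `builtin` are names introduced here for illustration only.

```r
library(lubridate)

# Helper from the question: floors a date to the Monday of its week
floor_date_by_week <- function(the_date) {
  lubridate::date(the_date) - lubridate::wday(the_date - 1) + 1
}

# Illustrative dates (assumption): a Friday, a Sunday, a Monday, a Wednesday
check_dates <- as.Date(c("2020-02-21", "2020-02-23", "2020-02-24", "2020-02-26"))

manual  <- floor_date_by_week(check_dates)
builtin <- floor_date(check_dates, unit = "week", week_start = 1)

# Side-by-side comparison of the two approaches
data.frame(date = check_dates, manual = manual, builtin = builtin)
```

If the two columns agree on your own dates as well, floor_date() can replace the hand-written helper.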

and then I computed the incidence:

output <- data %>% 
  group_by(week, Location) %>% 
  summarise(cases = n()) %>% 
  left_join(., pop, by = c("Location" = "Location")) %>% 
  mutate(inc = round(100000 * cases / pop, 2)) %>%  # the count column is named "cases" above, not "n"
  select(-pop)

Now I have to compute the number of active positives per week and place, and this is driving me crazy.

The problem:

Every row in my dataset is a person. I have one variable for the date of infection and one for the date of recovery/death. Between the two dates the patient is positive, and I have to include the patient in the group_by for every week in that span, but I don't know how.
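To make "positive between the two dates" concrete: one reasonable reading is that a patient counts as positive in a week if the infection date is on or before the week's last day and the recovery/death date is on or after its first day. A minimal base R sketch for a single week, using the toy rows from this post; `week_start`, `week_end`, and `is_active` are illustrative names, and counting the recovery day itself as still positive is an assumption.

```r
# Toy rows from the post (one row per patient)
patients <- data.frame(
  Patientid = 1:3,
  date_of_infection = as.Date(c("2020-02-21", "2020-02-23", "2020-02-26")),
  date_of_recovery_death = as.Date(c("2020-03-02", "2020-04-15", "2020-03-12")),
  Location = c("A", "A", "B")
)

# One week to inspect (Monday to Sunday); chosen arbitrarily for illustration
week_start <- as.Date("2020-03-09")
week_end   <- week_start + 6

# Interval-overlap test: infected on or before the week's end,
# and not recovered/dead before the week's start
is_active <- patients$date_of_infection <= week_end &
  patients$date_of_recovery_death >= week_start

# Active patients per location in that week
active_by_location <- table(patients$Location[is_active])
active_by_location
```

Repeating this test for every week is what the group_by has to achieve; the answers do it either with running totals or by expanding each patient into one row per week.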

Example toy dataset:

Patientid  date_of_infection  date_of_recovery_death  Location  Week
1          2020-02-21         2020-03-02              A         2020-02-17
2          2020-02-23         2020-04-15              A         2020-02-17
3          2020-02-26         2020-03-12              B         2020-02-24
...        ...                ...                     ...       ...

This might help:

df <- read.table(text = "Patientid  date_of_infection   date_of_recovery_death  Location    Week
1   2020-02-21  2020-03-02  A   2020-02-17
2   2020-02-23  2020-04-15  A   2020-02-17
3   2020-02-26  2020-03-12  B   2020-02-24", header = T)

suppressMessages(library(tidyverse))  # library() attaches one package per call; lubridate is used via lubridate:: below
df %>% pivot_longer(c(date_of_infection, date_of_recovery_death), names_prefix = 'date_of_',
                    names_to = 'event', values_to = 'date') %>%
  mutate(date = as.Date(date),
         Week = date - lubridate::wday(date-1) +1,
         dummy = ifelse(event == 'infection', 1, -1)) %>%
  group_by(Week, Location) %>%
  summarise(active_cases_addition_or_recovered = sum(dummy), .groups = 'drop') %>%
  arrange(Location, Week) %>%
  group_by(Location) %>%
  mutate(net_active_cases = cumsum(active_cases_addition_or_recovered))
#> # A tibble: 5 x 4
#>   Week       Location active_cases_addition_or_recovered net_active_cases
#>   <date>     <chr>                                 <dbl>            <dbl>
#> 1 2020-02-17 A                                         2                2
#> 2 2020-03-02 A                                        -1                1
#> 3 2020-04-13 A                                        -1                0
#> 4 2020-02-24 B                                         1                1
#> 5 2020-03-09 B                                        -1                0

Created on 2021-05-05 by the reprex package (v2.0.0)
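The pipeline above treats each infection as +1 and each recovery/death as -1, then takes a running total per location. A stripped-down base R sketch of that idea for Location A only; the vectors `event_week` and `delta` are written out by hand here for illustration. Note that under this convention a patient stops counting in the week of recovery, which may or may not match the definition of prevalence you need.

```r
# Events for Location A, already floored to Mondays (taken from the output above)
event_week <- as.Date(c("2020-02-17", "2020-02-17", "2020-03-02", "2020-04-13"))
delta      <- c(1, 1, -1, -1)   # +1 = infection, -1 = recovery/death

# Net change per week, then the running total of active cases
net_change <- tapply(delta, event_week, sum)
net_active <- cumsum(net_change)
net_active
```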

If some weeks have no events, this version fills them in so that every week appears for every location:

df %>% pivot_longer(c(date_of_infection, date_of_recovery_death), names_prefix = 'date_of_',
                    names_to = 'event', values_to = 'date') %>%
  mutate(date = as.Date(date),
         Week = date - lubridate::wday(date-1) +1,
         dummy = ifelse(event == 'infection', 1, -1)) %>%
  group_by(Week, Location) %>%
  summarise(active_cases_addition_or_recovered = sum(dummy), .groups = 'drop') %>%
  complete(Week = seq.Date(min(Week), max(Week), by = '7 days'), 
           nesting(Location), 
           fill = list(active_cases_addition_or_recovered = 0)) %>%
  arrange(Location, Week) %>%
  group_by(Location) %>%
  mutate(net_active_cases = cumsum(active_cases_addition_or_recovered))
#> # A tibble: 18 x 4
#>    Week       Location active_cases_addition_or_recovered net_active_cases
#>    <date>     <chr>                                 <dbl>            <dbl>
#>  1 2020-02-17 A                                         2                2
#>  2 2020-02-24 A                                         0                2
#>  3 2020-03-02 A                                        -1                1
#>  4 2020-03-09 A                                         0                1
#>  5 2020-03-16 A                                         0                1
#>  6 2020-03-23 A                                         0                1
#>  7 2020-03-30 A                                         0                1
#>  8 2020-04-06 A                                         0                1
#>  9 2020-04-13 A                                        -1                0
#> 10 2020-02-17 B                                         0                0
#> 11 2020-02-24 B                                         1                1
#> 12 2020-03-02 B                                         0                1
#> 13 2020-03-09 B                                        -1                0
#> 14 2020-03-16 B                                         0                0
#> 15 2020-03-23 B                                         0                0
#> 16 2020-03-30 B                                         0                0
#> 17 2020-04-06 B                                         0                0
#> 18 2020-04-13 B                                         0                0

Created on 2021-05-05 by the reprex package (v2.0.0)
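The gap filling in the second pipeline comes from tidyr::complete(). A minimal sketch of that step in isolation, on a two-row toy table; `weekly`, `filled`, and the column name `change` are illustrative names, not part of the answer's code.

```r
library(tidyr)

# Two weeks with events for one location, with a one-week gap between them
weekly <- data.frame(
  Week     = as.Date(c("2020-02-17", "2020-03-02")),
  Location = c("A", "A"),
  change   = c(2, -1)
)

# Build the full weekly sequence, cross it with the locations present in the data,
# and fill the newly created rows with a net change of 0
filled <- complete(
  weekly,
  Week = seq.Date(min(Week), max(Week), by = "7 days"),
  nesting(Location),
  fill = list(change = 0)
)
filled
```

The week of 2020-02-24 is added with a change of 0, so a later cumsum() carries the previous total forward instead of skipping the week.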

Libraries & Data

library(dplyr)
library(tidyr)
library(lubridate)

df <- read.table(text = "Patientid  date_of_infection   date_of_recovery_death  Location
1   2020-02-21  2020-03-02  A
2   2020-02-23  2020-04-15  A
3   2020-02-26  2020-03-12  B", header = T)

Data preparation

Floor the dates to the beginning of the week (Monday):

# data preparation: parse the date columns, then floor them to the Monday of their week
df <- df %>%
  mutate(across(c(date_of_infection, date_of_recovery_death), as_date)) %>% 
  mutate(across(c(date_of_infection, date_of_recovery_death),
                ~ floor_date(.x, unit = "week", week_start = 1)))

Number of infected

Expand the weeks of infection for each patient and location, then count the patients who are infected in each week and location.

# number of infected by week
df %>% 
  rowwise() %>% 
  summarise(week = seq.Date(date_of_infection, date_of_recovery_death, by = "7 days"),
            Patientid, Location) %>% 
  count(Location, week)

#> # A tibble: 12 x 3
#>    Location week           n
#>    <chr>    <date>     <int>
#>  1 A        2020-02-17     2
#>  2 A        2020-02-24     2
#>  3 A        2020-03-02     2
#>  4 A        2020-03-09     1
#>  5 A        2020-03-16     1
#>  6 A        2020-03-23     1
#>  7 A        2020-03-30     1
#>  8 A        2020-04-06     1
#>  9 A        2020-04-13     1
#> 10 B        2020-02-24     1
#> 11 B        2020-03-02     1
#> 12 B        2020-03-09     1 
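To get from these weekly counts to a prevalence, one option is to join a population table, as the question does for incidence. A minimal sketch, assuming a lookup table with one row per location; `pop_tbl` and its population figures are made up for illustration, and `active` holds three rows copied from the result above.

```r
library(dplyr)

# Three rows copied from the weekly counts above
active <- data.frame(
  Location = c("A", "A", "B"),
  week     = as.Date(c("2020-02-17", "2020-03-09", "2020-02-24")),
  n        = c(2L, 1L, 1L)
)

# Hypothetical population table; the numbers are invented for this example
pop_tbl <- data.frame(Location = c("A", "B"), pop = c(50000, 20000))

prevalence <- active %>%
  left_join(pop_tbl, by = "Location") %>%
  mutate(prev = round(100000 * n / pop, 2)) %>%   # active cases per 100,000 inhabitants
  select(-pop)
prevalence
```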
