
Count active observations by week

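I have a data frame of observations, each with a start date and an end date marking the period during which it was active. The length of the active period varies between observations and can span several weeks. Some observations are still active and have no end date.

For a given date range, how can I count the number of observations that were active in each week of that range, including those that are still active?

I have a crude method that works, but it is slow. It seems there must be a more efficient and simpler way to do this.

Edit: My first approach was similar to Ronak's solution, which is definitely better than mine for smaller data sets, but my real data set has many more observations and a much longer date range, so I ran into memory constraints.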

#I'm primarily using tidyverse/lubridate, but definitely open to other solutions.
library(tidyverse)
library(lubridate)

# sample data frame of observations with start and end dates:
df_obs <- tibble(
  observation = c(1:10),
  date_start = as_date(c("2020-03-17", "2020-01-20", "2020-02-06", "2020-01-04", "2020-01-06", "2020-01-24", "2020-01-09", "2020-02-11", "2020-03-13", "2020-02-07")),
  date_end = as_date(c("2020-03-27", "2020-03-20", NA, "2020-03-04", "2020-01-16", "2020-02-24", NA, "2020-02-19", NA, "2020-02-27"))
  ) 

# to account for observations that are still active, NAs are converted to today's date:
df_obs <- mutate(df_obs, date_end = if_else(is.na(date_end), Sys.Date(), date_end)) 

# create a data frame of weeks by start and end date to count the active observations in a given week 
# for this example I'm just using date ranges from the sample data: 
df_weeks <- 
  seq(min(df_obs$date_start), max(df_obs$date_start), by = 'day') %>% 
  enframe(NULL, 'week_start') %>% 
  mutate(week_start = as_date(cut(week_start, "week"))) %>% 
  mutate(week_end = week_start + 6) %>% 
  distinct()

# create a function that filters the observations data frame based on start and end dates:   
check_active <- function(d, s, e){
  d %>% 
    filter(date_start <= e) %>% 
    filter(date_end >= s) %>% 
    nrow()
}

# applying that function to each week in the date range data frame gives the expected results:
df_weeks %>% 
  rowwise() %>% 
  mutate(total_active = check_active(df_obs, week_start, week_end)) %>%
  select(-week_end) %>% 
  ungroup()
# A tibble: 12 x 2
   week_start total_active
  <date>            <int>
 1 2019-12-30            1
 2 2020-01-06            3
 3 2020-01-13            3
 4 2020-01-20            4
 5 2020-01-27            4
 6 2020-02-03            6
 7 2020-02-10            7
 8 2020-02-17            7
 9 2020-02-24            6
10 2020-03-02            4
11 2020-03-09            4
12 2020-03-16            5

Here is one approach:

library(tidyverse)

df_obs %>%
  #Replace NA with today's date
  #Create sequence between start and end date
  mutate(date_end = replace(date_end, is.na(date_end), Sys.Date()),
         date = map2(date_start, date_end, seq, "day")) %>%
  #Get data in long format
  unnest(date) %>%
  #Drop the start and end date columns
  select(-date_start, -date_end) %>%
  #Cut data by week
  mutate(date = cut(date, "week")) %>%
  #Get unique rows for observation and date
  distinct(observation, date) %>%
  #Count number of observation in each week
  count(date)

This returns the following (the final rows depend on Sys.Date() at the time the code is run):

# A tibble: 14 x 2
#   date           n
#   <fct>      <int>
# 1 2019-12-30     1
# 2 2020-01-06     3
# 3 2020-01-13     3
# 4 2020-01-20     4
# 5 2020-01-27     4
# 6 2020-02-03     6
# 7 2020-02-10     7
# 8 2020-02-17     7
# 9 2020-02-24     6
#10 2020-03-02     4
#11 2020-03-09     4
#12 2020-03-16     5
#13 2020-03-23     4
#14 2020-03-30     3
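
Because expanding every observation to one row per day is what runs into memory limits on larger data, here is a minimal base R sketch of an alternative that never expands the data. It assumes every observation has date_start <= date_end, in which case the number of observations overlapping a week equals the number that started on or before the week's end minus the number that ended before the week's start. The helper names monday_of and count_active_by_week are hypothetical, introduced only for this illustration.

```r
# Sample data from the question; an NA end date means "still active".
date_start <- as.Date(c("2020-03-17", "2020-01-20", "2020-02-06", "2020-01-04",
                        "2020-01-06", "2020-01-24", "2020-01-09", "2020-02-11",
                        "2020-03-13", "2020-02-07"))
date_end <- as.Date(c("2020-03-27", "2020-03-20", NA, "2020-03-04", "2020-01-16",
                      "2020-02-24", NA, "2020-02-19", NA, "2020-02-27"))

# Treat open-ended observations as active through today.
date_end[is.na(date_end)] <- Sys.Date()

# Hypothetical helper: snap a date back to the Monday of its week
# ("%u" is the ISO weekday number, Monday = 1).
monday_of <- function(d) d - (as.integer(format(d, "%u")) - 1)

# Hypothetical helper: count observations overlapping [week_start, week_start + 6].
# Assumes date_start <= date_end for every observation.
count_active_by_week <- function(date_start, date_end, week_start) {
  starts <- sort(as.numeric(date_start))
  ends   <- sort(as.numeric(date_end))
  ws     <- as.numeric(week_start)
  started  <- findInterval(ws + 6, starts)  # observations with start <= week end
  finished <- findInterval(ws - 1, ends)    # observations with end < week start
  started - finished
}

# Week grid over the same range as df_weeks in the question.
week_start <- seq(monday_of(min(date_start)), monday_of(max(date_start)), by = "week")

result <- data.frame(
  week_start   = week_start,
  total_active = count_active_by_week(date_start, date_end, week_start)
)
print(result)
```

On the sample data this should reproduce the 12 weekly counts shown in the question. The work is two sorts plus one binary search per week, so memory use grows with the number of observations and weeks rather than with the number of active days.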

