
How do I aggregate messy quarterly data in R using Tidyverse, searching for first contiguous set of four quarters

I have a data manipulation and exclusion challenge that I just can't figure out how to approach successfully. I have data in a tidy format, with every observation in its own row. Here is a reprex for my dataset:

quarter <- c("Q4", "Q3", "Q2","Q1", "Q3", "Q2", "Q1","Q4", "Q2", "Q1", "Q4", "Q3", "Q2", "Q1","Q4", "Q3", "Q1")
year <- c("2020", "2020","2020","2020","2019","2019","2019", "2020", "2020","2020","2019","2019","2019","2019", "2020", "2020","2020")
country <- c("Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil", "Brazil","Brazil","Brazil","Brazil","France","France","France")
indicator <- c("Testing","Testing", "Testing","Testing","Testing","Testing","Testing","TestingPos","TestingPos","TestingPos","TestingPos","TestingPos","TestingPos","TestingPos", "Testing","Testing","Testing")
value <- sample(1:10, 17, replace = TRUE)  # random draw, so the values change on every run

quarterlydf <- data.frame(quarter, year, country, indicator, value)

   quarter year country  indicator value
1       Q4 2020  Brazil    Testing     9
2       Q3 2020  Brazil    Testing     3
3       Q2 2020  Brazil    Testing     2
4       Q1 2020  Brazil    Testing     7
5       Q3 2019  Brazil    Testing     1
6       Q2 2019  Brazil    Testing     5
7       Q1 2019  Brazil    Testing     6
8       Q4 2020  Brazil TestingPos     4
9       Q2 2020  Brazil TestingPos     4
10      Q1 2020  Brazil TestingPos     3
11      Q4 2019  Brazil TestingPos     7
12      Q3 2019  Brazil TestingPos     2
13      Q2 2019  Brazil TestingPos     8
14      Q1 2019  Brazil TestingPos     1
15      Q4 2020  France    Testing     1
16      Q3 2020  France    Testing     1
17      Q1 2020  France    Testing     8

For each country and indicator combination, I need to find the most recent contiguous four-quarter period. For that most recent set of four contiguous quarters (e.g. Q3 2019, Q4 2019, Q1 2020, Q2 2020), I need to create a new row in a new dataframe (annualdf here) with the country, the start and end quarter/year, the indicator, and the sum and the mean of the values for the included quarters.

All other contiguous quarter sets should be discarded, and so should any country and indicator combination that has no contiguous set of four quarters (France in the data above).

The product should look like this for the preceding frame:

    start     end country  indicator sum mean
1 Q1_2020 Q4_2020  Brazil    Testing  21 5.25
2 Q3_2019 Q2_2020  Brazil TestingPos  16    4

I won't go into all I've tried, but it's gotten very very ugly, involving trying to reassign sequential ids to each possible quarter/year combination, then use pivot_wider() to create multiple columns for each id, concatenate those columns into a single result, then use a grotesque set of str_detect() searches to search and assign values. Long story short, I think the entire approach I'm trying is very bad and incredibly inelegant.

There HAS to be an elegant way to do this.

Any suggestions would be very, very much appreciated. Thank you.

EDIT1: Per Limey there was a minor typo in the desired output (Q2_2019 was supposed to be Q2_2020). This has been fixed.

The syntax is a bit long (I will try to shorten it), but this works. The only assumption is that no year is completely missing for a country and indicator combination; otherwise year would also have to be filled in with complete(). Apart from that, the following works:

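library(dplyr)
library(tidyr)

quarterlydf %>% 
  arrange(desc(year), desc(quarter)) %>%
  # make every missing quarter explicit (value = NA) within each observed year
  group_by(country, indicator, year) %>%
  complete(quarter = rev(c("Q1", "Q2", "Q3", "Q4"))) %>%
  group_by(country, indicator) %>%
  # most recent quarter first within each country/indicator
  arrange(desc(year), desc(quarter), .by_group = TRUE) %>%
  # keep non-missing rows that sit in a run of at least four rows
  filter(with(rle(is.na(value)), rep(lengths, lengths)) >= 4, !is.na(value)) %>%
  # the first four remaining rows are the most recent contiguous quarters
  slice_head(n = 4) %>%
  summarise(start = paste0(last(year), "_", last(quarter)),
            end = paste0(first(year), "_", first(quarter)),
            sum = sum(value),
            mean = mean(value))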

# A tibble: 2 x 6
# Groups:   country [1]
  country indicator  start   end       sum  mean
  <chr>   <chr>      <chr>   <chr>   <int> <dbl>
1 Brazil  Testing    2020_Q1 2020_Q4    18   4.5
2 Brazil  TestingPos 2019_Q3 2020_Q2    16   4 
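
As a small illustration of the rle() step in the filter() call, here is how it behaves on a toy vector. The numbers are the Brazil/Testing values from the table printed in the question, ordered most recent first, with NA standing in for the missing Q4 2019 row that complete() adds. The names runs, run_len and keep exist only for this sketch.

```r
# Brazil/Testing values, most recent quarter first; NA is the filled-in Q4 2019
value <- c(9, 3, 2, 7, NA, 1, 5, 6)

# Run-length encode the missingness pattern
runs <- rle(is.na(value))
# runs$lengths is 4 1 3, runs$values is FALSE TRUE FALSE

# Repeat each run length once per element of its run,
# so every row knows how long its own run is
run_len <- rep(runs$lengths, runs$lengths)
# run_len is 4 4 4 4 1 3 3 3

# Keep non-missing values that sit in a run of at least four
keep <- run_len >= 4 & !is.na(value)
value[keep]
# 9 3 2 7
```

The three 2019 quarters fall in a run of length 3, so they are dropped, and only the four 2020 quarters survive the filter.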

The same can be done in chronological order:

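quarterlydf %>% 
  # make every missing quarter explicit (value = NA) within each observed year
  group_by(country, indicator, year) %>%
  complete(quarter = c("Q1", "Q2", "Q3", "Q4")) %>%
  group_by(country, indicator) %>%
  # oldest quarter first within each country/indicator
  arrange(year, quarter, .by_group = TRUE) %>%
  filter(with(rle(is.na(value)), rep(lengths, lengths)) >= 4, !is.na(value)) %>%
  # the last four remaining rows are the most recent contiguous quarters
  slice_tail(n = 4) %>%
  summarise(start = paste0(first(year), "_", first(quarter)),
            end = paste0(last(year), "_", last(quarter)),
            sum = sum(value),
            mean = mean(value))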
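
If a whole year could be missing, one possible alternative (a sketch, not part of the answer above) is to skip complete() and map each year/quarter pair to a single integer, so that contiguous quarters differ by exactly 1. The helper name latest_four_quarters and the columns qidx and run_id are made up for this example, and value uses the numbers from the table printed in the question rather than a fresh sample().

```r
library(dplyr)

# Question's data, with the values from the printed table instead of sample()
quarterlydf <- data.frame(
  quarter = c("Q4", "Q3", "Q2", "Q1", "Q3", "Q2", "Q1", "Q4", "Q2", "Q1",
              "Q4", "Q3", "Q2", "Q1", "Q4", "Q3", "Q1"),
  year = c("2020", "2020", "2020", "2020", "2019", "2019", "2019", "2020",
           "2020", "2020", "2019", "2019", "2019", "2019", "2020", "2020", "2020"),
  country = c(rep("Brazil", 14), rep("France", 3)),
  indicator = c(rep("Testing", 7), rep("TestingPos", 7), rep("Testing", 3)),
  value = c(9, 3, 2, 7, 1, 5, 6, 4, 4, 3, 7, 2, 8, 1, 1, 1, 8)
)

# Hypothetical helper: most recent run of four contiguous quarters per group
latest_four_quarters <- function(df) {
  df %>%
    mutate(
      # one integer per quarter, e.g. Q4 2019 -> 8080 and Q1 2020 -> 8081
      qidx = as.integer(as.character(year)) * 4L +
        as.integer(sub("Q", "", as.character(quarter)))
    ) %>%
    group_by(country, indicator) %>%
    arrange(qidx, .by_group = TRUE) %>%
    # start a new run whenever the step from the previous row is not exactly 1
    mutate(run_id = cumsum(c(TRUE, diff(qidx) != 1L))) %>%
    # drop runs shorter than four quarters
    group_by(country, indicator, run_id) %>%
    filter(n() >= 4) %>%
    # rows are still chronological, so the last four are the most recent
    group_by(country, indicator) %>%
    slice_tail(n = 4) %>%
    summarise(
      start = paste0(first(quarter), "_", first(year)),
      end = paste0(last(quarter), "_", last(year)),
      sum = sum(value),
      mean = mean(value),
      .groups = "drop"
    )
}

annualdf <- latest_four_quarters(quarterlydf)
annualdf
```

On the table shown in the question this gives Q1_2020 to Q4_2020 for Brazil/Testing (sum 21, mean 5.25) and Q3_2019 to Q2_2020 for Brazil/TestingPos (sum 16, mean 4). France drops out because it never has four quarters in a row.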
