I have a data manipulation and exclusion challenge that I can't figure out how to approach. My data is in tidy format, with one observation per row. Here is a reprex for my dataset:
quarter <- c("Q4", "Q3", "Q2","Q1", "Q3", "Q2", "Q1","Q4", "Q2", "Q1", "Q4", "Q3", "Q2", "Q1","Q4", "Q3", "Q1")
year <- c("2020", "2020","2020","2020","2019","2019","2019", "2020", "2020","2020","2019","2019","2019","2019", "2020", "2020","2020")
country <- c("Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil","Brazil", "Brazil","Brazil","Brazil","Brazil","France","France","France")
indicator <- c("Testing","Testing", "Testing","Testing","Testing","Testing","Testing","TestingPos","TestingPos","TestingPos","TestingPos","TestingPos","TestingPos","TestingPos", "Testing","Testing","Testing")
value <- sample(1:10, 17, replace = TRUE)  # random draw; the frame printed below is from one run
quarterlydf <- data.frame(quarter, year, country, indicator, value)
   quarter year country  indicator value
1       Q4 2020  Brazil    Testing     9
2       Q3 2020  Brazil    Testing     3
3       Q2 2020  Brazil    Testing     2
4       Q1 2020  Brazil    Testing     7
5       Q3 2019  Brazil    Testing     1
6       Q2 2019  Brazil    Testing     5
7       Q1 2019  Brazil    Testing     6
8       Q4 2020  Brazil TestingPos     4
9       Q2 2020  Brazil TestingPos     4
10      Q1 2020  Brazil TestingPos     3
11      Q4 2019  Brazil TestingPos     7
12      Q3 2019  Brazil TestingPos     2
13      Q2 2019  Brazil TestingPos     8
14      Q1 2019  Brazil TestingPos     1
15      Q4 2020  France    Testing     1
16      Q3 2020  France    Testing     1
17      Q1 2020  France    Testing     8
For each country and indicator combination, I need to find the most recent period of 4 contiguous quarters. For that most recent set of four contiguous quarters (e.g. Q3 2019, Q4 2019, Q1 2020, Q2 2020), I need to create a new row in a new dataframe (annualdf here) with the country, the start and end quarter/year, the indicator, and the sum and mean of the values for the included quarters.
All other sets of contiguous quarters should be discarded, and any country/indicator combination without a contiguous set of four quarters should be discarded as well.
The product should look like this for the preceding frame:
    start     end country  indicator sum mean
1 Q1_2020 Q4_2020  Brazil    Testing  21 5.25
2 Q3_2019 Q2_2020  Brazil TestingPos  16 4.00
I won't go into all I've tried, but it's gotten very very ugly, involving trying to reassign sequential ids to each possible quarter/year combination, then use pivot_wider() to create multiple columns for each id, concatenate those columns into a single result, then use a grotesque set of str_detect() searches to search and assign values. Long story short, I think the entire approach I'm trying is very bad and incredibly inelegant.
There HAS to be an elegant way to do this.
Any suggestions would be very, very much appreciated. Thank you.
EDIT1: Per Limey there was a minor typo in the desired output (Q2_2019 was supposed to be Q2_2020). This has been fixed.
The syntax is a bit long (I will try to shorten it), but this works with dplyr and tidyr. The only assumption is that no year is completely missing for a country/indicator combination; otherwise the year column would also need to be filled in with complete(). Under that assumption, the following works:
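library(dplyr)
library(tidyr)

quarterlydf %>%
  # desc() takes a single vector, so each column gets its own call
  arrange(desc(year), desc(quarter)) %>%
  group_by(country, indicator, year) %>%
  # add NA rows for any quarter missing within a year
  complete(quarter = rev(c("Q1", "Q2", "Q3", "Q4"))) %>%
  group_by(country, indicator) %>%
  # newest quarter first within each country/indicator
  arrange(desc(year), desc(quarter), .by_group = TRUE) %>%
  # keep non-missing rows that sit in a run of at least 4 consecutive quarters
  filter(with(rle(is.na(value)), rep(lengths, lengths)) >= 4, !is.na(value)) %>%
  # the first 4 remaining rows form the most recent contiguous set
  slice_head(n = 4) %>%
  summarise(start = paste0(last(year), "_", last(quarter)),
            end = paste0(first(year), "_", first(quarter)),
            sum = sum(value),
            mean = mean(value))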
# A tibble: 2 x 6
# Groups:   country [1]
  country indicator  start   end       sum  mean
  <chr>   <chr>      <chr>   <chr>   <int> <dbl>
1 Brazil  Testing    2020_Q1 2020_Q4    18   4.5
2 Brazil  TestingPos 2019_Q3 2020_Q2    16   4
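The filter step relies on rle() to measure how long each run of missing or non-missing quarters is. The following is a minimal sketch of that idea on a toy vector; the vector is_missing and the names runs, run_len and keep are made up for illustration and are not part of the answer's code.

```r
# Hypothetical toy input: TRUE marks a quarter with no value
is_missing <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)

# rle() collapses the vector into runs of identical values
runs <- rle(is_missing)

# Repeat each run length once per element of its run, so every
# position knows the length of the run it belongs to
run_len <- rep(runs$lengths, runs$lengths)

# Keep positions that are not missing and sit in a run of 4 or more
keep <- run_len >= 4 & !is_missing
which(keep)
# positions 3, 4, 5 and 6 are kept
```

Inside filter() the same expression is evaluated per group, which is why the data has to be grouped by country and indicator and sorted before this step.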
The same can be done in chronological order, taking the last four rows of each group instead:
quarterlydf %>%
  arrange(year, quarter) %>%
  group_by(country, indicator, year) %>%
  # add NA rows for any quarter missing within a year
  complete(quarter = c("Q1", "Q2", "Q3", "Q4")) %>%
  group_by(country, indicator) %>%
  # keep non-missing rows that sit in a run of at least 4 consecutive quarters
  filter(with(rle(is.na(value)), rep(lengths, lengths)) >= 4, !is.na(value)) %>%
  # the last 4 remaining rows form the most recent contiguous set
  slice_tail(n = 4) %>%
  summarise(start = paste0(first(year), "_", first(quarter)),
            end = paste0(last(year), "_", last(quarter)),
            sum = sum(value),
            mean = mean(value))
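If the assumption that no year is completely missing does not hold, one possible alternative is to convert each quarter to a running integer and detect gaps with diff(), which avoids complete() altogether. This is only a sketch under stated assumptions: the helper latest_four_quarters() and the columns qidx and run are hypothetical names, not part of the answer above; the values are copied from the frame printed in the question rather than drawn with sample(); and each country/indicator/quarter is assumed to appear at most once.

```r
library(dplyr)

# Fixed copy of the question's printed data (no random sampling)
quarterlydf <- data.frame(
  quarter = c("Q4", "Q3", "Q2", "Q1", "Q3", "Q2", "Q1", "Q4", "Q2", "Q1",
              "Q4", "Q3", "Q2", "Q1", "Q4", "Q3", "Q1"),
  year = c("2020", "2020", "2020", "2020", "2019", "2019", "2019", "2020",
           "2020", "2020", "2019", "2019", "2019", "2019", "2020", "2020",
           "2020"),
  country = c(rep("Brazil", 14), rep("France", 3)),
  indicator = c(rep("Testing", 7), rep("TestingPos", 7), rep("Testing", 3)),
  value = c(9, 3, 2, 7, 1, 5, 6, 4, 4, 3, 7, 2, 8, 1, 1, 1, 8)
)

# Hypothetical helper: most recent run of four consecutive quarters
latest_four_quarters <- function(df) {
  df %>%
    mutate(
      # running quarter number, e.g. Q1 2020 -> 2020 * 4 + 1
      qidx = as.integer(as.character(year)) * 4L +
        as.integer(sub("Q", "", quarter))
    ) %>%
    group_by(country, indicator) %>%
    arrange(qidx, .by_group = TRUE) %>%
    # a new run starts whenever the step from the previous quarter is not 1
    mutate(run = cumsum(c(TRUE, diff(qidx) != 1L))) %>%
    group_by(country, indicator, run) %>%
    # drop runs shorter than four quarters
    filter(n() >= 4) %>%
    group_by(country, indicator) %>%
    # rows are still chronological, so the last 4 are the newest full run
    slice_tail(n = 4) %>%
    summarise(
      start = paste0(first(quarter), "_", first(year)),
      end = paste0(last(quarter), "_", last(year)),
      sum = sum(value),
      mean = mean(value),
      .groups = "drop"
    )
}

annualdf <- latest_four_quarters(quarterlydf)
annualdf
```

With the question's values this gives the two Brazil rows from the desired output (sums 21 and 16) and no row for France, since France never has four consecutive quarters. Because the gap check works on the running quarter number, a year that is entirely absent simply shows up as a step larger than 1.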