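How do I aggregate messy quarterly data in R using Tidyverse, searching for first contiguous set of four quarters
I have a data manipulation and exclusion challenge that I just can't work out how to solve. My data is in tidy format, with every observation as a row. Here is a representation of my data set: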
quarter <- c("Q4", "Q3", "Q2", "Q1", "Q3", "Q2", "Q1", "Q4", "Q2", "Q1", "Q4", "Q3", "Q2", "Q1", "Q4", "Q3", "Q1")
year <- c("2020", "2020", "2020", "2020", "2019", "2019", "2019", "2020", "2020", "2020", "2019", "2019", "2019", "2019", "2020", "2020", "2020")
country <- c("Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "France", "France", "France")
indicator <- c("Testing", "Testing", "Testing", "Testing", "Testing", "Testing", "Testing", "TestingPos", "TestingPos", "TestingPos", "TestingPos", "TestingPos", "TestingPos", "TestingPos", "Testing", "Testing", "Testing")
# no seed is set, so `value` differs from run to run
value <- sample(1:10, 17, replace = TRUE)
quarterlydf <- data.frame(quarter, year, country, indicator, value)
   quarter year country  indicator value
 1      Q4 2020  Brazil    Testing     9
 2      Q3 2020  Brazil    Testing     3
 3      Q2 2020  Brazil    Testing     2
 4      Q1 2020  Brazil    Testing     7
 5      Q3 2019  Brazil    Testing     1
 6      Q2 2019  Brazil    Testing     5
 7      Q1 2019  Brazil    Testing     6
 8      Q4 2020  Brazil TestingPos     4
 9      Q2 2020  Brazil TestingPos     4
10      Q1 2020  Brazil TestingPos     3
11      Q4 2019  Brazil TestingPos     7
12      Q3 2019  Brazil TestingPos     2
13      Q2 2019  Brazil TestingPos     8
14      Q1 2019  Brazil TestingPos     1
15      Q4 2020  France    Testing     1
16      Q3 2020  France    Testing     1
17      Q1 2020  France    Testing     8
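For each country and indicator combination, I need to find the most recent set of 4 contiguous quarters. For that most recent set of four contiguous quarters (e.g. Q3 2019, Q4 2019, Q1 2020, Q2 2020), I need to create a new row in a new dataframe (here, annual) containing the country, the start and end quarter/year, the indicator, and the sum and mean of the values of the included quarters.
All other sets of four contiguous quarters should be discarded, and any country/indicator combination with no contiguous set should be discarded as well.
For the data frame above, the result should look like this: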
    start     end country  indicator sum mean
1 Q1_2020 Q4_2020  Brazil    Testing  21 5.25
2 Q3_2019 Q2_2020  Brazil TestingPos  16 4.00
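I won't go into everything I've tried, but it got very, very ugly: it involved reassigning sequential IDs to every possible quarter/year combination, then using pivot_wider() to create multiple columns for each ID, concatenating those columns into one result, and then using a grotesque set of str_detect() searches to find and assign values. Long story short, I think the whole approach I was attempting is really bad and really inelegant.
There must be an elegant way to do this.
Any suggestions would be very much appreciated. Thank you.
EDIT1: Per Limey, there was a small typo in the desired output (Q2_2019 should have been Q2_2020). This has been fixed.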
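The syntax is a bit long (I'll try to shorten it), but this will work. The only assumption here is that no year is missing completely; otherwise the year field would also need to be filled in with complete(). As long as that holds, the following works: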
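library(dplyr)
library(tidyr)

quarterlydf %>%
  # desc() takes a single vector, so each sort key gets its own call
  arrange(desc(year), desc(quarter)) %>%
  # add an NA row for every quarter missing from a year that is present
  group_by(country, indicator, year) %>%
  complete(quarter = rev(c("Q1", "Q2", "Q3", "Q4"))) %>%
  group_by(country, indicator) %>%
  # most recent quarter first within each country/indicator
  arrange(desc(year), desc(quarter), .by_group = TRUE) %>%
  # keep non-missing rows that belong to a run of at least 4 consecutive rows
  filter(with(rle(is.na(value)), rep(lengths, lengths)) >= 4, !is.na(value)) %>%
  # the first 4 remaining rows are the most recent contiguous set
  slice_head(n = 4) %>%
  summarise(start = paste0(last(year), "_", last(quarter)),
            end = paste0(first(year), "_", first(quarter)),
            sum = sum(value),
            mean = mean(value))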
# (value is drawn with sample() and no seed, so sum and mean can differ from the question's table)
# A tibble: 2 x 6
# Groups:   country [1]
  country indicator  start   end       sum  mean
  <chr>   <chr>      <chr>   <chr>   <int> <dbl>
1 Brazil  Testing    2020_Q1 2020_Q4    18   4.5
2 Brazil  TestingPos 2019_Q3 2020_Q2    16   4
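The filter() step is the least obvious part of the pipeline, so here is a minimal base-R sketch of what the rle() idiom does on a single group. The vector below is only an illustration: it mirrors the Brazil/TestingPos values printed in the question, ordered most recent first, with NA standing in for the missing Q3 2020 row that complete() would add.

```r
# One group's `value` column after complete(), most recent quarter first.
# NA marks a quarter that was absent from the original data.
value <- c(4, NA, 4, 3, 7, 2, 8, 1)

# rle() encodes consecutive runs of TRUE/FALSE in is.na(value)
runs <- rle(is.na(value))
runs$lengths  # 1 1 6
runs$values   # FALSE TRUE FALSE

# Repeat each run length once per element, so every row knows the size of its run
run_len <- rep(runs$lengths, runs$lengths)
run_len       # 1 1 6 6 6 6 6 6

# Keep non-missing rows that sit in a run of at least four
keep <- run_len >= 4 & !is.na(value)
value[keep]   # 4 3 7 2 8 1
```

The lone leading 4 (Q4 2020) is dropped because its run has length 1, and slice_head(n = 4) then takes 4, 3, 7, 2 from what remains, which sums to 16.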
It can also be done the other way around (in chronological order):
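quarterlydf %>%
  arrange(year, quarter) %>%
  # add an NA row for every quarter missing from a year that is present
  group_by(country, indicator, year) %>%
  complete(quarter = c("Q1", "Q2", "Q3", "Q4")) %>%
  group_by(country, indicator) %>%
  # keep non-missing rows that belong to a run of at least 4 consecutive rows
  filter(with(rle(is.na(value)), rep(lengths, lengths)) >= 4, !is.na(value)) %>%
  # rows are in chronological order here, so the last 4 are the most recent set
  slice_tail(n = 4) %>%
  summarise(start = paste0(first(year), "_", first(quarter)),
            end = paste0(last(year), "_", last(quarter)),
            sum = sum(value),
            mean = mean(value))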
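Both versions above rely on the stated assumption that no year is missing entirely. If that cannot be guaranteed, one possible alternative (a sketch, not part of the original answer) is to turn each year/quarter pair into a single integer index and look for runs where the index changes by exactly one. The data frame df, the columns idx and run, and the result name annual are all hypothetical names chosen for this illustration. In df, 2019 is absent altogether, so the complete()-based approach would treat 2020 Q1 and 2018 Q4 as neighbours.

```r
library(dplyr)

# Hypothetical data, not from the question: the year 2019 is missing entirely
df <- data.frame(
  quarter   = c("Q2", "Q1", "Q4", "Q3", "Q2", "Q1", "Q4"),
  year      = c("2020", "2020", "2018", "2018", "2018", "2018", "2017"),
  country   = "Brazil",
  indicator = "Testing",
  value     = c(1, 2, 3, 4, 5, 6, 7)
)

annual <- df %>%
  # one integer per quarter; consecutive quarters differ by exactly 1
  mutate(idx = as.integer(year) * 4L + as.integer(sub("Q", "", quarter))) %>%
  group_by(country, indicator) %>%
  arrange(desc(idx), .by_group = TRUE) %>%
  # start a new run whenever the step to the previous row is not -1
  mutate(run = cumsum(c(TRUE, diff(idx) != -1L))) %>%
  # drop runs shorter than four quarters
  group_by(country, indicator, run) %>%
  filter(n() >= 4) %>%
  # the lowest remaining run id is the most recent qualifying run
  group_by(country, indicator) %>%
  filter(run == min(run)) %>%
  slice_head(n = 4) %>%
  summarise(start = paste0(last(year), "_", last(quarter)),
            end = paste0(first(year), "_", first(quarter)),
            sum = sum(value),
            mean = mean(value),
            .groups = "drop")

annual
```

For this toy input, the two 2020 quarters form a run of length 2 and are discarded, so the result is the single window 2018_Q1 to 2018_Q4. Like the original answer, this takes the four most recent quarters of the most recent run that has at least four quarters.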