How do I recategorize values and aggregate rows of a dataset in R?
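I need to aggregate rows of a dataset to collapse age ranges. My dataset currently has age ranges in 5-year bands. I am trying to combine these age ranges into broader categories, summing several variables (Population, X1, X2, X3, and X4) while keeping the variable "Category", which is the same for every row within a given ID.
My dataset looks like this: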
ID Age.Range Population X1 X2 X3 X4 Category
1 05-09 years 10 1 0 0 1 a
1 10-14 years 20 0 0 1 0 a
1 30-34 years 10 0 0 1 0 a
1 40-44 years 15 2 0 0 1 a
2 05-09 years 15 1 1 0 2 b
2 25-29 years 10 0 0 0 0 b
3 10-14 years 15 0 1 2 0 a
3 15-19 years 10 1 0 0 1 a
3 20-24 years 15 0 0 1 3 a
3 30-34 years 20 0 0 1 0 a
3 35-39 years 10 0 1 0 0 a
I am trying to produce a new dataframe that combines the ages so that my new age ranges are 05-29 years, 30-39 years, and 40-49 years, so that it looks like this:
ID Age.Range Population X1 X2 X3 X4 Category
1 05-29 years 30 1 0 1 1 a
1 30-39 years 10 0 0 1 0 a
1 40-49 years 15 2 0 0 1 a
2 05-29 years 25 1 1 0 2 b
3 05-29 years 40 1 1 3 4 a
3 30-39 years 30 0 1 1 0 a
I have tried doing this with dplyr but had no success. Any help would be greatly appreciated!
This should work:
library(dplyr)
library(stringr)

your_data %>%
  mutate(
    # lower bound of each 5-year band, e.g. "05-09 years" -> 5
    First.Age.In.Range = as.numeric(str_extract(Age.Range, "^[0-9]+")),
    New.Age.Range = case_when(
      First.Age.In.Range < 30 ~ "05-29 years",
      First.Age.In.Range < 40 ~ "30-39 years",
      First.Age.In.Range < 50 ~ "40-49 years",
      First.Age.In.Range < 60 ~ "50-59 years",
      ## not sure how high you need to go
      ## catch-all for the last category
      TRUE ~ "90-99 years"
    )
  ) %>%
  # Population is summed, so it must not be a grouping variable
  group_by(ID, New.Age.Range, Category) %>%
  summarize(across(c(Population, starts_with("X")), sum))
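If the chain of `case_when()` conditions gets long, base R's `cut()` can do the same binning from a vector of break points. The following is only a sketch of that alternative, not part of the answer above. It uses the first six rows of the question's data, and the names `sample_data` and `result` are made up for illustration. Ages outside the listed breaks would become `NA`, so the breaks have to be extended to cover the full data.

```r
library(dplyr)
library(stringr)

# First six rows of the question's data (IDs 1 and 2)
sample_data <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2),
  Age.Range = c("05-09 years", "10-14 years", "30-34 years",
                "40-44 years", "05-09 years", "25-29 years"),
  Population = c(10L, 20L, 10L, 15L, 15L, 10L),
  X1 = c(1L, 0L, 0L, 2L, 1L, 0L),
  X2 = c(0L, 0L, 0L, 0L, 1L, 0L),
  X3 = c(0L, 1L, 1L, 0L, 0L, 0L),
  X4 = c(1L, 0L, 0L, 1L, 2L, 0L),
  Category = c("a", "a", "a", "a", "b", "b")
)

result <- sample_data %>%
  mutate(
    First.Age.In.Range = as.numeric(str_extract(Age.Range, "^[0-9]+")),
    # right = FALSE gives left-closed bins: [5, 30), [30, 40), [40, 50)
    New.Age.Range = cut(
      First.Age.In.Range,
      breaks = c(5, 30, 40, 50),
      labels = c("05-29 years", "30-39 years", "40-49 years"),
      right = FALSE
    )
  ) %>%
  group_by(ID, New.Age.Range, Category) %>%
  summarize(across(c(Population, starts_with("X")), sum), .groups = "drop")

result
```

Note that `cut()` returns a factor, so `New.Age.Range` is a factor here rather than a character column; wrap it in `as.character()` if that matters downstream.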
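Here is a solution using the tidyr, stringr, and dplyr packages. It is similar to what Gregor Thomas provided. It also gives others a reproducible example to work with while we wait for the question to be edited.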
df <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3), Age.Range = c("05-09 years",
"10-14 years", "30-34 years", "40-44 years", "05-09 years", "25-29 years",
"10-14 years", "15-19 years", "20-24 years", "30-34 years", "35-39 years"
), Population = c(10L, 20L, 10L, 15L, 15L, 10L, 15L, 10L, 15L,
20L, 10L), X1 = c(1L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L),
X2 = c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L), X3 = c(0L,
1L, 1L, 0L, 0L, 0L, 2L, 0L, 1L, 1L, 0L), X4 = c(1L, 0L, 0L,
1L, 2L, 0L, 0L, 1L, 3L, 0L, 0L), Category = c("a", "a", "a",
"a", "b", "b", "a", "a", "a", "a", "a")), class = "data.frame", row.names = c(NA,
-11L))
library(stringr)
library(dplyr)
library(tidyr)
df %>%
  separate(col = Age.Range, into = c("Age_1", "Age_2"), sep = "-") %>%
  # You will have to add ifelse statements if you have ages that are >49 in your dataset.
  mutate(
    # upper bound of each band as a number, so the comparisons below are numeric, not text
    Age_2 = as.numeric(str_remove(Age_2, " years")),
    Age_1 = ifelse(Age_2 <= 29, "05-29 years", Age_1),
    Age_1 = ifelse(Age_2 > 29 & Age_2 <= 39, "30-39 years", Age_1),
    Age_1 = ifelse(Age_2 > 39 & Age_2 <= 49, "40-49 years", Age_1)
  ) %>%
  rename(Age.Range = Age_1) %>%
  group_by(ID, Category, Age.Range) %>%
  summarise(across(.cols = Population:X4, sum)) %>%
  select(ID, Age.Range, Population, X1, X2, X3, X4, Category)
#> # A tibble: 6 x 8
#> # Groups: ID, Category [3]
#> ID Age.Range Population X1 X2 X3 X4 Category
#> <dbl> <chr> <int> <int> <int> <int> <int> <chr>
#> 1 1 05-29 years 30 1 0 1 1 a
#> 2 1 30-39 years 10 0 0 1 0 a
#> 3 1 40-49 years 15 2 0 0 1 a
#> 4 2 05-29 years 25 1 1 0 2 b
#> 5 3 05-29 years 40 1 1 3 4 a
#> 6 3 30-39 years 30 0 1 1 0 a
Created on 2020-11-15 by the reprex package (v0.3.0)
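One detail worth keeping in mind with the `ifelse()` conditions above: `separate()` and `str_remove()` produce character values, and when R compares a character value with a number it converts the number to text and compares the two as strings. With zero-padded two-digit ages such as "09" or "34" the result happens to match the numeric one, but it would not for unpadded or three-digit values. A small sketch with made-up values (`upper`, `as_text`, and `as_number` are illustrative names only):

```r
# Hypothetical upper bounds, stored as text
upper <- c("09", "9", "100")

# Character vs. number: 29 is coerced to "29" and the comparison is textual
as_text <- upper <= 29

# Converting first gives the intended numeric comparison
as_number <- as.numeric(upper) <= 29

as_text    # TRUE FALSE TRUE
as_number  # TRUE TRUE FALSE
```

Converting the column with `as.numeric()` before comparing avoids relying on the padding.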