How do I recategorize values and aggregate rows of a dataset in R?
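I need to aggregate rows of a dataset to collapse age ranges. My dataset currently has age ranges in 5-year increments. I am trying to combine these age ranges into broader categories while summing several variables (Population, X1, X2, X3, and X4) and keeping the variable "Category", which is the same for every row within a given ID.
My dataset looks like this: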
ID Age.Range Population X1 X2 X3 X4 Category
1 05-09 years 10 1 0 0 1 a
1 10-14 years 20 0 0 1 0 a
1 30-34 years 10 0 0 1 0 a
1 40-44 years 15 2 0 0 1 a
2 05-09 years 15 1 1 0 2 b
2 25-29 years 10 0 0 0 0 b
3 10-14 years 15 0 1 2 0 a
3 15-19 years 10 1 0 0 1 a
3 20-24 years 15 0 0 1 3 a
3 30-34 years 20 0 0 1 0 a
3 35-39 years 10 0 1 0 0 a
I am trying to produce a new dataframe that combines the ages so that my new age ranges are 05-29 years, 30-39 years, and 40-49 years, so it would look like this:
ID Age.Range Population X1 X2 X3 X4 Category
1 05-29 years 30 1 0 1 1 a
1 30-39 years 10 0 0 1 0 a
1 40-49 years 15 2 0 0 1 a
2 05-29 years 25 1 1 0 2 b
3 05-29 years 40 1 1 3 4 a
3 30-39 years 30 0 1 1 0 a
I have tried doing this with dplyr but with no success. Any help would be appreciated!
This should work:
library(dplyr)
library(stringr)

your_data %>%
  mutate(
    # lower bound of each 5-year range, e.g. "10-14 years" -> 10
    First.Age.In.Range = as.numeric(str_extract(Age.Range, "^[0-9]+")),
    New.Age.Range = case_when(
      First.Age.In.Range < 30 ~ "05-29 years",
      First.Age.In.Range < 40 ~ "30-39 years",
      First.Age.In.Range < 50 ~ "40-49 years",
      First.Age.In.Range < 60 ~ "50-59 years",
      ## not sure how high you need to go
      ## catch-all for the last category
      TRUE ~ "90-99 years"
    )
  ) %>%
  # group by the identifiers only; Population is summed, not grouped on
  group_by(ID, New.Age.Range, Category) %>%
  summarize(across(c(Population, starts_with("X")), sum))
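If the list of ranges grows, the chain of conditions can get long. As an alternative to `case_when()` (not what the answer above uses), base R's `cut()` can do the binning from a vector of break points. This is a minimal sketch, assuming the labels always start with the lower bound of the range; the names `labels`, `first_age`, and `new_range` are only for illustration.

```r
# Example labels taken from the question's data
labels <- c("05-09 years", "25-29 years", "30-34 years", "40-44 years")

# Lower bound of each 5-year label, e.g. "25-29 years" -> 25
first_age <- as.numeric(sub("^([0-9]+).*$", "\\1", labels))

# With right = FALSE the intervals are [5, 30), [30, 40), [40, 50);
# values outside the breaks become NA instead of a catch-all label
new_range <- as.character(cut(
  first_age,
  breaks = c(5, 30, 40, 50),
  labels = c("05-29 years", "30-39 years", "40-49 years"),
  right = FALSE
))

print(new_range)
```

Extending the scheme then only means adding one break point and one label, at the cost of getting `NA` rather than a default category for ages outside the breaks.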
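Here is a solution using the tidyr, stringr, and dplyr packages. It is similar to what Gregor Thomas provided. While we wait for the edit to be added, it also gives others a chance to work with a reproducible example.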
df <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3), Age.Range = c("05-09 years",
"10-14 years", "30-34 years", "40-44 years", "05-09 years", "25-29 years",
"10-14 years", "15-19 years", "20-24 years", "30-34 years", "35-39 years"
), Population = c(10L, 20L, 10L, 15L, 15L, 10L, 15L, 10L, 15L,
20L, 10L), X1 = c(1L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L),
X2 = c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L), X3 = c(0L,
1L, 1L, 0L, 0L, 0L, 2L, 0L, 1L, 1L, 0L), X4 = c(1L, 0L, 0L,
1L, 2L, 0L, 0L, 1L, 3L, 0L, 0L), Category = c("a", "a", "a",
"a", "b", "b", "a", "a", "a", "a", "a")), class = "data.frame", row.names = c(NA,
-11L))
library(stringr)
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>%
  # split "05-09 years" into Age_1 = "05" and Age_2 = "09 years"
  separate(col = Age.Range, into = c("Age_1", "Age_2"), sep = "-") %>%
  # You will have to add ifelse statements if you have ages that are >49 in your dataset.
  mutate(
    # upper bound as a number, so the comparisons below are numeric rather than text
    Age_2 = as.numeric(str_remove(Age_2, " years")),
    Age_1 = ifelse(Age_2 <= 29, "05-29 years", Age_1),
    Age_1 = ifelse(Age_2 > 29 & Age_2 <= 39, "30-39 years", Age_1),
    Age_1 = ifelse(Age_2 > 39 & Age_2 <= 49, "40-49 years", Age_1)
  ) %>%
  rename(Age.Range = Age_1) %>%
  group_by(ID, Category, Age.Range) %>%
  summarise(across(
    .cols = Population:X4, sum
  )) %>%
  select(ID, Age.Range, Population, X1, X2, X3, X4, Category)
#> # A tibble: 6 x 8
#> # Groups: ID, Category [3]
#> ID Age.Range Population X1 X2 X3 X4 Category
#> <dbl> <chr> <int> <int> <int> <int> <int> <chr>
#> 1 1 05-29 years 30 1 0 1 1 a
#> 2 1 30-39 years 10 0 0 1 0 a
#> 3 1 40-49 years 15 2 0 0 1 a
#> 4 2 05-29 years 25 1 1 0 2 b
#> 5 3 05-29 years 40 1 1 3 4 a
#> 6 3 30-39 years 30 0 1 1 0 a
Created on 2020-11-15 by the reprex package (v0.3.0)
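A note on the comparisons in a pipeline like the one above: `separate()` returns the age bounds as character columns, and when R compares a string with a number it converts the number to text and compares the two as strings. The sketch below (variable names are only for illustration) shows why converting with `as.numeric()` before comparing is the safer choice.

```r
# Zero-padded two-digit text happens to sort like the numbers it represents
text_small <- "09" <= 29

# Text comparison goes character by character, so "100" sorts before "29"
text_large <- "100" <= 29

# After conversion the comparison is numeric and behaves as expected
num_large <- as.numeric("100") <= 29

print(c(text_small = text_small, text_large = text_large, num_large = num_large))
```

With the two-digit, zero-padded ages in this dataset the text comparison gives the intended grouping, but it would silently misclassify a three-digit age or a label without the leading zero.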