How do I recategorize values and aggregate rows of a dataset in R?

I need to aggregate rows of a dataset to collapse age ranges. My dataset currently has 5-year age ranges. I'm trying to combine these into broader categories, summing some of the variables (Population, X1, X2, X3, and X4) and keeping the variable "Category", which is the same for every row within a given ID.

My dataset looks like this:

ID    Age.Range    Population   X1   X2   X3   X4   Category
1     05-09 years  10           1    0    0    1    a
1     10-14 years  20           0    0    1    0    a
1     30-34 years  10           0    0    1    0    a
1     40-44 years  15           2    0    0    1    a
2     05-09 years  15           1    1    0    2    b
2     25-29 years  10           0    0    0    0    b
3     10-14 years  15           0    1    2    0    a
3     15-19 years  10           1    0    0    1    a
3     20-24 years  15           0    0    1    3    a
3     30-34 years  20           0    0    1    0    a
3     35-39 years  10           0    1    0    0    a

I'm trying to produce a new data frame that combines ages so that my new age ranges are 05-29 years, 30-39 years, and 40-49 years. It would look like this:

ID    Age.Range    Population   X1   X2   X3   X4   Category
1     05-29 years  30           1    0    1    1    a
1     30-39 years  10           0    0    1    0    a
1     40-49 years  15           2    0    0    1    a
2     05-29 years  25           1    1    0    2    b
3     05-29 years  40           1    1    3    4    a
3     30-39 years  30           0    1    1    0    a

I've tried doing this with dplyr to no success. Any help would be appreciated!

This should work:

library(dplyr)
library(stringr)

your_data %>%
  mutate(
    # lower bound of each range, e.g. "05-09 years" -> 5
    First.Age.In.Range = as.numeric(str_extract(Age.Range, "^[0-9]+")),
    New.Age.Range = case_when(
      First.Age.In.Range < 30 ~ "05-29 years",
      First.Age.In.Range < 40 ~ "30-39 years",
      First.Age.In.Range < 50 ~ "40-49 years",
      First.Age.In.Range < 60 ~ "50-59 years",
      ## not sure how high you need to go
      ## catch-all for the last category
      TRUE ~ "90-99 years"
    )
  ) %>%
  # group only by the keys; Population is summed, not grouped on
  group_by(ID, New.Age.Range, Category) %>%
  summarize(across(c(Population, starts_with("X")), sum), .groups = "drop")
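
For comparison, the same recode-then-sum idea can be sketched in base R with `cut()` and `aggregate()`. This is only an illustration, not part of the answer above: it assumes the lower bound of every range falls between 5 and 49 (anything else becomes `NA`), and the names `dat`, `lower`, and `out` are made up for the example.

```r
# Data from the question, rebuilt so the sketch is self-contained
dat <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Age.Range = c("05-09 years", "10-14 years", "30-34 years", "40-44 years",
                "05-09 years", "25-29 years", "10-14 years", "15-19 years",
                "20-24 years", "30-34 years", "35-39 years"),
  Population = c(10, 20, 10, 15, 15, 10, 15, 10, 15, 20, 10),
  X1 = c(1, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0),
  X2 = c(0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1),
  X3 = c(0, 1, 1, 0, 0, 0, 2, 0, 1, 1, 0),
  X4 = c(1, 0, 0, 1, 2, 0, 0, 1, 3, 0, 0),
  Category = c("a", "a", "a", "a", "b", "b", "a", "a", "a", "a", "a"),
  stringsAsFactors = FALSE
)

# Lower bound of each range: everything before the hyphen
lower <- as.numeric(sub("-.*$", "", dat$Age.Range))

# Left-closed bins [5, 30), [30, 40), [40, 50); values outside become NA
dat$Age.Range <- as.character(cut(
  lower,
  breaks = c(5, 30, 40, 50),
  labels = c("05-29 years", "30-39 years", "40-49 years"),
  right  = FALSE
))

# Sum the numeric columns within each ID / age band / Category
# (the formula interface drops rows with NA by default)
out <- aggregate(
  cbind(Population, X1, X2, X3, X4) ~ ID + Age.Range + Category,
  data = dat, FUN = sum
)

# Restore the column and row order used in the question
out <- out[order(out$ID, out$Age.Range),
           c("ID", "Age.Range", "Population", "X1", "X2", "X3", "X4", "Category")]
rownames(out) <- NULL
out
```

Adding a higher band here means adding one break and one label to `cut()` rather than another branch.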

Here is a solution using the tidyr, stringr, and dplyr packages. It is similar to what Gregor Thomas provided. It also gives others a reproducible example to work with while we wait for the edits to be added.

df <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3), Age.Range = c("05-09 years", 
"10-14 years", "30-34 years", "40-44 years", "05-09 years", "25-29 years", 
"10-14 years", "15-19 years", "20-24 years", "30-34 years", "35-39 years"
), Population = c(10L, 20L, 10L, 15L, 15L, 10L, 15L, 10L, 15L, 
20L, 10L), X1 = c(1L, 0L, 0L, 2L, 1L, 0L, 0L, 1L, 0L, 0L, 0L), 
    X2 = c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L), X3 = c(0L, 
    1L, 1L, 0L, 0L, 0L, 2L, 0L, 1L, 1L, 0L), X4 = c(1L, 0L, 0L, 
    1L, 2L, 0L, 0L, 1L, 3L, 0L, 0L), Category = c("a", "a", "a", 
    "a", "b", "b", "a", "a", "a", "a", "a")), class = "data.frame", row.names = c(NA, 
-11L))


library(stringr)
library(dplyr)
library(tidyr)

df %>% 
  # split "05-09 years" into Age_1 = "05" and Age_2 = "09 years"
  separate(col = Age.Range, into = c("Age_1", "Age_2"), sep = "-") %>%
  # You will have to add ifelse statements if you have ages that are >49 in your dataset. 
  mutate(
    # convert the upper bound to a number so the comparisons below are numeric
    Age_2 = as.numeric(str_remove(Age_2, " years")),
    Age_1 = ifelse(Age_2 <= 29, "05-29 years", Age_1),
    Age_1 = ifelse(Age_2 > 29 & Age_2 <= 39, "30-39 years", Age_1),
    Age_1 = ifelse(Age_2 > 39 & Age_2 <= 49, "40-49 years", Age_1)
  ) %>%
  rename(Age.Range = Age_1) %>% 
  group_by(ID, Category, Age.Range) %>% 
  # sum Population and X1..X4 within each ID / Category / new age range
  summarise(across(
    .cols = Population:X4, sum
  )) %>% 
  select(ID, Age.Range, Population, X1, X2, X3, X4, Category)


#> # A tibble: 6 x 8
#> # Groups:   ID, Category [3]
#>      ID Age.Range   Population    X1    X2    X3    X4 Category
#>   <dbl> <chr>            <int> <int> <int> <int> <int> <chr>   
#> 1     1 05-29 years         30     1     0     1     1 a       
#> 2     1 30-39 years         10     0     0     1     0 a       
#> 3     1 40-49 years         15     2     0     0     1 a       
#> 4     2 05-29 years         25     1     1     0     2 b       
#> 5     3 05-29 years         40     1     1     3     4 a       
#> 6     3 30-39 years         30     0     1     1     0 a

Created on 2020-11-15 by the reprex package (v0.3.0)
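
One detail worth keeping in mind with this approach: `separate()` returns the split pieces as character columns, and when R compares a character value with a number it coerces the number to a string and compares the two as text. The small sketch below (the vector `age_2_chr` is made up for illustration) shows why it is safer to convert the upper bound with `as.numeric()` before comparing, especially once three-digit ages can appear.

```r
# Upper bounds as text, the way separate() + str_remove() would leave them
age_2_chr <- c("09", "14", "29", "34", "100")

# Character vs. number: 29 is coerced to "29" and the comparison is textual,
# so "100" sorts before "29" and is wrongly counted as <= 29
age_2_chr <= 29
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE

# Numeric comparison after an explicit conversion
as.numeric(age_2_chr) <= 29
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
```

With the zero-padded two-digit values in the question's data, both comparisons happen to agree, which is why the printed result above is the same either way.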
