简体   繁体   中英

How to do a nested 2 level reshape in R?

I am currently working on a survey project that asks about respondents' economic activities. So we will ask the jobs/businesses they have, and since some jobs/businesses are seasonal, we ask about their revenue, cost, and profit in each season.

I am a bit stuck on how to reshape this in R from wide to long. This is what I have:

respondent_name job_name_1 seasonality_type_1 job_name_2 seasonality_type_2 cst_1_1 rev_1_1 cst_1_2 rev_1_2 cst_2_1 rev_2_1 cst_2_2 rev_2_2
James farmer high / low teacher active / inactive 5000 6000 2000 3000 100 200 0 0
Alice livestock rearing high / low barber active / inactive 100000 200000 20000 30000 5000 7000 2000 0

I want the output to look like:

respondent_name job_number job_name season_number season cost revenue
James 1 farmer 1 high 5000 6000
James 1 farmer 2 low 2000 3000
James 2 teacher 1 active 100 200
James 2 teacher 2 inactive 0 0
Alice 1 livestock rearing 1 high 100000 200000
Alice 1 livestock rearing 2 low 2000 3000
Alice 2 barber 1 active 5000 7000
Alice 2 barber 2 inactive 2000 0

Anyone has idea how to do this? I used the melt function but can't figure out how to not lose the 2 level information. What melt gives me is just job_season_1, job_season_2, job_season_3, job_season_4, which lose the pair wise info between job and season.

Thanks so much!

Here is the toy dataset in the table above:

survey = data.frame(respondent_name = c("James", "Alice"),
                    job_name_1 = c("farmer", "livestock rearing"),
                    seasonality_type_1 = c("high / low", "high / low"), 
                    job_name_2 = c("teacher", "barber"),
                    seasonality_type_2 = c("active / inactive", "active / inactive"),
                    cst_1_1 = c(5000, 100000),
                    rev_1_1 = c(6000, 200000),
                    cst_1_2 = c(2000, 20000),
                    rev_1_2 = c(3000, 30000),
                    cst_2_1 = c(100, 5000),
                    rev_2_1 = c(200, 7000),
                    cst_2_2 = c(0, 2000),
                    rev_2_2 = c(0, 0)
                    )

first split the seasonality type1 and type2 into 2 columns and then melt , for melt i use the column numbers because your column names are too long

library(data.table)
survey[, c("s1", "s2") := tstrsplit(seasonality_type_1, "/")]
survey[, c("s1_2", "s2_2") := tstrsplit(seasonality_type_2, "/")]
melt(survey, id.vars=1, measure.vars=list(c(2,2,4,4),c(14,15,16,17), c(6,8,10,12), c(7,9,11,13)))

This uses tidyverse and not data.table, but gets you most of the way there...

library(tidyverse)

survey2 <- survey %>%
  separate(seasonality_type_1, 
       sep = " / ", 
       into = c("season_1_1", "season_1_2")) %>%
  separate(seasonality_type_2,
       sep = " / ",
       into = c("season_2_1", "season_2_2")) %>%
  pivot_longer(cols = starts_with("job_name"),
           names_to = "job_number",
           names_prefix = "job_name_",
           values_to = "job"
          )

  survey2_1 <- survey2 %>%
    select(respondent_name, job_number, job, ends_with("_1")) %>%
    mutate(
      season = 1,
      cost = ifelse(job_number == 1, cst_1_1, cst_2_1)
    )

  survey2_2 <- survey2 %>%
    select(respondent_name, job_number, job, ends_with("_2")) %>%
    mutate(
      season = 2,
      cost = ifelse(job_number == 1, cst_1_2, cst_2_2)
    )

What's still missing:

  • Similar calculations for revenue as cost
  • Removing unneeded columns, ending with similar column names in each of survey2_1 and survey2_2
  • Joining (rbind) survey2_1 and survey2_2

The short answer

library(tidyverse)
seasonality_desc <- tribble(
  ~ job_number, ~ season_number, ~ season,
  1,            1,               "high",
  1,            2,               "low",
  2,            1,               "active",
  2,            2,               "inactive"
)
survey %>% 
  select(!contains("seasonality_type")) %>% 
  rename_with(
    ~ str_replace(.x, pattern = "^(cst|rev)_([0-9])_([0-9])$", 
                  replacement = "\\1_\\3_\\2"), 
    .cols = matches("^(cst|rev)_[0-9]_[0-9]$")
  ) %>% 
  pivot_longer(
    !respondent_name, names_to = c(".value", "job_number"), 
    names_pattern = "(.*)_([0-9]+)$", 
    names_transform = list(job_number = as.numeric)
  ) %>% 
  pivot_longer(
    matches("cst|rev"), names_to = c(".value", "season_number"),
    names_sep = "_", names_transform = list(season_number = as.numeric)
  ) %>% 
  left_join(seasonality_desc, by = c("job_number", "season_number")) %>% 
  select(
    respondent_name, job_number, job_name, season_number, season, cost = cst, 
    revenue = rev
  )
#> # A tibble: 8 × 7
#>   respondent_name job_number job_name       season_number season    cost revenue
#>   <chr>                <dbl> <chr>                  <dbl> <chr>    <dbl>   <dbl>
#> 1 James                    1 farmer                     1 high       5e3    6000
#> 2 James                    1 farmer                     2 low        2e3    3000
#> 3 James                    2 teacher                    1 active     1e2     200
#> 4 James                    2 teacher                    2 inactive   0         0
#> 5 Alice                    1 livestock rea…             1 high       1e5  200000
#> 6 Alice                    1 livestock rea…             2 low        2e4   30000
#> 7 Alice                    2 barber                     1 active     5e3    7000
#> 8 Alice                    2 barber                     2 inactive   2e3       0

Created on 2022-02-04 by the reprex package (v2.0.1)

The long answer

We'll use two instances of tidyr 's pivot_longer() to lengthen the dataset, but there's some other wrangling that's needed in between.

Remove metadata

First, seasonality_type_1 and seasonality_type_2 do not provide any individual-level information, but rather only metadata specifying the labels that correspond to the different season numbers in the context of the job_number . So, manually make a tibble containing these labels, and delete the metadata from the original dataset:

library(tidyverse)

seasonality_desc <- tribble(
  ~ job_number, ~ season_number, ~ season,
  1,            1,               "high",
  1,            2,               "low",
  2,            1,               "active",
  2,            2,               "inactive"
)

survey2 <-  select(survey, !contains("seasonality_type"))
knitr::kable(survey2)
respondent_name job_name_1 job_name_2 cst_1_1 rev_1_1 cst_1_2 rev_1_2 cst_2_1 rev_2_1 cst_2_2 rev_2_2
James farmer teacher 5e+03 6e+03 2000 3000 100 200 0 0
Alice livestock rearing barber 1e+05 2e+05 20000 30000 5000 7000 2000 0

First pivot: job number

The first pivot is by job name and number. Unfortunately, the job number appears in the middle of the cst / rev column names, and this doesn't play well with pivot_longer() . You can switch the job number with the season number manually, but if you're looking for an automated way of doing it, here is one way using regex:

survey3 <- rename_with(
  survey2,
  ~ str_replace(.x, pattern = "^(cst|rev)_([0-9])_([0-9])$", 
                replacement = "\\1_\\3_\\2"), 
  .cols = matches("^(cst|rev)_[0-9]_[0-9]$")
)
knitr::kable(survey3)
respondent_name job_name_1 job_name_2 cst_1_1 rev_1_1 cst_2_1 rev_2_1 cst_1_2 rev_1_2 cst_2_2 rev_2_2
James farmer teacher 5e+03 6e+03 2000 3000 100 200 0 0
Alice livestock rearing barber 1e+05 2e+05 20000 30000 5000 7000 2000 0

Now notice a pattern: all column names except respondent_name is of the form <VARIABLE>_<JOB NUMBER> -- exactly the specification we'll provide pivot_longer() (apologies, stack overflow is having a hard time rendering the markdown output for the table):

survey4 <- survey3 %>% 
  pivot_longer(
    !respondent_name, names_to = c(".value", "job_number"), 
    names_pattern = "(.*)_([0-9]+)$", 
    names_transform = list(job_number = as.numeric)
  )
knitr::kable(survey4)

|respondent_name | job_number|job_name          | cst_1| rev_1| cst_2| rev_2|
|:---------------|----------:|:-----------------|-----:|-----:|-----:|-----:|
|James           |          1|farmer            | 5e+03| 6e+03|  2000|  3000|
|James           |          2|teacher           | 1e+02| 2e+02|     0|     0|
|Alice           |          1|livestock rearing | 1e+05| 2e+05| 20000| 30000|
|Alice           |          2|barber            | 5e+03| 7e+03|  2000|     0|

Second pivot: season number

Now for the second pivot, notice that the cst and rev columns are of the form <VARIABLE>_<SEASON NUMBER> -- exactly the specification we'll provide pivot_longer() . This time, it's enough to just specify that an underscore is separating the names (we couldn't last time because there were more than one underscores present in the column names) (apologies again about the table):

survey5 <- survey4 %>%
  pivot_longer(
    matches("cst|rev"), names_to = c(".value", "season_number"),
    names_sep = "_", names_transform = list(season_number = as.numeric)
  )
knitr::kable(survey5)

|respondent_name | job_number|job_name          | season_number|   cst|   rev|
|:---------------|----------:|:-----------------|-------------:|-----:|-----:|
|James           |          1|farmer            |             1| 5e+03| 6e+03|
|James           |          1|farmer            |             2| 2e+03| 3e+03|
|James           |          2|teacher           |             1| 1e+02| 2e+02|
|James           |          2|teacher           |             2| 0e+00| 0e+00|
|Alice           |          1|livestock rearing |             1| 1e+05| 2e+05|
|Alice           |          1|livestock rearing |             2| 2e+04| 3e+04|
|Alice           |          2|barber            |             1| 5e+03| 7e+03|
|Alice           |          2|barber            |             2| 2e+03| 0e+00|

Add in labels and clean

Now we can add in the labels stored in the metadata seasonality_desc we made earlier, reorder the columns as desired, and rename "cst" to "cost" and "rev" to "revenue":

survey6 <- survey5 %>% 
  left_join(seasonality_desc, by = c("job_number", "season_number")) %>% 
  select(
    respondent_name, job_number, job_name, season_number, season, cost = cst, 
    revenue = rev
  )
knitr::kable(survey6)
respondent_name job_number job_name season_number season cost revenue
James 1 farmer 1 high 5e+03 6e+03
James 1 farmer 2 low 2e+03 3e+03
James 2 teacher 1 active 1e+02 2e+02
James 2 teacher 2 inactive 0e+00 0e+00
Alice 1 livestock rearing 1 high 1e+05 2e+05
Alice 1 livestock rearing 2 low 2e+04 3e+04
Alice 2 barber 1 active 5e+03 7e+03
Alice 2 barber 2 inactive 2e+03 0e+00

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM