I am currently working on a survey project that asks about respondents' economic activities. So we will ask the jobs/businesses they have, and since some jobs/businesses are seasonal, we ask about their revenue, cost, and profit in each season.
I am a bit stuck on how to reshape this in R from wide to long. This is what I have:
respondent_name | job_name_1 | seasonality_type_1 | job_name_2 | seasonality_type_2 | cst_1_1 | rev_1_1 | cst_1_2 | rev_1_2 | cst_2_1 | rev_2_1 | cst_2_2 | rev_2_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
James | farmer | high / low | teacher | active / inactive | 5000 | 6000 | 2000 | 3000 | 100 | 200 | 0 | 0 |
Alice | livestock rearing | high / low | barber | active / inactive | 100000 | 200000 | 20000 | 30000 | 5000 | 7000 | 2000 | 0 |
I want the output to look like:
respondent_name | job_number | job_name | season_number | season | cost | revenue |
---|---|---|---|---|---|---|
James | 1 | farmer | 1 | high | 5000 | 6000 |
James | 1 | farmer | 2 | low | 2000 | 3000 |
James | 2 | teacher | 1 | active | 100 | 200 |
James | 2 | teacher | 2 | inactive | 0 | 0 |
Alice | 1 | livestock rearing | 1 | high | 100000 | 200000 |
Alice | 1 | livestock rearing | 2 | low | 2000 | 3000 |
Alice | 2 | barber | 1 | active | 5000 | 7000 |
Alice | 2 | barber | 2 | inactive | 2000 | 0 |
Anyone has idea how to do this? I used the melt function but can't figure out how to not lose the 2 level information. What melt gives me is just job_season_1, job_season_2, job_season_3, job_season_4, which lose the pair wise info between job and season.
Thanks so much!
Here is the toy dataset in the table above:
survey = data.frame(respondent_name = c("James", "Alice"),
job_name_1 = c("farmer", "livestock rearing"),
seasonality_type_1 = c("high / low", "high / low"),
job_name_2 = c("teacher", "barber"),
seasonality_type_2 = c("active / inactive", "active / inactive"),
cst_1_1 = c(5000, 100000),
rev_1_1 = c(6000, 200000),
cst_1_2 = c(2000, 20000),
rev_1_2 = c(3000, 30000),
cst_2_1 = c(100, 5000),
rev_2_1 = c(200, 7000),
cst_2_2 = c(0, 2000),
rev_2_2 = c(0, 0)
)
first split the seasonality type1 and type2 into 2 columns and then melt
, for melt i use the column numbers because your column names are too long
library(data.table)
survey[, c("s1", "s2") := tstrsplit(seasonality_type_1, "/")]
survey[, c("s1_2", "s2_2") := tstrsplit(seasonality_type_2, "/")]
melt(survey, id.vars=1, measure.vars=list(c(2,2,4,4),c(14,15,16,17), c(6,8,10,12), c(7,9,11,13)))
This uses tidyverse and not data.table, but gets you most of the way there...
library(tidyverse)
survey2 <- survey %>%
separate(seasonality_type_1,
sep = " / ",
into = c("season_1_1", "season_1_2")) %>%
separate(seasonality_type_2,
sep = " / ",
into = c("season_2_1", "season_2_2")) %>%
pivot_longer(cols = starts_with("job_name"),
names_to = "job_number",
names_prefix = "job_name_",
values_to = "job"
)
survey2_1 <- survey2 %>%
select(respondent_name, job_number, job, ends_with("_1")) %>%
mutate(
season = 1,
cost = ifelse(job_number == 1, cst_1_1, cst_2_1)
)
survey2_2 <- survey2 %>%
select(respondent_name, job_number, job, ends_with("_2")) %>%
mutate(
season = 2,
cost = ifelse(job_number == 1, cst_1_2, cst_2_2)
)
What's still missing:
library(tidyverse)
seasonality_desc <- tribble(
~ job_number, ~ season_number, ~ season,
1, 1, "high",
1, 2, "low",
2, 1, "active",
2, 2, "inactive"
)
survey %>%
select(!contains("seasonality_type")) %>%
rename_with(
~ str_replace(.x, pattern = "^(cst|rev)_([0-9])_([0-9])$",
replacement = "\\1_\\3_\\2"),
.cols = matches("^(cst|rev)_[0-9]_[0-9]$")
) %>%
pivot_longer(
!respondent_name, names_to = c(".value", "job_number"),
names_pattern = "(.*)_([0-9]+)$",
names_transform = list(job_number = as.numeric)
) %>%
pivot_longer(
matches("cst|rev"), names_to = c(".value", "season_number"),
names_sep = "_", names_transform = list(season_number = as.numeric)
) %>%
left_join(seasonality_desc, by = c("job_number", "season_number")) %>%
select(
respondent_name, job_number, job_name, season_number, season, cost = cst,
revenue = rev
)
#> # A tibble: 8 × 7
#> respondent_name job_number job_name season_number season cost revenue
#> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 James 1 farmer 1 high 5e3 6000
#> 2 James 1 farmer 2 low 2e3 3000
#> 3 James 2 teacher 1 active 1e2 200
#> 4 James 2 teacher 2 inactive 0 0
#> 5 Alice 1 livestock rea… 1 high 1e5 200000
#> 6 Alice 1 livestock rea… 2 low 2e4 30000
#> 7 Alice 2 barber 1 active 5e3 7000
#> 8 Alice 2 barber 2 inactive 2e3 0
Created on 2022-02-04 by the reprex package (v2.0.1)
We'll use two instances of tidyr 's pivot_longer()
to lengthen the dataset, but there's some other wrangling that's needed in between.
First, seasonality_type_1
and seasonality_type_2
do not provide any individual-level information, but rather only metadata specifying the labels that correspond to the different season numbers in the context of the job_number
. So, manually make a tibble containing these labels, and delete the metadata from the original dataset:
library(tidyverse)
seasonality_desc <- tribble(
~ job_number, ~ season_number, ~ season,
1, 1, "high",
1, 2, "low",
2, 1, "active",
2, 2, "inactive"
)
survey2 <- select(survey, !contains("seasonality_type"))
knitr::kable(survey2)
respondent_name | job_name_1 | job_name_2 | cst_1_1 | rev_1_1 | cst_1_2 | rev_1_2 | cst_2_1 | rev_2_1 | cst_2_2 | rev_2_2 |
---|---|---|---|---|---|---|---|---|---|---|
James | farmer | teacher | 5e+03 | 6e+03 | 2000 | 3000 | 100 | 200 | 0 | 0 |
Alice | livestock rearing | barber | 1e+05 | 2e+05 | 20000 | 30000 | 5000 | 7000 | 2000 | 0 |
The first pivot is by job name and number. Unfortunately, the job number appears in the middle of the cst
/ rev
column names, and this doesn't play well with pivot_longer()
. You can switch the job number with the season number manually, but if you're looking for an automated way of doing it, here is one way using regex:
survey3 <- rename_with(
survey2,
~ str_replace(.x, pattern = "^(cst|rev)_([0-9])_([0-9])$",
replacement = "\\1_\\3_\\2"),
.cols = matches("^(cst|rev)_[0-9]_[0-9]$")
)
knitr::kable(survey3)
respondent_name | job_name_1 | job_name_2 | cst_1_1 | rev_1_1 | cst_2_1 | rev_2_1 | cst_1_2 | rev_1_2 | cst_2_2 | rev_2_2 |
---|---|---|---|---|---|---|---|---|---|---|
James | farmer | teacher | 5e+03 | 6e+03 | 2000 | 3000 | 100 | 200 | 0 | 0 |
Alice | livestock rearing | barber | 1e+05 | 2e+05 | 20000 | 30000 | 5000 | 7000 | 2000 | 0 |
Now notice a pattern: all column names except respondent_name
is of the form <VARIABLE>_<JOB NUMBER>
-- exactly the specification we'll provide pivot_longer()
(apologies, stack overflow is having a hard time rendering the markdown output for the table):
survey4 <- survey3 %>%
pivot_longer(
!respondent_name, names_to = c(".value", "job_number"),
names_pattern = "(.*)_([0-9]+)$",
names_transform = list(job_number = as.numeric)
)
knitr::kable(survey4)
|respondent_name | job_number|job_name | cst_1| rev_1| cst_2| rev_2|
|:---------------|----------:|:-----------------|-----:|-----:|-----:|-----:|
|James | 1|farmer | 5e+03| 6e+03| 2000| 3000|
|James | 2|teacher | 1e+02| 2e+02| 0| 0|
|Alice | 1|livestock rearing | 1e+05| 2e+05| 20000| 30000|
|Alice | 2|barber | 5e+03| 7e+03| 2000| 0|
Now for the second pivot, notice that the cst
and rev
columns are of the form <VARIABLE>_<SEASON NUMBER>
-- exactly the specification we'll provide pivot_longer()
. This time, it's enough to just specify that an underscore is separating the names (we couldn't last time because there were more than one underscores present in the column names) (apologies again about the table):
survey5 <- survey4 %>%
pivot_longer(
matches("cst|rev"), names_to = c(".value", "season_number"),
names_sep = "_", names_transform = list(season_number = as.numeric)
)
knitr::kable(survey5)
|respondent_name | job_number|job_name | season_number| cst| rev|
|:---------------|----------:|:-----------------|-------------:|-----:|-----:|
|James | 1|farmer | 1| 5e+03| 6e+03|
|James | 1|farmer | 2| 2e+03| 3e+03|
|James | 2|teacher | 1| 1e+02| 2e+02|
|James | 2|teacher | 2| 0e+00| 0e+00|
|Alice | 1|livestock rearing | 1| 1e+05| 2e+05|
|Alice | 1|livestock rearing | 2| 2e+04| 3e+04|
|Alice | 2|barber | 1| 5e+03| 7e+03|
|Alice | 2|barber | 2| 2e+03| 0e+00|
Now we can add in the labels stored in the metadata seasonality_desc
we made earlier, reorder the columns as desired, and rename "cst" to "cost" and "rev" to "revenue":
survey6 <- survey5 %>%
left_join(seasonality_desc, by = c("job_number", "season_number")) %>%
select(
respondent_name, job_number, job_name, season_number, season, cost = cst,
revenue = rev
)
knitr::kable(survey6)
respondent_name | job_number | job_name | season_number | season | cost | revenue |
---|---|---|---|---|---|---|
James | 1 | farmer | 1 | high | 5e+03 | 6e+03 |
James | 1 | farmer | 2 | low | 2e+03 | 3e+03 |
James | 2 | teacher | 1 | active | 1e+02 | 2e+02 |
James | 2 | teacher | 2 | inactive | 0e+00 | 0e+00 |
Alice | 1 | livestock rearing | 1 | high | 1e+05 | 2e+05 |
Alice | 1 | livestock rearing | 2 | low | 2e+04 | 3e+04 |
Alice | 2 | barber | 1 | active | 5e+03 | 7e+03 |
Alice | 2 | barber | 2 | inactive | 2e+03 | 0e+00 |
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.