[英]How to do a nested 2 level reshape in R?
I am currently working on a survey project that asks about respondents' economic activities.我目前正在做一个调查项目,询问受访者的经济活动。 So we will ask the jobs/businesses they have, and since some jobs/businesses are seasonal, we ask about their revenue, cost, and profit in each season.
所以我们会询问他们拥有的工作/业务,并且由于一些工作/业务是季节性的,我们会询问他们每个季节的收入、成本和利润。
I am a bit stuck on how to reshape this in R from wide to long.我有点坚持如何在 R 中从宽到长重塑这个。 This is what I have:
这就是我所拥有的:
respondent_name![]() |
job_name_1![]() |
seasonality_type_1![]() |
job_name_2![]() |
seasonality_type_2![]() |
cst_1_1 ![]() |
rev_1_1 ![]() |
cst_1_2 ![]() |
rev_1_2 ![]() |
cst_2_1 ![]() |
rev_2_1 ![]() |
cst_2_2 ![]() |
rev_2_2 ![]() |
---|---|---|---|---|---|---|---|---|---|---|---|---|
James![]() |
farmer![]() |
high / low![]() |
teacher![]() |
active / inactive![]() |
5000 ![]() |
6000 ![]() |
2000 ![]() |
3000 ![]() |
100 ![]() |
200 ![]() |
0 ![]() |
0 ![]() |
Alice![]() |
livestock rearing![]() |
high / low![]() |
barber![]() |
active / inactive![]() |
100000 ![]() |
200000 ![]() |
20000 ![]() |
30000 ![]() |
5000 ![]() |
7000 ![]() |
2000 ![]() |
0 ![]() |
I want the output to look like:我希望 output 看起来像:
respondent_name![]() |
job_number![]() |
job_name![]() |
season_number![]() |
season![]() |
cost![]() |
revenue![]() |
---|---|---|---|---|---|---|
James![]() |
1 ![]() |
farmer![]() |
1 ![]() |
high![]() |
5000 ![]() |
6000 ![]() |
James![]() |
1 ![]() |
farmer![]() |
2 ![]() |
low![]() |
2000 ![]() |
3000 ![]() |
James![]() |
2 ![]() |
teacher![]() |
1 ![]() |
active![]() |
100 ![]() |
200 ![]() |
James![]() |
2 ![]() |
teacher![]() |
2 ![]() |
inactive![]() |
0 ![]() |
0 ![]() |
Alice![]() |
1 ![]() |
livestock rearing![]() |
1 ![]() |
high![]() |
100000 ![]() |
200000 ![]() |
Alice![]() |
1 ![]() |
livestock rearing![]() |
2 ![]() |
low![]() |
2000 ![]() |
3000 ![]() |
Alice![]() |
2 ![]() |
barber![]() |
1 ![]() |
active![]() |
5000 ![]() |
7000 ![]() |
Alice![]() |
2 ![]() |
barber![]() |
2 ![]() |
inactive![]() |
2000 ![]() |
0 ![]() |
Anyone has idea how to do this?任何人都知道如何做到这一点? I used the melt function but can't figure out how to not lose the 2 level information.
我使用了熔化 function 但不知道如何不丢失 2 级信息。 What melt gives me is just job_season_1, job_season_2, job_season_3, job_season_4, which lose the pair wise info between job and season.
融化给我的只是job_season_1,job_season_2,job_season_3,job_season_4,它们丢失了工作和季节之间的配对信息。
Thanks so much!非常感谢!
Here is the toy dataset in the table above:这是上表中的玩具数据集:
survey = data.frame(respondent_name = c("James", "Alice"),
job_name_1 = c("farmer", "livestock rearing"),
seasonality_type_1 = c("high / low", "high / low"),
job_name_2 = c("teacher", "barber"),
seasonality_type_2 = c("active / inactive", "active / inactive"),
cst_1_1 = c(5000, 100000),
rev_1_1 = c(6000, 200000),
cst_1_2 = c(2000, 20000),
rev_1_2 = c(3000, 30000),
cst_2_1 = c(100, 5000),
rev_2_1 = c(200, 7000),
cst_2_2 = c(0, 2000),
rev_2_2 = c(0, 0)
)
first split the seasonality type1 and type2 into 2 columns and then melt
, for melt i use the column numbers because your column names are too long首先将季节性类型 1 和类型 2 分成 2 列,然后
melt
,对于融化,我使用列号,因为您的列名太长
library(data.table)
survey[, c("s1", "s2") := tstrsplit(seasonality_type_1, "/")]
survey[, c("s1_2", "s2_2") := tstrsplit(seasonality_type_2, "/")]
melt(survey, id.vars=1, measure.vars=list(c(2,2,4,4),c(14,15,16,17), c(6,8,10,12), c(7,9,11,13)))
This uses tidyverse and not data.table, but gets you most of the way there...这使用 tidyverse 而不是 data.table,但可以让你大部分时间......
library(tidyverse)
survey2 <- survey %>%
separate(seasonality_type_1,
sep = " / ",
into = c("season_1_1", "season_1_2")) %>%
separate(seasonality_type_2,
sep = " / ",
into = c("season_2_1", "season_2_2")) %>%
pivot_longer(cols = starts_with("job_name"),
names_to = "job_number",
names_prefix = "job_name_",
values_to = "job"
)
survey2_1 <- survey2 %>%
select(respondent_name, job_number, job, ends_with("_1")) %>%
mutate(
season = 1,
cost = ifelse(job_number == 1, cst_1_1, cst_2_1)
)
survey2_2 <- survey2 %>%
select(respondent_name, job_number, job, ends_with("_2")) %>%
mutate(
season = 2,
cost = ifelse(job_number == 1, cst_1_2, cst_2_2)
)
What's still missing:还缺少什么:
library(tidyverse)
seasonality_desc <- tribble(
~ job_number, ~ season_number, ~ season,
1, 1, "high",
1, 2, "low",
2, 1, "active",
2, 2, "inactive"
)
survey %>%
select(!contains("seasonality_type")) %>%
rename_with(
~ str_replace(.x, pattern = "^(cst|rev)_([0-9])_([0-9])$",
replacement = "\\1_\\3_\\2"),
.cols = matches("^(cst|rev)_[0-9]_[0-9]$")
) %>%
pivot_longer(
!respondent_name, names_to = c(".value", "job_number"),
names_pattern = "(.*)_([0-9]+)$",
names_transform = list(job_number = as.numeric)
) %>%
pivot_longer(
matches("cst|rev"), names_to = c(".value", "season_number"),
names_sep = "_", names_transform = list(season_number = as.numeric)
) %>%
left_join(seasonality_desc, by = c("job_number", "season_number")) %>%
select(
respondent_name, job_number, job_name, season_number, season, cost = cst,
revenue = rev
)
#> # A tibble: 8 × 7
#> respondent_name job_number job_name season_number season cost revenue
#> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 James 1 farmer 1 high 5e3 6000
#> 2 James 1 farmer 2 low 2e3 3000
#> 3 James 2 teacher 1 active 1e2 200
#> 4 James 2 teacher 2 inactive 0 0
#> 5 Alice 1 livestock rea… 1 high 1e5 200000
#> 6 Alice 1 livestock rea… 2 low 2e4 30000
#> 7 Alice 2 barber 1 active 5e3 7000
#> 8 Alice 2 barber 2 inactive 2e3 0
Created on 2022-02-04 by the reprex package (v2.0.1)由代表 package (v2.0.1) 于 2022 年 2 月 4 日创建
We'll use two instances of tidyr 's pivot_longer()
to lengthen the dataset, but there's some other wrangling that's needed in between.我们将使用tidyr的
pivot_longer()
的两个实例来延长数据集,但在这之间还需要一些其他的争论。
First, seasonality_type_1
and seasonality_type_2
do not provide any individual-level information, but rather only metadata specifying the labels that correspond to the different season numbers in the context of the job_number
.首先,
seasonality_type_1
和seasonality_type_2
不提供任何个人级别的信息,而只是指定与job_number
上下文中不同季节编号相对应的标签的元数据。 So, manually make a tibble containing these labels, and delete the metadata from the original dataset:因此,手动创建一个包含这些标签的 tibble,并从原始数据集中删除元数据:
library(tidyverse)
seasonality_desc <- tribble(
~ job_number, ~ season_number, ~ season,
1, 1, "high",
1, 2, "low",
2, 1, "active",
2, 2, "inactive"
)
survey2 <- select(survey, !contains("seasonality_type"))
knitr::kable(survey2)
respondent_name![]() |
job_name_1![]() |
job_name_2![]() |
cst_1_1 ![]() |
rev_1_1 ![]() |
cst_1_2 ![]() |
rev_1_2 ![]() |
cst_2_1 ![]() |
rev_2_1 ![]() |
cst_2_2 ![]() |
rev_2_2 ![]() |
---|---|---|---|---|---|---|---|---|---|---|
James![]() |
farmer![]() |
teacher![]() |
5e+03 ![]() |
6e+03 ![]() |
2000 ![]() |
3000 ![]() |
100 ![]() |
200 ![]() |
0 ![]() |
0 ![]() |
Alice![]() |
livestock rearing![]() |
barber![]() |
1e+05 ![]() |
2e+05 ![]() |
20000 ![]() |
30000 ![]() |
5000 ![]() |
7000 ![]() |
2000 ![]() |
0 ![]() |
The first pivot is by job name and number.第一个 pivot 按作业名称和编号。 Unfortunately, the job number appears in the middle of the
cst
/ rev
column names, and this doesn't play well with pivot_longer()
.不幸的是,工作编号出现在
cst
/ rev
列名的中间,这与pivot_longer()
配合得不好。 You can switch the job number with the season number manually, but if you're looking for an automated way of doing it, here is one way using regex:您可以手动将工作编号与季节编号切换,但如果您正在寻找一种自动化的方式,这是使用正则表达式的一种方法:
survey3 <- rename_with(
survey2,
~ str_replace(.x, pattern = "^(cst|rev)_([0-9])_([0-9])$",
replacement = "\\1_\\3_\\2"),
.cols = matches("^(cst|rev)_[0-9]_[0-9]$")
)
knitr::kable(survey3)
respondent_name![]() |
job_name_1![]() |
job_name_2![]() |
cst_1_1 ![]() |
rev_1_1 ![]() |
cst_2_1 ![]() |
rev_2_1 ![]() |
cst_1_2 ![]() |
rev_1_2 ![]() |
cst_2_2 ![]() |
rev_2_2 ![]() |
---|---|---|---|---|---|---|---|---|---|---|
James![]() |
farmer![]() |
teacher![]() |
5e+03 ![]() |
6e+03 ![]() |
2000 ![]() |
3000 ![]() |
100 ![]() |
200 ![]() |
0 ![]() |
0 ![]() |
Alice![]() |
livestock rearing![]() |
barber![]() |
1e+05 ![]() |
2e+05 ![]() |
20000 ![]() |
30000 ![]() |
5000 ![]() |
7000 ![]() |
2000 ![]() |
0 ![]() |
Now notice a pattern: all column names except respondent_name
is of the form <VARIABLE>_<JOB NUMBER>
-- exactly the specification we'll provide pivot_longer()
(apologies, stack overflow is having a hard time rendering the markdown output for the table):现在注意一个模式:除了
respondent_name
之外的所有列名的格式都是<VARIABLE>_<JOB NUMBER>
- 正是我们将提供pivot_longer()
的规范(抱歉,堆栈溢出很难为桌子):
survey4 <- survey3 %>%
pivot_longer(
!respondent_name, names_to = c(".value", "job_number"),
names_pattern = "(.*)_([0-9]+)$",
names_transform = list(job_number = as.numeric)
)
knitr::kable(survey4)
|respondent_name | job_number|job_name | cst_1| rev_1| cst_2| rev_2|
|:---------------|----------:|:-----------------|-----:|-----:|-----:|-----:|
|James | 1|farmer | 5e+03| 6e+03| 2000| 3000|
|James | 2|teacher | 1e+02| 2e+02| 0| 0|
|Alice | 1|livestock rearing | 1e+05| 2e+05| 20000| 30000|
|Alice | 2|barber | 5e+03| 7e+03| 2000| 0|
Now for the second pivot, notice that the cst
and rev
columns are of the form <VARIABLE>_<SEASON NUMBER>
-- exactly the specification we'll provide pivot_longer()
.现在对于第二个 pivot,请注意
cst
和rev
列的格式为<VARIABLE>_<SEASON NUMBER>
- 正是我们将提供的规范pivot_longer()
。 This time, it's enough to just specify that an underscore is separating the names (we couldn't last time because there were more than one underscores present in the column names) (apologies again about the table):这一次,只需指定一个下划线来分隔名称就足够了(我们上次不能,因为列名中存在多个下划线)(再次为表格道歉):
survey5 <- survey4 %>%
pivot_longer(
matches("cst|rev"), names_to = c(".value", "season_number"),
names_sep = "_", names_transform = list(season_number = as.numeric)
)
knitr::kable(survey5)
|respondent_name | job_number|job_name | season_number| cst| rev|
|:---------------|----------:|:-----------------|-------------:|-----:|-----:|
|James | 1|farmer | 1| 5e+03| 6e+03|
|James | 1|farmer | 2| 2e+03| 3e+03|
|James | 2|teacher | 1| 1e+02| 2e+02|
|James | 2|teacher | 2| 0e+00| 0e+00|
|Alice | 1|livestock rearing | 1| 1e+05| 2e+05|
|Alice | 1|livestock rearing | 2| 2e+04| 3e+04|
|Alice | 2|barber | 1| 5e+03| 7e+03|
|Alice | 2|barber | 2| 2e+03| 0e+00|
Now we can add in the labels stored in the metadata seasonality_desc
we made earlier, reorder the columns as desired, and rename "cst" to "cost" and "rev" to "revenue":现在我们可以添加存储在我们之前创建的元数据
seasonality_desc
中的标签,根据需要重新排序列,并将“cst”重命名为“cost”,将“rev”重命名为“revenue”:
survey6 <- survey5 %>%
left_join(seasonality_desc, by = c("job_number", "season_number")) %>%
select(
respondent_name, job_number, job_name, season_number, season, cost = cst,
revenue = rev
)
knitr::kable(survey6)
respondent_name![]() |
job_number![]() |
job_name![]() |
season_number![]() |
season![]() |
cost![]() |
revenue![]() |
---|---|---|---|---|---|---|
James![]() |
1 ![]() |
farmer![]() |
1 ![]() |
high![]() |
5e+03 ![]() |
6e+03 ![]() |
James![]() |
1 ![]() |
farmer![]() |
2 ![]() |
low![]() |
2e+03 ![]() |
3e+03 ![]() |
James![]() |
2 ![]() |
teacher![]() |
1 ![]() |
active![]() |
1e+02 ![]() |
2e+02 ![]() |
James![]() |
2 ![]() |
teacher![]() |
2 ![]() |
inactive![]() |
0e+00 ![]() |
0e+00 ![]() |
Alice![]() |
1 ![]() |
livestock rearing![]() |
1 ![]() |
high![]() |
1e+05 ![]() |
2e+05 ![]() |
Alice![]() |
1 ![]() |
livestock rearing![]() |
2 ![]() |
low![]() |
2e+04 ![]() |
3e+04 ![]() |
Alice![]() |
2 ![]() |
barber![]() |
1 ![]() |
active![]() |
5e+03 ![]() |
7e+03 ![]() |
Alice![]() |
2 ![]() |
barber![]() |
2 ![]() |
inactive![]() |
2e+03 ![]() |
0e+00 ![]() |
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.