简体   繁体   English

如何在 R 中进行嵌套的 2 级整形?

[英]How to do a nested 2 level reshape in R?

I am currently working on a survey project that asks about respondents' economic activities.我目前正在做一个调查项目,询问受访者的经济活动。 So we will ask the jobs/businesses they have, and since some jobs/businesses are seasonal, we ask about their revenue, cost, and profit in each season.所以我们会询问他们拥有的工作/业务,并且由于一些工作/业务是季节性的,我们会询问他们每个季节的收入、成本和利润。

I am a bit stuck on how to reshape this in R from wide to long.我有点坚持如何在 R 中从宽到长重塑这个。 This is what I have:这就是我所拥有的:

respondent_name受访者姓名 job_name_1工作名称_1 seasonality_type_1季节性_type_1 job_name_2工作名称_2 seasonality_type_2季节性类型_2 cst_1_1 cst_1_1 rev_1_1 rev_1_1 cst_1_2 cst_1_2 rev_1_2 rev_1_2 cst_2_1 cst_2_1 rev_2_1 rev_2_1 cst_2_2 cst_2_2 rev_2_2 rev_2_2
James詹姆士 farmer农民 high / low前高后低 teacher老师 active / inactive活跃/不活跃 5000 5000 6000 6000 2000 2000 3000 3000 100 100 200 200 0 0 0 0
Alice爱丽丝 livestock rearing牲畜饲养 high / low前高后低 barber理发师 active / inactive活跃/不活跃 100000 100000 200000 200000 20000 20000 30000 30000 5000 5000 7000 7000 2000 2000 0 0

I want the output to look like:我希望 output 看起来像:

respondent_name受访者姓名 job_number工作编号 job_name工作名称 season_number季数 season季节 cost成本 revenue收入
James詹姆士 1 1 farmer农民 1 1 high高的 5000 5000 6000 6000
James詹姆士 1 1 farmer农民 2 2 low低的 2000 2000 3000 3000
James詹姆士 2 2 teacher老师 1 1 active积极的 100 100 200 200
James詹姆士 2 2 teacher老师 2 2 inactive不活跃 0 0 0 0
Alice爱丽丝 1 1 livestock rearing牲畜饲养 1 1 high高的 100000 100000 200000 200000
Alice爱丽丝 1 1 livestock rearing牲畜饲养 2 2 low低的 2000 2000 3000 3000
Alice爱丽丝 2 2 barber理发师 1 1 active积极的 5000 5000 7000 7000
Alice爱丽丝 2 2 barber理发师 2 2 inactive不活跃 2000 2000 0 0

Anyone has idea how to do this?任何人都知道如何做到这一点? I used the melt function but can't figure out how to not lose the 2 level information.我使用了熔化 function 但不知道如何不丢失 2 级信息。 What melt gives me is just job_season_1, job_season_2, job_season_3, job_season_4, which lose the pair wise info between job and season.融化给我的只是job_season_1,job_season_2,job_season_3,job_season_4,它们丢失了工作和季节之间的配对信息。

Thanks so much!非常感谢!

Here is the toy dataset in the table above:这是上表中的玩具数据集:

survey = data.frame(respondent_name = c("James", "Alice"),
                    job_name_1 = c("farmer", "livestock rearing"),
                    seasonality_type_1 = c("high / low", "high / low"), 
                    job_name_2 = c("teacher", "barber"),
                    seasonality_type_2 = c("active / inactive", "active / inactive"),
                    cst_1_1 = c(5000, 100000),
                    rev_1_1 = c(6000, 200000),
                    cst_1_2 = c(2000, 20000),
                    rev_1_2 = c(3000, 30000),
                    cst_2_1 = c(100, 5000),
                    rev_2_1 = c(200, 7000),
                    cst_2_2 = c(0, 2000),
                    rev_2_2 = c(0, 0)
                    )

first split the seasonality type1 and type2 into 2 columns and then melt , for melt i use the column numbers because your column names are too long首先将季节性类型 1 和类型 2 分成 2 列,然后melt ,对于融化,我使用列号,因为您的列名太长

library(data.table)
survey[, c("s1", "s2") := tstrsplit(seasonality_type_1, "/")]
survey[, c("s1_2", "s2_2") := tstrsplit(seasonality_type_2, "/")]
melt(survey, id.vars=1, measure.vars=list(c(2,2,4,4),c(14,15,16,17), c(6,8,10,12), c(7,9,11,13)))

This uses tidyverse and not data.table, but gets you most of the way there...这使用 tidyverse 而不是 data.table,但可以让你大部分时间......

library(tidyverse)

survey2 <- survey %>%
  separate(seasonality_type_1, 
       sep = " / ", 
       into = c("season_1_1", "season_1_2")) %>%
  separate(seasonality_type_2,
       sep = " / ",
       into = c("season_2_1", "season_2_2")) %>%
  pivot_longer(cols = starts_with("job_name"),
           names_to = "job_number",
           names_prefix = "job_name_",
           values_to = "job"
          )

  survey2_1 <- survey2 %>%
    select(respondent_name, job_number, job, ends_with("_1")) %>%
    mutate(
      season = 1,
      cost = ifelse(job_number == 1, cst_1_1, cst_2_1)
    )

  survey2_2 <- survey2 %>%
    select(respondent_name, job_number, job, ends_with("_2")) %>%
    mutate(
      season = 2,
      cost = ifelse(job_number == 1, cst_1_2, cst_2_2)
    )

What's still missing:还缺少什么:

  • Similar calculations for revenue as cost收入与成本的类似计算
  • Removing unneeded columns, ending with similar column names in each of survey2_1 and survey2_2删除不需要的列,在survey2_1 和survey2_2 中以相似的列名结尾
  • Joining (rbind) survey2_1 and survey2_2加入(rbind)survey2_1 和survey2_2

The short answer简短的回答

library(tidyverse)
seasonality_desc <- tribble(
  ~ job_number, ~ season_number, ~ season,
  1,            1,               "high",
  1,            2,               "low",
  2,            1,               "active",
  2,            2,               "inactive"
)
survey %>% 
  select(!contains("seasonality_type")) %>% 
  rename_with(
    ~ str_replace(.x, pattern = "^(cst|rev)_([0-9])_([0-9])$", 
                  replacement = "\\1_\\3_\\2"), 
    .cols = matches("^(cst|rev)_[0-9]_[0-9]$")
  ) %>% 
  pivot_longer(
    !respondent_name, names_to = c(".value", "job_number"), 
    names_pattern = "(.*)_([0-9]+)$", 
    names_transform = list(job_number = as.numeric)
  ) %>% 
  pivot_longer(
    matches("cst|rev"), names_to = c(".value", "season_number"),
    names_sep = "_", names_transform = list(season_number = as.numeric)
  ) %>% 
  left_join(seasonality_desc, by = c("job_number", "season_number")) %>% 
  select(
    respondent_name, job_number, job_name, season_number, season, cost = cst, 
    revenue = rev
  )
#> # A tibble: 8 × 7
#>   respondent_name job_number job_name       season_number season    cost revenue
#>   <chr>                <dbl> <chr>                  <dbl> <chr>    <dbl>   <dbl>
#> 1 James                    1 farmer                     1 high       5e3    6000
#> 2 James                    1 farmer                     2 low        2e3    3000
#> 3 James                    2 teacher                    1 active     1e2     200
#> 4 James                    2 teacher                    2 inactive   0         0
#> 5 Alice                    1 livestock rea…             1 high       1e5  200000
#> 6 Alice                    1 livestock rea…             2 low        2e4   30000
#> 7 Alice                    2 barber                     1 active     5e3    7000
#> 8 Alice                    2 barber                     2 inactive   2e3       0

Created on 2022-02-04 by the reprex package (v2.0.1)代表 package (v2.0.1) 于 2022 年 2 月 4 日创建

The long answer长答案

We'll use two instances of tidyr 's pivot_longer() to lengthen the dataset, but there's some other wrangling that's needed in between.我们将使用tidyrpivot_longer()的两个实例来延长数据集,但在这之间还需要一些其他的争论。

Remove metadata删除元数据

First, seasonality_type_1 and seasonality_type_2 do not provide any individual-level information, but rather only metadata specifying the labels that correspond to the different season numbers in the context of the job_number .首先, seasonality_type_1seasonality_type_2不提供任何个人级别的信息,而只是指定与job_number上下文中不同季节编号相对应的标签的元数据。 So, manually make a tibble containing these labels, and delete the metadata from the original dataset:因此,手动创建一个包含这些标签的 tibble,并从原始数据集中删除元数据:

library(tidyverse)

seasonality_desc <- tribble(
  ~ job_number, ~ season_number, ~ season,
  1,            1,               "high",
  1,            2,               "low",
  2,            1,               "active",
  2,            2,               "inactive"
)

survey2 <-  select(survey, !contains("seasonality_type"))
knitr::kable(survey2)
respondent_name受访者姓名 job_name_1工作名称_1 job_name_2工作名称_2 cst_1_1 cst_1_1 rev_1_1 rev_1_1 cst_1_2 cst_1_2 rev_1_2 rev_1_2 cst_2_1 cst_2_1 rev_2_1 rev_2_1 cst_2_2 cst_2_2 rev_2_2 rev_2_2
James詹姆士 farmer农民 teacher老师 5e+03 5e+03 6e+03 6e+03 2000 2000 3000 3000 100 100 200 200 0 0 0 0
Alice爱丽丝 livestock rearing牲畜饲养 barber理发师 1e+05 1e+05 2e+05 2e+05 20000 20000 30000 30000 5000 5000 7000 7000 2000 2000 0 0

First pivot: job number第一个pivot:作业号

The first pivot is by job name and number.第一个 pivot 按作业名称和编号。 Unfortunately, the job number appears in the middle of the cst / rev column names, and this doesn't play well with pivot_longer() .不幸的是,工作编号出现在cst / rev列名的中间,这与pivot_longer()配合得不好。 You can switch the job number with the season number manually, but if you're looking for an automated way of doing it, here is one way using regex:您可以手动将工作编号与季节编号切换,但如果您正在寻找一种自动化的方式,这是使用正则表达式的一种方法:

survey3 <- rename_with(
  survey2,
  ~ str_replace(.x, pattern = "^(cst|rev)_([0-9])_([0-9])$", 
                replacement = "\\1_\\3_\\2"), 
  .cols = matches("^(cst|rev)_[0-9]_[0-9]$")
)
knitr::kable(survey3)
respondent_name受访者姓名 job_name_1工作名称_1 job_name_2工作名称_2 cst_1_1 cst_1_1 rev_1_1 rev_1_1 cst_2_1 cst_2_1 rev_2_1 rev_2_1 cst_1_2 cst_1_2 rev_1_2 rev_1_2 cst_2_2 cst_2_2 rev_2_2 rev_2_2
James詹姆士 farmer农民 teacher老师 5e+03 5e+03 6e+03 6e+03 2000 2000 3000 3000 100 100 200 200 0 0 0 0
Alice爱丽丝 livestock rearing牲畜饲养 barber理发师 1e+05 1e+05 2e+05 2e+05 20000 20000 30000 30000 5000 5000 7000 7000 2000 2000 0 0

Now notice a pattern: all column names except respondent_name is of the form <VARIABLE>_<JOB NUMBER> -- exactly the specification we'll provide pivot_longer() (apologies, stack overflow is having a hard time rendering the markdown output for the table):现在注意一个模式:除了respondent_name之外的所有列名的格式都是<VARIABLE>_<JOB NUMBER> - 正是我们将提供pivot_longer()的规范(抱歉,堆栈溢出很难为桌子):

survey4 <- survey3 %>% 
  pivot_longer(
    !respondent_name, names_to = c(".value", "job_number"), 
    names_pattern = "(.*)_([0-9]+)$", 
    names_transform = list(job_number = as.numeric)
  )
knitr::kable(survey4)

|respondent_name | job_number|job_name          | cst_1| rev_1| cst_2| rev_2|
|:---------------|----------:|:-----------------|-----:|-----:|-----:|-----:|
|James           |          1|farmer            | 5e+03| 6e+03|  2000|  3000|
|James           |          2|teacher           | 1e+02| 2e+02|     0|     0|
|Alice           |          1|livestock rearing | 1e+05| 2e+05| 20000| 30000|
|Alice           |          2|barber            | 5e+03| 7e+03|  2000|     0|

Second pivot: season number第二部pivot:季号

Now for the second pivot, notice that the cst and rev columns are of the form <VARIABLE>_<SEASON NUMBER> -- exactly the specification we'll provide pivot_longer() .现在对于第二个 pivot,请注意cstrev列的格式为<VARIABLE>_<SEASON NUMBER> - 正是我们将提供的规范pivot_longer() This time, it's enough to just specify that an underscore is separating the names (we couldn't last time because there were more than one underscores present in the column names) (apologies again about the table):这一次,只需指定一个下划线来分隔名称就足够了(我们上次不能,因为列名中存在多个下划线)(再次为表格道歉):

survey5 <- survey4 %>%
  pivot_longer(
    matches("cst|rev"), names_to = c(".value", "season_number"),
    names_sep = "_", names_transform = list(season_number = as.numeric)
  )
knitr::kable(survey5)

|respondent_name | job_number|job_name          | season_number|   cst|   rev|
|:---------------|----------:|:-----------------|-------------:|-----:|-----:|
|James           |          1|farmer            |             1| 5e+03| 6e+03|
|James           |          1|farmer            |             2| 2e+03| 3e+03|
|James           |          2|teacher           |             1| 1e+02| 2e+02|
|James           |          2|teacher           |             2| 0e+00| 0e+00|
|Alice           |          1|livestock rearing |             1| 1e+05| 2e+05|
|Alice           |          1|livestock rearing |             2| 2e+04| 3e+04|
|Alice           |          2|barber            |             1| 5e+03| 7e+03|
|Alice           |          2|barber            |             2| 2e+03| 0e+00|

Add in labels and clean添加标签并清洁

Now we can add in the labels stored in the metadata seasonality_desc we made earlier, reorder the columns as desired, and rename "cst" to "cost" and "rev" to "revenue":现在我们可以添加存储在我们之前创建的元数据seasonality_desc中的标签,根据需要重新排序列,并将“cst”重命名为“cost”,将“rev”重命名为“revenue”:

survey6 <- survey5 %>% 
  left_join(seasonality_desc, by = c("job_number", "season_number")) %>% 
  select(
    respondent_name, job_number, job_name, season_number, season, cost = cst, 
    revenue = rev
  )
knitr::kable(survey6)
respondent_name受访者姓名 job_number工作编号 job_name工作名称 season_number季数 season季节 cost成本 revenue收入
James詹姆士 1 1 farmer农民 1 1 high高的 5e+03 5e+03 6e+03 6e+03
James詹姆士 1 1 farmer农民 2 2 low低的 2e+03 2e+03 3e+03 3e+03
James詹姆士 2 2 teacher老师 1 1 active积极的 1e+02 1e+02 2e+02 2e+02
James詹姆士 2 2 teacher老师 2 2 inactive不活跃 0e+00 0e+00 0e+00 0e+00
Alice爱丽丝 1 1 livestock rearing牲畜饲养 1 1 high高的 1e+05 1e+05 2e+05 2e+05
Alice爱丽丝 1 1 livestock rearing牲畜饲养 2 2 low低的 2e+04 2e+04 3e+04 3e+04
Alice爱丽丝 2 2 barber理发师 1 1 active积极的 5e+03 5e+03 7e+03 7e+03
Alice爱丽丝 2 2 barber理发师 2 2 inactive不活跃 2e+03 2e+03 0e+00 0e+00

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM