How do I reshape an R dataframe from long to wide without using the long variable values as the new variable names?

I have a long dataframe that lists the top 3 employers of each occupation code (3 rows per occupation code). It looks like this.

occcode  employer
1        top employer for occcode1
1        2nd employer for occcode 1
1        3rd employer for occcode 1
2        top employer for occcode2
2        2nd employer for occcode 2
2        3rd employer for occcode 2

I want to reshape it so that I have one row per occupation code, and columns named "emp1", "emp2", and "emp3" that are respectively populated with the 1st-3rd employers of that occupation code.

occcode  employer1                  employer2                   employer3
1        top employer for occcode1  2nd employer for occcode 1  3rd employer for occcode 1
2        top employer for occcode2  2nd employer for occcode 2  3rd employer for occcode 2

I previously thought the spread() function would work. But after reading the documentation and testing it, I found it doesn't produce what I have in mind, because it requires the values in "employer" in the long version of the data to be standardized (such that there are only 3 employer names); that's not the case, because employer names vary a lot across occupation codes. What is the best way to reshape the data in line with what I need?

I removed the last row of source data to show that this should work for variable numbers of employers per occcode:

library(tidyverse)

data.frame(
  stringsAsFactors = FALSE,
  occcode = c(1L, 1L, 1L, 2L, 2L),
  employer = c("top employer for occcode1",
               "2nd employer for occcode 1", "3rd employer for occcode 1",
               "top employer for occcode2",
               "2nd employer for occcode 2")
) %>%
  group_by(occcode) %>%
  # Label the employers within each occcode: employer1, employer2, ...
  mutate(col = paste0("employer", row_number())) %>%
  ungroup() %>%
  # Use the generated labels, not the employer text, as the new column names
  pivot_wider(names_from = col, values_from = employer)

Result

# A tibble: 2 × 4
  occcode employer1                 employer2                  employer3                 
    <int> <chr>                     <chr>                      <chr>                     
1       1 top employer for occcode1 2nd employer for occcode 1 3rd employer for occcode 1
2       2 top employer for occcode2 2nd employer for occcode 2 NA  
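
The same idea (number the rows within each occcode, then spread on that number instead of on the employer text) can also be sketched in base R without any packages. This is only an illustration, not part of the answer above: the names `long`, `wide`, and `rank` are chosen here for the example, and it reuses the five-row sample data.

```r
# Same five-row sample as above, so occcode 2 has only two employers.
long <- data.frame(
  occcode = c(1L, 1L, 1L, 2L, 2L),
  employer = c("top employer for occcode1",
               "2nd employer for occcode 1", "3rd employer for occcode 1",
               "top employer for occcode2",
               "2nd employer for occcode 2"),
  stringsAsFactors = FALSE
)

# Number the rows within each occcode (hypothetical helper column `rank`).
long$rank <- ave(seq_along(long$occcode), long$occcode, FUN = seq_along)

# Spread on the rank; sep = "" makes the new names employer1, employer2, ...
wide <- reshape(long, direction = "wide",
                idvar = "occcode", timevar = "rank",
                v.names = "employer", sep = "")
rownames(wide) <- NULL
wide
```

As in the tidyverse version, an occcode with fewer employers than the maximum gets NA in the remaining columns.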

Here is another approach:

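library(data.table)
# df is the data frame under "Input". The column label is built from the first
# character of employer: "t" (from "top") becomes "1"; "2" and "3" are kept.
dcast(
  setDT(df)[, emp := {
    emp = substr(employer, 1, 1)
    paste0("employer", fifelse(emp == "t", "1", emp))
  }],
  occcode ~ emp, value.var = "employer"
)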

Output:

   occcode                 employer1                  employer2                  employer3
1:       1 top employer for occcode1 2nd employer for occcode 1 3rd employer for occcode 1
2:       2 top employer for occcode2 2nd employer for occcode 2 3rd employer for occcode 2

Input:

df <- structure(list(occcode = c(1L, 1L, 1L, 2L, 2L, 2L), employer = c("top employer for occcode1", 
"2nd employer for occcode 1", "3rd employer for occcode 1", "top employer for occcode2", 
"2nd employer for occcode 2", "3rd employer for occcode 2")), row.names = c(NA, 
-6L), class = "data.frame")
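
The approach above depends on each employer string starting with "t", "2", or "3", which real employer names will not do. A possible variant, sketched here as an assumption rather than as the answerer's method, derives the column label from the row position within each occcode using data.table's rowid(). The names `dt` and `wide` are chosen for this example.

```r
library(data.table)

# Same six rows as the Input above; `dt` is a name chosen for this sketch.
dt <- data.table(
  occcode = c(1L, 1L, 1L, 2L, 2L, 2L),
  employer = c("top employer for occcode1",
               "2nd employer for occcode 1", "3rd employer for occcode 1",
               "top employer for occcode2",
               "2nd employer for occcode 2", "3rd employer for occcode 2")
)

# rowid() numbers the rows within each occcode; prefix = "employer" turns
# those numbers into the labels employer1, employer2, employer3.
wide <- dcast(dt, occcode ~ rowid(occcode, prefix = "employer"),
              value.var = "employer")
wide
```

Because the label comes from row order rather than from the text, this relies on the rows already being sorted from top employer to third employer within each occcode.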

