
Reshape from wide to long with groups of variables

This question is very similar to an already existing question.

However, I am unable to extend this to multiple groups of variables. This is the dataset I am dealing with:

A tibble: 12 x 9
   Month Cabo_BU_PCT Acapulco_BU_PCT Cabo_LOS_AVG Acapulco_LOS_AVG BED_BUGS_Cabo BED_BUGS_Acapulco TOTAL_OCCUPIED_Cabo TOTAL_OCCUPIED_Acapulco

       1   0.6470034       0.6260116     5.223000         4.307667             5                 3               19216                    6498
       2   0.6167027       0.6777457     5.893571         4.247500             3                 0               17095                    6566
       3   0.6372108       0.6348126     5.229677         4.327742             5                 1               19556                    6809
       4   0.6357912       0.6548170     5.356667         4.220000             4                 6               18883                    6797
       5   0.6449006       0.6409659     5.344194         4.162903             2                 5               19792                    6875
       6   0.6747811       0.6935453     5.812667         4.362000             4                 3               20041                    7199
       7   0.6697947       0.6932687     5.544516         4.462903             5                 6               20556                    7436
       8   0.6595960       0.6777923     5.260323         4.135806             0                 7               20243                    7270
       9   0.6792256       0.6863198     5.424333         4.133333             5                 0               20173                    7124
      10   0.6976214       0.7370875     5.419677         4.350000             3                 3               21410                    7906
      11   0.6600337       0.6615607     5.450000         4.184333             3                 2               19603                    6867
      12   0.6761812       0.6773261     5.347097         4.318710             2                 2               20752                    7265

My goal is to reshape this into a long format like the one below, where the columns Cabo_BU_PCT and Acapulco_BU_PCT are stacked under the column name BU_PCT, the columns Cabo_LOS_AVG and Acapulco_LOS_AVG are stacked under the column name LOS_AVG, and so on.

  Month    Location    BU_PCT      LOS_AVG     BED_BUGS       TOTAL_OCCUPIED
  1        Cabo        0.6470034   5.223000    5              19216
  1        Acapulco    0.6260116   4.307667    3              6498
  2        Cabo        0.6167027   5.893571    3              17095
  2        Acapulco    0.6777457   4.247500    0              6566
  .
  .
  .
  12       Cabo        0.6761812   5.347097    2              20752
  12       Acapulco    0.6773261   4.318710    2              7265  

Any help in reshaping this data frame is much appreciated. Thanks.

======== dataset ===========

df_wide <- structure(list(Month = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
), Cabo_BU_PCT = c(0.647003367003367, 0.616702741702742, 0.637210817855979, 
0.635791245791246, 0.644900619094168, 0.674781144781145, 0.669794721407625, 
0.65959595959596, 0.679225589225589, 0.69762137504073, 0.66003367003367, 
0.676181166503747), Acapulco_BU_PCT = c(0.626011560693642, 0.677745664739884, 
0.634812604885325, 0.654816955684008, 0.640965877307477, 0.69354527938343, 
0.693268692895767, 0.677792280440052, 0.686319845857418, 0.737087451053515, 
0.661560693641619, 0.677326123438374), Cabo_LOS_AVG = c(5.223, 
5.89357142857143, 5.22967741935484, 5.35666666666667, 5.3441935483871, 
5.81266666666667, 5.54451612903226, 5.26032258064516, 5.42433333333333, 
5.41967741935484, 5.45, 5.34709677419355), Acapulco_LOS_AVG = c(4.30766666666667, 
4.2475, 4.32774193548387, 4.22, 4.16290322580645, 4.362, 4.46290322580645, 
4.1358064516129, 4.13333333333333, 4.35, 4.18433333333333, 4.31870967741935
), BED_BUGS_Cabo = c(5, 3, 5, 4, 2, 4, 5, 0, 5, 3, 3, 2), BED_BUGS_Acapulco = c(3, 
0, 1, 6, 5, 3, 6, 7, 0, 3, 2, 2), TOTAL_OCCUPIED_Cabo = c(19216, 
17095, 19556, 18883, 19792, 20041, 20556, 20243, 20173, 21410, 
19603, 20752), TOTAL_OCCUPIED_Acapulco = c(6498, 6566, 6809, 
6797, 6875, 7199, 7436, 7270, 7124, 7906, 6867, 7265)), class = c("tbl_df", 
"tbl", "data.frame"), .Names = c("Month", "Cabo_BU_PCT", "Acapulco_BU_PCT", 
"Cabo_LOS_AVG", "Acapulco_LOS_AVG", "BED_BUGS_Cabo", "BED_BUGS_Acapulco", 
"TOTAL_OCCUPIED_Cabo", "TOTAL_OCCUPIED_Acapulco"), row.names = c(NA, 
-12L))

If you've only got two locations, you can put them directly in the regex, accounting for the fact that they can appear at the beginning or the end of the column name:

library(tidyverse)

df_wide %>% 
    gather(variable, value, -Month) %>% 
    mutate(location = sub('.*(Cabo|Acapulco).*', '\\1', variable), 
           variable = sub('_?(Cabo|Acapulco)_?', '', variable)) %>% 
    spread(variable, value)
#> # A tibble: 24 x 6
#>    Month location BED_BUGS    BU_PCT  LOS_AVG TOTAL_OCCUPIED
#>  * <dbl>    <chr>    <dbl>     <dbl>    <dbl>          <dbl>
#>  1     1 Acapulco        3 0.6260116 4.307667           6498
#>  2     1     Cabo        5 0.6470034 5.223000          19216
#>  3     2 Acapulco        0 0.6777457 4.247500           6566
#>  4     2     Cabo        3 0.6167027 5.893571          17095
#>  5     3 Acapulco        1 0.6348126 4.327742           6809
#>  6     3     Cabo        5 0.6372108 5.229677          19556
#>  7     4 Acapulco        6 0.6548170 4.220000           6797
#>  8     4     Cabo        4 0.6357912 5.356667          18883
#>  9     5 Acapulco        5 0.6409659 4.162903           6875
#> 10     5     Cabo        2 0.6449006 5.344194          19792
#> # ... with 14 more rows
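
To see what the two sub() calls in the pipeline above do, here is a minimal sketch that applies them to two of the column names on their own. The vector `variable` and the result name `measure` are introduced only for illustration.

```r
# Two sample column names, one with the location first and one with it last.
variable <- c("Cabo_BU_PCT", "BED_BUGS_Acapulco")

# Keep only the location: the whole name is replaced by the captured group.
location <- sub(".*(Cabo|Acapulco).*", "\\1", variable)

# Remove the location plus its adjoining underscore, leaving the measure name.
measure <- sub("_?(Cabo|Acapulco)_?", "", variable)

print(data.frame(variable, location, measure))
```

The first pattern works regardless of where the location sits in the name, and the optional underscores in the second pattern cover both the prefix and the suffix case.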

This uses reshape from base R; no packages are needed. varying= specifies which columns are to be combined: 2 and 3, 4 and 5, and so on. The new columns are given the names specified in v.names=, and the locations are specified in times=.

We could derive the varying=, v.names= and times= arguments from the headings, but given their irregularity that involves a messy regex, so it is simpler just to write them out (however, we show how to do it further below).

The result is ordered by location and then by month within location, but it could be re-sorted if desired.

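# Each element of varying= is one group of columns to stack; times= lists
# the locations in the same order as the columns within each group.
df_long <- reshape(df_wide, dir = "long", 
 varying = list(2:3, 4:5, 6:7, 8:9),
 v.names = c("BU_PCT", "LOS_AVG", "BED_BUGS", "TOTAL_OCCUPIED"),
 times = c("Cabo", "Acapulco"))[-7]  # [-7] drops the id column added by reshape
names(df_long)[2] <- "LOCATION"      # rename the default "time" column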
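
As a sketch of the re-sorting mentioned above, the following runs the same kind of reshape call on a small toy data frame and then orders the rows by month. The data frame `toy`, its values, and the measure names X and Y are made up for illustration; timevar= is used here to name the location column directly instead of renaming it afterwards.

```r
# Toy wide data frame (hypothetical values): two measures for two locations,
# with the location as a prefix for X and as a suffix for Y.
toy <- data.frame(Month = 1:2,
                  Cabo_X = c(10, 20), Acapulco_X = c(30, 40),
                  Y_Cabo = c(1, 2),   Y_Acapulco = c(3, 4))

toy_long <- reshape(toy, direction = "long",
                    varying = list(2:3, 4:5),
                    v.names = c("X", "Y"),
                    timevar = "LOCATION",
                    times = c("Cabo", "Acapulco"))

# Drop the id column added by reshape, sort by month, and reset the row names.
toy_long$id <- NULL
toy_long <- toy_long[order(toy_long$Month), ]
rownames(toy_long) <- NULL
print(toy_long)
```

Because order() is stable, the locations keep their original Cabo, Acapulco order within each month.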

Alternatively, if we did want to derive varying=, v.names= and times= from names(df_wide), it could be done like this, where names1 is names(df_wide) without the Month column. We use the fact that each location name consists of lower case letters except for its first letter, and that it either starts or ends each column name.

names1 <- names(df_wide)[-1]
pat <- "(.[a-z]+)_(.*)|(.*)_(.[a-z]+)"
varying <- split(names1, sub(pat, "\\2\\3", names1))
v.names <- names(varying)
locations <- unique(sub(pat, "\\1\\4", names1))

df_long <- reshape(df_wide, dir = "long", varying = varying, v.names = v.names, 
     times = locations)[-7]
names(df_long)[2] <- "LOCATION"
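
To make the intermediate values of the regex approach visible, here is a small sketch that writes out the column names by hand (so it runs without df_wide) and inspects what the pattern extracts. The names `measure` and `location` are introduced only for this illustration.

```r
# Column names of df_wide without Month, written out by hand.
names1 <- c("Cabo_BU_PCT", "Acapulco_BU_PCT", "Cabo_LOS_AVG", "Acapulco_LOS_AVG",
            "BED_BUGS_Cabo", "BED_BUGS_Acapulco",
            "TOTAL_OCCUPIED_Cabo", "TOTAL_OCCUPIED_Acapulco")

# First alternative: location_measure; second alternative: measure_location.
pat <- "(.[a-z]+)_(.*)|(.*)_(.[a-z]+)"

# Only one alternative matches each name, so the unmatched groups are empty.
measure  <- sub(pat, "\\2\\3", names1)
location <- sub(pat, "\\1\\4", names1)

# Group the column names by measure; split() orders the groups alphabetically.
varying <- split(names1, measure)
str(varying)
print(unique(location))
```

Note that split() sorts the groups alphabetically, so the measure columns in the reshaped result come out in that order rather than in the order of the wide data.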

As spread and gather have been superseded by pivot_wider and pivot_longer, here is an updated answer based on @alistaire's:

library(tidyverse)
df_wide %>%  
  pivot_longer(-Month)  %>%  
  mutate(location = sub('.*(Cabo|Acapulco).*', '\\1', name), 
         name  = sub('_?(Cabo|Acapulco)_?', '', name)) %>% 
  pivot_wider()
# ------ Outputs below ------
# A tibble: 24 × 6
   Month location BU_PCT LOS_AVG BED_BUGS TOTAL_OCCUPIED
   <dbl> <chr>     <dbl>   <dbl>    <dbl>          <dbl>
 1     1 Cabo      0.647    5.22        5          19216
 2     1 Acapulco  0.626    4.31        3           6498
 3     2 Cabo      0.617    5.89        3          17095
 4     2 Acapulco  0.678    4.25        0           6566
 5     3 Cabo      0.637    5.23        5          19556
 6     3 Acapulco  0.635    4.33        1           6809
 7     4 Cabo      0.636    5.36        4          18883
 8     4 Acapulco  0.655    4.22        6           6797
 9     5 Cabo      0.645    5.34        2          19792
10     5 Acapulco  0.641    4.16        5           6875
# … with 14 more rows
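
A possible alternative, sketched here under the assumption that renaming the columns first is acceptable, is to move every trailing location to the front of its column name and then let pivot_longer split the names itself using the ".value" sentinel in names_to. The data frame `df_small` is a cut-down stand-in for df_wide with rounded values, used only so the example is self-contained.

```r
library(tidyr)

# Cut-down stand-in for df_wide (two months, two measures, rounded values).
df_small <- data.frame(
  Month = c(1, 2),
  Cabo_BU_PCT = c(0.647, 0.617),
  Acapulco_BU_PCT = c(0.626, 0.678),
  BED_BUGS_Cabo = c(5, 3),
  BED_BUGS_Acapulco = c(3, 0)
)

# Move a trailing location to the front so every name reads <location>_<measure>.
names(df_small) <- sub("^(.*)_(Cabo|Acapulco)$", "\\2_\\1", names(df_small))

# ".value" tells pivot_longer that the second captured group names the output columns.
df_long_small <- pivot_longer(
  df_small,
  cols = -Month,
  names_to = c("location", ".value"),
  names_pattern = "^(Cabo|Acapulco)_(.*)$"
)
print(df_long_small)
```

This avoids the separate mutate and pivot_wider steps, at the cost of hard-coding the location names in two patterns.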
