Reshape from wide to long with multiple groups of variables
This question is very similar to an already existing question.
However, I am unable to extend that approach to multiple groups of variables. This is the dataset I am dealing with:
A tibble: 12 x 9
Month Cabo_BU_PCT Acapulco_BU_PCT Cabo_LOS_AVG Acapulco_LOS_AVG BED_BUGS_Cabo BED_BUGS_Acapulco TOTAL_OCCUPIED_Cabo TOTAL_OCCUPIED_Acapulco
1 0.6470034 0.6260116 5.223000 4.307667 5 3 19216 6498
2 0.6167027 0.6777457 5.893571 4.247500 3 0 17095 6566
3 0.6372108 0.6348126 5.229677 4.327742 5 1 19556 6809
4 0.6357912 0.6548170 5.356667 4.220000 4 6 18883 6797
5 0.6449006 0.6409659 5.344194 4.162903 2 5 19792 6875
6 0.6747811 0.6935453 5.812667 4.362000 4 3 20041 7199
7 0.6697947 0.6932687 5.544516 4.462903 5 6 20556 7436
8 0.6595960 0.6777923 5.260323 4.135806 0 7 20243 7270
9 0.6792256 0.6863198 5.424333 4.133333 5 0 20173 7124
10 0.6976214 0.7370875 5.419677 4.350000 3 3 21410 7906
11 0.6600337 0.6615607 5.450000 4.184333 3 2 19603 6867
12 0.6761812 0.6773261 5.347097 4.318710 2 2 20752 7265
My goal is to reshape this into the long format shown below, where the columns Cabo_BU_PCT and Acapulco_BU_PCT are stacked under a single column named BU_PCT, the columns Cabo_LOS_AVG and Acapulco_LOS_AVG are stacked under a column named LOS_AVG, and so on.
Month Location BU_PCT LOS_AVG BED_BUGS TOTAL_OCCUPIED
1 Cabo 0.6470034 5.223000 5 19216
1 Acapulco 0.6260116 4.307667 3 6498
2 Cabo 0.6167027 5.893571 3 17095
2 Acapulco 0.6777457 4.247500 0 6566
.
.
.
12 Cabo 0.6761812 5.347097 2 20752
12 Acapulco 0.6773261 4.318710 2 7265
Any help in reshaping this data frame is much appreciated. Thanks.
======== dataset ===========
df_wide <- structure(list(Month = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
), Cabo_BU_PCT = c(0.647003367003367, 0.616702741702742, 0.637210817855979,
0.635791245791246, 0.644900619094168, 0.674781144781145, 0.669794721407625,
0.65959595959596, 0.679225589225589, 0.69762137504073, 0.66003367003367,
0.676181166503747), Acapulco_BU_PCT = c(0.626011560693642, 0.677745664739884,
0.634812604885325, 0.654816955684008, 0.640965877307477, 0.69354527938343,
0.693268692895767, 0.677792280440052, 0.686319845857418, 0.737087451053515,
0.661560693641619, 0.677326123438374), Cabo_LOS_AVG = c(5.223,
5.89357142857143, 5.22967741935484, 5.35666666666667, 5.3441935483871,
5.81266666666667, 5.54451612903226, 5.26032258064516, 5.42433333333333,
5.41967741935484, 5.45, 5.34709677419355), Acapulco_LOS_AVG = c(4.30766666666667,
4.2475, 4.32774193548387, 4.22, 4.16290322580645, 4.362, 4.46290322580645,
4.1358064516129, 4.13333333333333, 4.35, 4.18433333333333, 4.31870967741935
), BED_BUGS_Cabo = c(5, 3, 5, 4, 2, 4, 5, 0, 5, 3, 3, 2), BED_BUGS_Acapulco = c(3,
0, 1, 6, 5, 3, 6, 7, 0, 3, 2, 2), TOTAL_OCCUPIED_Cabo = c(19216,
17095, 19556, 18883, 19792, 20041, 20556, 20243, 20173, 21410,
19603, 20752), TOTAL_OCCUPIED_Acapulco = c(6498, 6566, 6809,
6797, 6875, 7199, 7436, 7270, 7124, 7906, 6867, 7265)), class = c("tbl_df",
"tbl", "data.frame"), .Names = c("Month", "Cabo_BU_PCT", "Acapulco_BU_PCT",
"Cabo_LOS_AVG", "Acapulco_LOS_AVG", "BED_BUGS_Cabo", "BED_BUGS_Acapulco",
"TOTAL_OCCUPIED_Cabo", "TOTAL_OCCUPIED_Acapulco"), row.names = c(NA,
-12L))
If you've only got two locations, you can just put them in a regex, accounting for the fact that they could be at the beginning or the end of the name:
library(tidyverse)
df_wide %>%
gather(variable, value, -Month) %>%
mutate(location = sub('.*(Cabo|Acapulco).*', '\\1', variable),
variable = sub('_?(Cabo|Acapulco)_?', '', variable)) %>%
spread(variable, value)
#> # A tibble: 24 x 6
#> Month location BED_BUGS BU_PCT LOS_AVG TOTAL_OCCUPIED
#> * <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Acapulco 3 0.6260116 4.307667 6498
#> 2 1 Cabo 5 0.6470034 5.223000 19216
#> 3 2 Acapulco 0 0.6777457 4.247500 6566
#> 4 2 Cabo 3 0.6167027 5.893571 17095
#> 5 3 Acapulco 1 0.6348126 4.327742 6809
#> 6 3 Cabo 5 0.6372108 5.229677 19556
#> 7 4 Acapulco 6 0.6548170 4.220000 6797
#> 8 4 Cabo 4 0.6357912 5.356667 18883
#> 9 5 Acapulco 5 0.6409659 4.162903 6875
#> 10 5 Cabo 2 0.6449006 5.344194 19792
#> # ... with 14 more rows
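To see what the two sub() calls do, here is a small sketch that applies them to a few of the column names on their own. It uses base R only; the vector name nms is purely illustrative and not part of the answer above.

```r
# A few of the wide column names, with the location at the start or the end
nms <- c("Cabo_BU_PCT", "Acapulco_LOS_AVG", "BED_BUGS_Cabo", "TOTAL_OCCUPIED_Acapulco")

# Keep only the location, wherever it sits in the name
location <- sub(".*(Cabo|Acapulco).*", "\\1", nms)

# Delete the location together with the underscore that joins it to the rest
variable <- sub("_?(Cabo|Acapulco)_?", "", nms)

print(data.frame(nms, location, variable))
```

The second pattern makes both underscores optional, which is what lets one expression handle a leading location (Cabo_...) and a trailing one (..._Cabo).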
This uses reshape from base R. No packages are used. varying= specifies that columns 2 and 3 are to be combined, then 4 and 5, and so on. The new columns are given the names specified in v.names=, and the locations are specified in times=.
We could derive the varying=, v.names= and times= arguments from the headings, but given how irregular the headings are this involves a messy regex, so it is simpler just to write them out (however, we show how to do it further below).
The result is ordered by location and then by month within location, but it can be re-sorted if desired.
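# convert the tibble to a plain data.frame before calling base reshape()
df_long <- reshape(as.data.frame(df_wide), direction = "long",
  varying = list(2:3, 4:5, 6:7, 8:9),
  v.names = c("BU_PCT", "LOS_AVG", "BED_BUGS", "TOTAL_OCCUPIED"),
  times = c("Cabo", "Acapulco"))[-7]  # [-7] drops the generated id column
names(df_long)[2] <- "LOCATION"  # rename the default "time" column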
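As a rough illustration of how varying=, v.names= and times= line up, and of how the result might be re-sorted, here is a minimal sketch on a made-up two-month data frame. The names toy, X and Y are hypothetical. It also passes timevar= and idvar=, which are standard reshape() arguments, so that no renaming or column dropping is needed afterwards.

```r
# Hypothetical miniature of the wide layout: one measure with the location
# first (Cabo_X) and one with the location last (Y_Cabo)
toy <- data.frame(Month = 1:2,
                  Cabo_X = c(10, 20), Acapulco_X = c(11, 21),
                  Y_Cabo = c(1, 2),   Y_Acapulco = c(3, 4))

long <- reshape(toy, direction = "long",
                varying = list(2:3, 4:5),      # column pairs to stack
                v.names = c("X", "Y"),         # one new name per pair
                timevar = "Location",          # name of the column holding times=
                times   = c("Cabo", "Acapulco"),
                idvar   = "Month")             # reuse Month as the row identifier

# Re-sort by month, keeping Cabo before Acapulco within each month
long <- long[order(long$Month, match(long$Location, c("Cabo", "Acapulco"))), ]
rownames(long) <- NULL
print(long)
```

The order of the columns inside each varying= element has to match the order of times=; here every pair lists the Cabo column first.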
Alternatively, if we did want to derive varying=, v.names= and times= from names(df_wide), it could be done like this, where names1 is names(df_wide) without the Month column. We use the fact that each location name consists of lower-case letters except for its first letter, and that it either starts or ends each column name.
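names1 <- names(df_wide)[-1]  # every column except Month
# a location is one character followed by lower-case letters, at either end
pat <- "(.[a-z]+)_(.*)|(.*)_(.[a-z]+)"
varying <- split(names1, sub(pat, "\\2\\3", names1))  # group columns by measure
v.names <- names(varying)
locations <- unique(sub(pat, "\\1\\4", names1))  # location part of each name
# convert the tibble to a plain data.frame before calling base reshape()
df_long <- reshape(as.data.frame(df_wide), direction = "long", varying = varying,
  v.names = v.names, times = locations)[-7]
names(df_long)[2] <- "LOCATION"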
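The intermediate values can be checked on a handful of names. The sketch below reuses the same pattern on a shortened, hand-typed vector (an assumption made for brevity, rather than names(df_wide) itself) and prints what ends up in varying, v.names and locations.

```r
# Shortened stand-in for names(df_wide)[-1]
names1 <- c("Cabo_BU_PCT", "Acapulco_BU_PCT", "BED_BUGS_Cabo", "BED_BUGS_Acapulco")
pat <- "(.[a-z]+)_(.*)|(.*)_(.[a-z]+)"

# Groups 2 and 3 hold the measure; only one of them is non-empty per name
measure <- sub(pat, "\\2\\3", names1)

# Groups 1 and 4 hold the location, again one non-empty per name
locations <- unique(sub(pat, "\\1\\4", names1))

varying <- split(names1, measure)  # list of column names, one element per measure
v.names <- names(varying)

print(varying)
print(v.names)
print(locations)
```

Because split() orders its groups by the sorted measure names, taking v.names from names(varying) keeps the new column names aligned with the column groups.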
As spread and gather have been superseded by pivot_wider and pivot_longer, I provide an updated answer based on @alistaire's:
library(tidyverse)
df_wide %>%
pivot_longer(-Month) %>%
mutate(location = sub('.*(Cabo|Acapulco).*', '\\1', name),
name = sub('_?(Cabo|Acapulco)_?', '', name)) %>%
pivot_wider()
# ------ Outputs below ------
# A tibble: 24 × 6
Month location BU_PCT LOS_AVG BED_BUGS TOTAL_OCCUPIED
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Cabo 0.647 5.22 5 19216
2 1 Acapulco 0.626 4.31 3 6498
3 2 Cabo 0.617 5.89 3 17095
4 2 Acapulco 0.678 4.25 0 6566
5 3 Cabo 0.637 5.23 5 19556
6 3 Acapulco 0.635 4.33 1 6809
7 4 Cabo 0.636 5.36 4 18883
8 4 Acapulco 0.655 4.22 6 6797
9 5 Cabo 0.645 5.34 2 19792
10 5 Acapulco 0.641 4.16 5 6875
# … with 14 more rows