Reshape large dataset with multiple columns from wide to long
I have a very large dataset that I need to reshape from wide to long.
My dataset looks like this:
COMPANY PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1 6400 11050 6550 10600 8500 10400 9100 9850
COMPANY A PRODUCT 2 2700 3000 2800 3800 2850 2400 3100 3250
COMPANY B PRODUCT 3 5900 4150 5750 3750 4200 6100 2950 4600
COMPANY B PRODUCT 4 550 600 0 650 200 700 100 500
COMPANY B PRODUCT 5 1500 3750 550 2100 1850 1700 3150 450
COMPANY C PRODUCT 6 19300 17250 23600 21250 18200 26950 18200 23900
I would like it to look like this:
COMPANY PRODUCT DATE REVENUES COSTS
COMPANY A PRODUCT 1 Dec-16 10600 9850
COMPANY A PRODUCT 1 Feb-10 11050 10400
COMPANY A PRODUCT 1 Jan-10 6400 8500
COMPANY A PRODUCT 1 Mar-10 6550 9100
COMPANY A PRODUCT 2 Dec-16 3800 3250
COMPANY A PRODUCT 2 Feb-10 3000 2400
COMPANY A PRODUCT 2 Jan-10 2700 2850
COMPANY A PRODUCT 2 Mar-10 2800 3100
COMPANY B PRODUCT 3 Dec-16 3750 4600
COMPANY B PRODUCT 3 Feb-10 4150 6100
COMPANY B PRODUCT 3 Jan-10 5900 4200
COMPANY B PRODUCT 3 Mar-10 5750 2950
COMPANY B PRODUCT 4 Dec-16 650 500
COMPANY B PRODUCT 4 Feb-10 600 700
COMPANY B PRODUCT 4 Jan-10 550 200
COMPANY B PRODUCT 4 Mar-10 0 100
COMPANY B PRODUCT 5 Dec-16 2100 450
COMPANY B PRODUCT 5 Feb-10 3750 1700
COMPANY B PRODUCT 5 Jan-10 1500 1850
COMPANY B PRODUCT 5 Mar-10 550 3150
COMPANY C PRODUCT 6 Dec-16 21250 23900
COMPANY C PRODUCT 6 Feb-10 17250 26950
COMPANY C PRODUCT 6 Jan-10 19300 18200
COMPANY C PRODUCT 6 Mar-10 23600 18200
In Stata, I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string
How can I do this in R?
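There are several other ways to approach this problem that are more streamlined than the "tidyverse" options already suggested.
All of the examples below use the sample data from JMT2080AD's answer with set.seed(1) (for reproducibility).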
reshape
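It is not always an easy function to use, but once you figure it out, the reshape function is very powerful. In this case, you don't have a sep, which makes things slightly tricky, because you have to be more specific about the resulting variable names and the values that should appear as the "times" (by default, they are just sequence numbers).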
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
# Pass `varying` as a list with one element per stub. A flat vector combined with
# v.names is read in alternating order (x.1, y.1, x.2, y.2, ...), which would mix
# revenue and cost columns here.
reshape(yourData, direction = "long",
        varying = list(grep("^revenues", names(yourData), value = TRUE),
                       grep("^cost", names(yourData), value = TRUE)),
        v.names = c("revenues", "cost"), timevar = "date", times = times)
#             company   product    date revenues cost id
# 1.Jan2010 Company A Product 1 Jan2010     1164 1168  1
# 2.Jan2010 Company A Product 2 Jan2010     1430 1465  2
# 3.Jan2010 Company B Product 3 Jan2010     1932  533  3
# 4.Jan2010 Company B Product 4 Jan2010     2771 1456  4
# 5.Jan2010 Company B Product 5 Jan2010     1004 2674  5
# ....
This is pretty much what you were looking for, perhaps with a slightly different date format.
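If the "Jan-10" style shown in the question is wanted, the labels can be rewritten with a regular expression. This is only a minimal sketch in base R; it assumes every label is a month name followed by a four-digit year, and the vector `labels` is made up for illustration.

```r
# Labels as they come out of the column names (assumed form: month name + 4-digit year)
labels <- c("Jan2010", "Feb2010", "Mar2010", "Dec2016")

# Keep the first three letters of the month and the last two digits of the year
short <- sub("^([A-Za-z]{3})[A-Za-z]*[0-9]{2}([0-9]{2})$", "\\1-\\2", labels)
short
```

The same call could be applied to the `times` vector before it is passed to reshape, so the long table already carries the shortened labels.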
data.table
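If performance is what you are after, you can look at melt from "data.table", with which you should be able to do the following. As with the reshape approach, you need to store the "times" so you can reintroduce the dates after melting the data.
(Note: I know this is very similar to @Uwe's approach.)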
library(data.table)
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
melt(as.data.table(yourData), measure.vars = patterns("revenues", "cost"),
value.name = c("revenues", "cost"))[
, variable := factor(variable, labels = times)][]
# company product variable revenues cost
# 1: Company A Product 1 Jan2010 1164 1168
# 2: Company A Product 2 Jan2010 1430 1465
# 3: Company B Product 3 Jan2010 1932 533
# 4: Company B Product 4 Jan2010 2771 1456
# 5: Company B Product 5 Jan2010 1004 2674
# ---
# 20: Company A Product 2 Apr2010 2444 1883
# 21: Company B Product 3 Apr2010 2837 1824
# 22: Company B Product 4 Apr2010 1030 2473
# 23: Company B Product 5 Apr2010 2129 558
# 24: Company C Product 6 Apr2010 814 1693
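The relabelling step can be looked at in isolation. The sketch below uses plain base R and a hand-made stand-in for the variable column (an assumption for illustration, not actual melt output) to show how factor(..., labels = ...) maps the group numbers back to the stored dates.

```r
# A multi-column melt() numbers the measure groups 1, 2, 3, ... in column order;
# this stand-in mimics such a "variable" column for three months, two rows each
variable <- factor(c(1, 1, 2, 2, 3, 3))
times <- c("Jan2010", "Feb2010", "Mar2010")

# labels are assigned to the existing levels in order: "1" -> "Jan2010", and so on
relabelled <- factor(variable, labels = times)
as.character(relabelled)
```

This only works if `times` is in the same order as the columns in the wide data, which is why it is built from names(yourData) rather than typed by hand.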
merged.stack
My "splitstackshape" package has a function called merged.stack that tries to make this particular kind of reshaping easier. With it, you can try:
library(splitstackshape)
merged.stack(yourData, var.stubs = c("revenues", "cost"), sep = "var.stubs")
# company product .time_1 revenues cost
# 1: Company A Product 1 Apr2010 1450 2457
# 2: Company A Product 1 Feb2010 2862 1705
# 3: Company A Product 1 Jan2010 1164 1168
# 4: Company A Product 1 Mar2010 2218 2486
# 5: Company A Product 2 Apr2010 2444 1883
# 6: Company A Product 2 Feb2010 2152 1999
# 7: Company A Product 2 Jan2010 1430 1465
# 8: Company A Product 2 Mar2010 1460 770
# 9: Company B Product 3 Apr2010 2837 1824
# 10: Company B Product 3 Feb2010 2073 1734
# ...
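Some day I will get around to updating the function, which was written before melt in "data.table" could handle a semi-wide output format. I had come up with a partial solution, but then I stopped tinkering with it.
In fact, using the function linked above, the solution is simply: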
ReshapeLong_(yourData, c("revenues", "cost"))
extract
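The other solutions using the tidyverse seem to go about things in a rather odd way. A better solution is to use extract to pull the data you need into new columns. You first have to gather the data into a very long format, and then spread it back into a wider format.
Here is the approach I would use: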
library(tidyverse)
yourData %>%
gather(var, val, -company, -product) %>%
extract(var, into = c("type", "month", "year"),
regex = ("(revenues|cost)(...)(.*)")) %>%
spread(type, val)
# company product month year cost revenues
# 1 Company A Product 1 Apr 2010 2457 1450
# 2 Company A Product 1 Feb 2010 1705 2862
# 3 Company A Product 1 Jan 2010 1168 1164
# 4 Company A Product 1 Mar 2010 2486 2218
# 5 Company A Product 2 Apr 2010 1883 2444
# 6 Company A Product 2 Feb 2010 1999 2152
# ...
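To see what the regular expression passed to extract captures, the same pattern can be tried on a couple of column names in base R. This is a sketch for illustration only; regexec and regmatches are used here as stand-ins and are not what tidyr calls internally.

```r
nm <- c("revenuesJan2010", "costApr2010")
pattern <- "(revenues|cost)(...)(.*)"

# regexec() returns the full match plus one entry per capture group
parts <- regmatches(nm, regexec(pattern, nm))

# Drop the full match and bind the groups into a matrix with one row per name
pieces <- do.call(rbind, lapply(parts, function(p) p[-1]))
colnames(pieces) <- c("type", "month", "year")
pieces
```

Note that `(...)` takes exactly three characters for the month, so this pattern relies on three-letter month abbreviations in the column names.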
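The tricky part here is that you have the dates packed into the column names. These have to be parsed before you can get the table the way you want it. I loop over each column, parse the date and observation type from the column name of each sub-table, bind the sub-tables together, and then cast on cost/revenue. I am sure there is a more elegant solution out there.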
library(reshape)
## making a table similar to yours here
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
product = paste("Product", 1:6),
revenuesJan2010 = round(runif(6, 500, 3000)),
revenuesFeb2010 = round(runif(6, 500, 3000)),
revenuesMar2010 = round(runif(6, 500, 3000)),
revenuesApr2010 = round(runif(6, 500, 3000)),
costJan2010 = round(runif(6, 500, 3000)),
costFeb2010 = round(runif(6, 500, 3000)),
costMar2010 = round(runif(6, 500, 3000)),
costApr2010 = round(runif(6, 500, 3000)))
## a function that parses the date from the column name
columnParse <- function(tab){
colNm <- names(tab)[3]
names(tab)[3] <- "value"
colDate <- strsplit(colNm, "revenues|cost")[[1]][2]
colDate <- gsub("([A-Za-z]+)", "\\1-", colDate)
tab$date <- colDate
tab$type <- gsub("(revenues|cost).*", "\\1", colNm)
return(tab)
}
## running that function against sub tables of your data, then binding
yourDataLong <- do.call(rbind,
lapply(3:ncol(yourData),
function(x) columnParse(yourData[c(1:2, x)])))
## casting your data on cost/revenue
yourDataCast <- cast(yourDataLong, company+product+date~type, value = "value")
Here is another option using tidyverse and stringr:
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
product = paste("Product", 1:6),
REVENUESJan2010 = round(runif(6, 500, 3000)),
REVENUESFeb2010 = round(runif(6, 500, 3000)),
REVENUESMar2010 = round(runif(6, 500, 3000)),
REVENUESApr2010 = round(runif(6, 500, 3000)),
COSTSJan2010 = round(runif(6, 500, 3000)),
COSTSFeb2010 = round(runif(6, 500, 3000)),
COSTSMar2010 = round(runif(6, 500, 3000)),
COSTSApr2010 = round(runif(6, 500, 3000)))
Solution using tidyverse and stringr:
library(tidyverse)
library(stringr)
newData <- yourData %>%
gather(key = rev.cost.date, value, -company, -product) %>%
mutate(finance.type = ifelse(str_detect(rev.cost.date, fixed("REVENUES")), "REVENUES", "COSTS")) %>%
mutate(date = str_replace(rev.cost.date, "REVENUES|COSTS", "")) %>%
select(-rev.cost.date) %>%
spread(value = value, key = finance.type) %>%
mutate(date = paste0(str_sub(date, 1, 3), "-", str_sub(date, 4, 8)))
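Since version 1.9.6 (on CRAN 19 Sep 2015), data.table can melt multiple columns simultaneously (using the patterns() function). Thus, the columns starting with REVENUES and COSTS can be gathered into two separate columns.
In addition, the dates (months) are packed into the column names without a separator. They are extracted from the column names using a regular expression with a look-behind and are used to replace the factor levels of the DATE column.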
library(data.table)
library(magrittr)
cols <- c("REVENUES", "COSTS")
long <- melt(wide, measure.vars = patterns(cols), value.name = cols, variable.name = "DATE")
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit()
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT      DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1   JAN2010     6400  8500
 2: COMPANY A PRODUCT 2   JAN2010     2700  2850
 3: COMPANY B PRODUCT 3   JAN2010     5900  4200
 4: COMPANY B PRODUCT 4   JAN2010      550   200
 5: COMPANY B PRODUCT 5   JAN2010     1500  1850
 6: COMPANY C PRODUCT 6   JAN2010    19300 18200
 7: COMPANY A PRODUCT 1   FEB2010    11050 10400
 8: COMPANY A PRODUCT 2   FEB2010     3000  2400
 9: COMPANY B PRODUCT 3   FEB2010     4150  6100
10: COMPANY B PRODUCT 4   FEB2010      600   700
11: COMPANY B PRODUCT 5   FEB2010     3750  1700
12: COMPANY C PRODUCT 6   FEB2010    17250 26950
13: COMPANY A PRODUCT 1 MARCH2010     6550  9100
14: COMPANY A PRODUCT 2 MARCH2010     2800  3100
15: COMPANY B PRODUCT 3 MARCH2010     5750  2950
16: COMPANY B PRODUCT 4 MARCH2010        0   100
17: COMPANY B PRODUCT 5 MARCH2010      550  3150
18: COMPANY C PRODUCT 6 MARCH2010    23600 18200
19: COMPANY A PRODUCT 1   DEC2016    10600  9850
20: COMPANY A PRODUCT 2   DEC2016     3800  3250
21: COMPANY B PRODUCT 3   DEC2016     3750  4600
22: COMPANY B PRODUCT 4   DEC2016      650   500
23: COMPANY B PRODUCT 5   DEC2016     2100   450
24: COMPANY C PRODUCT 6   DEC2016    21250 23900
      COMPANY   PRODUCT      DATE REVENUES COSTS
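The look-behind extraction can also be tried on its own. The following is a base R sketch of the same idea, using regexpr with perl = TRUE in place of stringr::str_extract; the vector `nms` is a shortened, made-up set of column names.

```r
nms <- c("COMPANY", "PRODUCT", "REVENUESJAN2010", "REVENUESMARCH2010", "COSTSJAN2010")

# perl = TRUE is needed for the look-behind (?<=REVENUES);
# regmatches() keeps only the elements that actually matched
m <- regexpr("(?<=REVENUES)\\w*$", nms, perl = TRUE)
months <- regmatches(nms, m)
months
```

Only the REVENUES columns contribute, so each month appears once, in the order of the columns in the wide table.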
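The naming scheme using alphabetic month names and the year does not allow the data to be sorted properly by DATE. DEC2016 comes before FEB2010, which in turn comes before JAN2010. The ISO 8601 naming convention puts the year first, followed by the number of the month.
We can use this naming scheme as follows: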
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() %>%
paste0("01", .) %>% lubridate::dmy() %>% format("%Y-%m")
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT    DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1 2010-01     6400  8500
 2: COMPANY A PRODUCT 2 2010-01     2700  2850
 3: COMPANY B PRODUCT 3 2010-01     5900  4200
 4: COMPANY B PRODUCT 4 2010-01      550   200
 5: COMPANY B PRODUCT 5 2010-01     1500  1850
 6: COMPANY C PRODUCT 6 2010-01    19300 18200
 7: COMPANY A PRODUCT 1 2010-02    11050 10400
 8: COMPANY A PRODUCT 2 2010-02     3000  2400
 9: COMPANY B PRODUCT 3 2010-02     4150  6100
10: COMPANY B PRODUCT 4 2010-02      600   700
11: COMPANY B PRODUCT 5 2010-02     3750  1700
12: COMPANY C PRODUCT 6 2010-02    17250 26950
13: COMPANY A PRODUCT 1 2010-03     6550  9100
14: COMPANY A PRODUCT 2 2010-03     2800  3100
15: COMPANY B PRODUCT 3 2010-03     5750  2950
16: COMPANY B PRODUCT 4 2010-03        0   100
17: COMPANY B PRODUCT 5 2010-03      550  3150
18: COMPANY C PRODUCT 6 2010-03    23600 18200
19: COMPANY A PRODUCT 1 2016-12    10600  9850
20: COMPANY A PRODUCT 2 2016-12     3800  3250
21: COMPANY B PRODUCT 3 2016-12     3750  4600
22: COMPANY B PRODUCT 4 2016-12      650   500
23: COMPANY B PRODUCT 5 2016-12     2100   450
24: COMPANY C PRODUCT 6 2016-12    21250 23900
      COMPANY   PRODUCT    DATE REVENUES COSTS
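The sorting problem and the year-first fix can be reproduced without any packages. This is a sketch of an alternative, not the lubridate-based code above: it assumes English month names in upper case and looks them up in base R's month.abb constant, so it does not depend on the session locale.

```r
labels <- c("JAN2010", "FEB2010", "MARCH2010", "DEC2016")

# Alphabetical sorting ignores chronology
sort(labels)

# Match the first three letters against "JAN", ..., "DEC" to get the month number
month_num <- match(substr(labels, 1, 3), toupper(month.abb))
year <- sub("^[A-Z]+", "", labels)

# Year first, then a zero-padded month: plain string sorting is now chronological
iso <- sprintf("%s-%02d", year, month_num)
sort(iso)
```

Because the year comes first and the month is zero-padded, the character ordering and the calendar ordering coincide.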
library(data.table)
wide <- data.table(
readr::read_table(
" COMPANY PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 COSTSDEC2016
COMPANY A PRODUCT 1 6400 11050 6550 10600 8500 10400 9100 9850
COMPANY A PRODUCT 2 2700 3000 2800 3800 2850 2400 3100 3250
COMPANY B PRODUCT 3 5900 4150 5750 3750 4200 6100 2950 4600
COMPANY B PRODUCT 4 550 600 0 650 200 700 100 500
COMPANY B PRODUCT 5 1500 3750 550 2100 1850 1700 3150 450
COMPANY C PRODUCT 6 19300 17250 23600 21250 18200 26950 18200 23900"
))
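I think the most explicit way (i.e. one that does not require renaming variables) to go from wide to long in R is to use the base R reshape() function and to specify the different columns to be "stacked" as a list. See this blog post.
I will use the data from JMT2080AD's answer and set the seed with set.seed(789).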
### Create a list of the variables you want to reshape/stack
reshape.vars <- list(c("revenuesJan2010", "revenuesFeb2010", "revenuesMar2010", "revenuesApr2010"), # revenues
c("costJan2010", "costFeb2010", "costMar2010", "costApr2010")) # cost
### reshape wide to long
reshape(yourData, #dataframe
direction="long", #wide to long
varying=reshape.vars, #repeated measures list of indexes for vars to stack/reshape
timevar="date", #the repeated measures times
v.names=c("revenues", "cost")) #the repeated measures names
# company product date revenues cost id
# 1.1 Company A Product 1 1 2250 1574 1
# 2.1 Company A Product 2 1 734 1793 2
# 3.1 Company B Product 3 1 530 1282 3
# 4.1 Company B Product 4 1 1979 1741 4
# 5.1 Company B Product 5 1 1730 2558 5
# 6.1 Company C Product 6 1 550 1757 6
# 1.2 Company A Product 1 2 1932 1048 1
#...
# 5.3 Company B Product 5 3 890 1103 5
# 6.3 Company C Product 6 3 2113 2469 6
# 1.4 Company A Product 1 4 2426 2382 1
# 2.4 Company A Product 2 4 778 2995 2
# 3.4 Company B Product 3 4 1359 989 3
# 4.4 Company B Product 4 4 1618 912 4
# 5.4 Company B Product 5 4 895 2109 5
# 6.4 Company C Product 6 4 1258 2803 6
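With the list approach, reshape() cannot guess wrongly which variables should be stacked together. I have found that even with more than 100 variables to reshape, when renaming them would be cumbersome, using copy/paste to build the list of varying variables does not take that long.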
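When the columns share a common prefix, the list of varying columns could also be built programmatically instead of by copy/paste. The following is a minimal sketch under that assumption; the data frame `wide` and its values are made up for illustration, and the column names follow the pattern of the sample data.

```r
# Small made-up data frame in the same shape as the example
wide <- data.frame(
  company = c("Company A", "Company B"),
  product = c("Product 1", "Product 2"),
  revenuesJan2010 = c(10, 20),
  revenuesFeb2010 = c(11, 21),
  costJan2010 = c(5, 6),
  costFeb2010 = c(7, 8)
)

stubs <- c("revenues", "cost")

# One character vector of column names per stub, in column order
varying <- lapply(stubs, function(s) grep(paste0("^", s), names(wide), value = TRUE))

# Time labels taken from the first stub's column names
times <- sub(paste0("^", stubs[1]), "", varying[[1]])

long <- reshape(wide, direction = "long",
                varying = varying, v.names = stubs,
                timevar = "date", times = times,
                idvar = c("company", "product"))
rownames(long) <- NULL
long
```

This relies on every stub having its columns in the same date order; if that is not guaranteed, the explicit hand-written list is the safer choice.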
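As a Stata-to-R convert who liked reshape in Stata, I find tidyr::gather and tidyr::spread very intuitive. gather is basically reshape long, and spread is reshape wide.
Here is the code to change your data into the form you want: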
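library(tidyr)  # provides gather(), separate(), spread() and re-exports %>%

new_data <-
  gather(data = your_data_frame,  # placeholder: the name of your wide data frame
         key = var_holder,
         value = val_holder,
         -COMPANY,
         -PRODUCT)
# insert a separator between the stub and the date so separate() can split on it
new_data$var_holder <- sub("REVENUES", "REVENUES_", new_data$var_holder)
new_data$var_holder <- sub("COSTS", "COSTS_", new_data$var_holder)
new_data <-
  separate(data = new_data,
           col = var_holder,
           into = c("var", "date"),
           sep = "_") %>%
  spread(key = var,
         value = val_holder)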
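And done!
gather works by taking all of the specified variable names (or, as here, all those not excluded; note the two variables preceded by a "-" sign) and placing them under a new variable whose name is given by "key = ..." (creating new rows as it goes). It then takes the values belonging to those variables and places them under a single variable whose name is given by "value = ...".
spread works in the opposite direction. Hope this helps!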
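For readers who want to see the key/value idea without tidyr, base R's stack() does something similar. This is only a rough analogue for illustration, not how gather is implemented, and the data frame `wide` below is made up.

```r
wide <- data.frame(
  company = c("Company A", "Company B"),
  REVENUESJAN2010 = c(10, 20),
  COSTSJAN2010 = c(5, 6)
)

# Everything except the id column is gathered (the "-company" part of gather)
measure <- setdiff(names(wide), "company")

# stack() returns the values in one column and the source column name in another
stacked <- stack(wide[measure])

long <- data.frame(
  company = rep(wide$company, times = length(measure)),
  var_holder = as.character(stacked$ind),  # plays the role of key
  val_holder = stacked$values              # plays the role of value
)
long
```

Each original column name becomes a value in var_holder, and the id column is repeated once per gathered column.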
Using pivot_longer from the development version of tidyr (version '0.8.3.9000'):
library(dplyr)
library(tidyr)
library(stringr)
library(zoo)
library(readr)
df1 %>%
rename_at(3:ncol(.), ~ str_replace(., "^(REVENUES|COSTS)", "\\1_")) %>%
pivot_longer(c(-COMPANY, -PRODUCT), names_to = c(".value", "DATE"), names_sep = "_") %>%
mutate(DATE = format(as.yearmon(DATE), "%b-%Y"))
# A tibble: 24 x 5
# COMPANY PRODUCT DATE REVENUES COSTS
# <chr> <chr> <chr> <dbl> <dbl>
# 1 COMPANY A PRODUCT 1 Jan-2010 6400 8500
# 2 COMPANY A PRODUCT 1 Feb-2010 11050 10400
# 3 COMPANY A PRODUCT 1 Mar-2010 6550 9100
# 4 COMPANY A PRODUCT 1 Dec-2016 10600 9850
# 5 COMPANY A PRODUCT 2 Jan-2010 2700 2850
# 6 COMPANY A PRODUCT 2 Feb-2010 3000 2400
# 7 COMPANY A PRODUCT 2 Mar-2010 2800 3100
# 8 COMPANY A PRODUCT 2 Dec-2016 3800 3250
# 9 COMPANY B PRODUCT 3 Jan-2010 5900 4200
#10 COMPANY B PRODUCT 3 Feb-2010 4150 6100
# … with 14 more rows
df1 <- structure(list(COMPANY = c("COMPANY A", "COMPANY A", "COMPANY B",
"COMPANY B", "COMPANY B", "COMPANY C"), PRODUCT = c("PRODUCT 1",
"PRODUCT 2", "PRODUCT 3", "PRODUCT 4", "PRODUCT 5", "PRODUCT 6"
), REVENUESJAN2010 = c(6400, 2700, 5900, 550, 1500, 19300), REVENUESFEB2010 = c(11050,
3000, 4150, 600, 3750, 17250), REVENUESMARCH2010 = c(6550, 2800,
5750, 0, 550, 23600), REVENUESDEC2016 = c(10600, 3800, 3750,
650, 2100, 21250), COSTSJAN2010 = c(8500, 2850, 4200, 200, 1850,
18200), COSTSFEB2010 = c(10400, 2400, 6100, 700, 1700, 26950),
COSTSMARCH2010 = c(9100, 3100, 2950, 100, 3150, 18200), COSTSDEC2016 = c(9850,
3250, 4600, 500, 450, 23900)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(COMPANY = structure(list(), class = c("collector_character",
"collector")), PRODUCT = structure(list(), class = c("collector_character",
"collector")), REVENUESJAN2010 = structure(list(), class = c("collector_double",
"collector")), REVENUESFEB2010 = structure(list(), class = c("collector_double",
"collector")), REVENUESMARCH2010 = structure(list(), class = c("collector_double",
"collector")), REVENUESDEC2016 = structure(list(), class = c("collector_double",
"collector")), COSTSJAN2010 = structure(list(), class = c("collector_double",
"collector")), COSTSFEB2010 = structure(list(), class = c("collector_double",
"collector")), COSTSMARCH2010 = structure(list(), class = c("collector_double",
"collector")), COSTSDEC2016 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))