
Reshape large dataset with multiple columns from wide to long

I have a very large dataset that I need to reshape from wide to long.

My dataset looks like:

  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900

I would like it to look like:

  COMPANY    PRODUCT    DATE  REVENUES  COSTS
COMPANY A  PRODUCT 1  Dec-16     10600   9850
COMPANY A  PRODUCT 1  Feb-10     11050  10400
COMPANY A  PRODUCT 1  Jan-10      6400   8500
COMPANY A  PRODUCT 1  Mar-10      6550   9100
COMPANY A  PRODUCT 2  Dec-16      3800   3250
COMPANY A  PRODUCT 2  Feb-10      3000   2400
COMPANY A  PRODUCT 2  Jan-10      2700   2850
COMPANY A  PRODUCT 2  Mar-10      2800   3100
COMPANY B  PRODUCT 3  Dec-16      3750   4600
COMPANY B  PRODUCT 3  Feb-10      4150   6100
COMPANY B  PRODUCT 3  Jan-10      5900   4200
COMPANY B  PRODUCT 3  Mar-10      5750   2950
COMPANY B  PRODUCT 4  Dec-16       650    500
COMPANY B  PRODUCT 4  Feb-10       600    700
COMPANY B  PRODUCT 4  Jan-10       550    200
COMPANY B  PRODUCT 4  Mar-10         0    100
COMPANY B  PRODUCT 5  Dec-16      2100    450
COMPANY B  PRODUCT 5  Feb-10      3750   1700
COMPANY B  PRODUCT 5  Jan-10      1500   1850
COMPANY B  PRODUCT 5  Mar-10       550   3150
COMPANY C  PRODUCT 6  Dec-16     21250  23900
COMPANY C  PRODUCT 6  Feb-10     17250  26950
COMPANY C  PRODUCT 6  Jan-10     19300  18200
COMPANY C  PRODUCT 6  Mar-10     23600  18200

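In Stata, I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string

How can I do this in R?

There are a few other ways to approach this problem that are more streamlined than the "tidyverse" options already suggested.

All of the examples below use the sample data from JMT2080AD's answer, with set.seed(1) (for reproducibility).

Option 1: Base R's reshape

It is not always the easiest function to use, but once you figure it out, reshape is very powerful. In this case, you don't have a sep, which makes things a little trickier, because you have to be more specific about the resulting variable names and about the values that should appear as the "times" (by default, they are just sequential numbers).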

times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
# Pass `varying` as a list with one vector of column names per output column.
# With a plain vector plus `v.names`, reshape() assumes the columns alternate
# (revenues1, cost1, revenues2, cost2, ...) and would mix up the two groups.
reshape(yourData, direction = "long", 
        varying = list(grep("^revenues", names(yourData), value = TRUE),
                       grep("^cost", names(yourData), value = TRUE)),
        v.names = c("revenues", "cost"), timevar = "date", times = times)
#             company   product    date revenues cost id
# 1.Jan2010 Company A Product 1 Jan2010     1164 1168  1
# 2.Jan2010 Company A Product 2 Jan2010     1430 1465  2
# 3.Jan2010 Company B Product 3 Jan2010     1932  533  3
# 4.Jan2010 Company B Product 4 Jan2010     2771 1456  4
# 5.Jan2010 Company B Product 5 Jan2010     1004 2674  5
# ....

This is pretty much what you are looking for, perhaps with a slightly different date format.
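If the date labels should match the "Jan-10" style shown in the question, a small base R helper along the following lines might do it. This is only a sketch: short_date is a hypothetical name that does not appear in the answer above, and it assumes every label is a month name followed by a four-digit year.

```r
# Hypothetical helper (an assumption, not part of the original answer):
# keep the first three letters of the month and the last two digits of the year.
short_date <- function(x) {
  sub("^([A-Za-z]{3})[A-Za-z]*[0-9]{2}([0-9]{2})$", "\\1-\\2", x)
}

short_date(c("Jan2010", "Feb2010", "MARCH2010", "Dec2016"))
# [1] "Jan-10" "Feb-10" "MAR-10" "Dec-16"
```

Because it works purely on the text of the label, the result does not depend on the session locale; the case of the month is left as it was in the column name.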

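Option 2: data.table

If performance is what you are after, you can look at melt from "data.table", with which you should be able to do the following. As with the reshape approach, you need to store the "times" so that you can reintroduce the dates after melting the data.

(Note: I am aware that this is very similar to @Uwe's approach.)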

library(data.table)
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
melt(as.data.table(yourData), measure.vars = patterns("revenues", "cost"),
     value.name = c("revenues", "cost"))[
       , variable := factor(variable, labels = times)][]
#       company   product variable revenues cost
#  1: Company A Product 1  Jan2010     1164 1168
#  2: Company A Product 2  Jan2010     1430 1465
#  3: Company B Product 3  Jan2010     1932  533
#  4: Company B Product 4  Jan2010     2771 1456
#  5: Company B Product 5  Jan2010     1004 2674
# ---                                           
# 20: Company A Product 2  Apr2010     2444 1883
# 21: Company B Product 3  Apr2010     2837 1824
# 22: Company B Product 4  Apr2010     1030 2473
# 23: Company B Product 5  Apr2010     2129  558
# 24: Company C Product 6  Apr2010      814 1693

Option 3: merged.stack

My "splitstackshape" package has a function called merged.stack that tries to make this particular kind of reshaping easier. With it, you can try:

library(splitstackshape)
merged.stack(yourData, var.stubs = c("revenues", "cost"), sep = "var.stubs")
#       company   product .time_1 revenues cost
#  1: Company A Product 1 Apr2010     1450 2457
#  2: Company A Product 1 Feb2010     2862 1705
#  3: Company A Product 1 Jan2010     1164 1168
#  4: Company A Product 1 Mar2010     2218 2486
#  5: Company A Product 2 Apr2010     2444 1883
#  6: Company A Product 2 Feb2010     2152 1999
#  7: Company A Product 2 Jan2010     1430 1465
#  8: Company A Product 2 Mar2010     1460  770
#  9: Company B Product 3 Apr2010     2837 1824
# 10: Company B Product 3 Feb2010     2073 1734
# ... 

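One day, I'll get around to updating the function, which was written before melt in "data.table" could handle a semi-wide output format. I had already come up with a partial solution, but then I stopped tinkering with it.

In fact, with the function linked above, the solution is simply: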

ReshapeLong_(yourData, c("revenues", "cost"))

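Option 4: extract from the "tidyverse"

The other solutions using the tidyverse seem to go about things in a rather strange way. A better solution is to use extract to get the data you need into new columns. You first have to gather the data into a very long format, and then spread the data back into a wide format.

Here is the approach I would use: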

library(tidyverse)
yourData %>% 
  gather(var, val, -company, -product) %>%
  extract(var, into = c("type", "month", "year"), 
          regex = ("(revenues|cost)(...)(.*)")) %>%
  spread(type, val)
#      company   product month year cost revenues
# 1  Company A Product 1   Apr 2010 2457     1450
# 2  Company A Product 1   Feb 2010 1705     2862
# 3  Company A Product 1   Jan 2010 1168     1164
# 4  Company A Product 1   Mar 2010 2486     2218
# 5  Company A Product 2   Apr 2010 1883     2444
# 6  Company A Product 2   Feb 2010 1999     2152
# ...
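To see what the three capture groups in that regular expression pick out, the same pattern can be tried on a couple of column names with base R alone. This is just an illustrative sketch; the variable names cols, matches and parts are made up for the example.

```r
# Two column names in the style of the sample data
cols <- c("revenuesJan2010", "costApr2010")

# regexec() returns the full match followed by each capture group
matches <- regmatches(cols, regexec("(revenues|cost)(...)(.*)", cols))

# Drop the full match and keep the three groups as columns
parts <- do.call(rbind, lapply(matches, function(m) m[-1]))
colnames(parts) <- c("type", "month", "year")
parts
#      type       month year  
# [1,] "revenues" "Jan" "2010"
# [2,] "cost"     "Apr" "2010"
```

The middle group `(...)` matches exactly three characters, so this pattern assumes three-letter month abbreviations; a name such as "MARCH2010" would be split as "MAR" and "CH2010".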

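The tricky thing here is that your dates are packed into the column names. Those have to be parsed before you can shape the table the way you want. I loop over each column, parse the date and the observation type from each sub-table's column name, bind the sub-tables together, and then cast on cost/revenue. I am sure there is a more elegant solution out there.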

library(reshape)

## making a table similar to yours here
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                       product = paste("Product", 1:6),
                       revenuesJan2010 = round(runif(6, 500, 3000)),
                       revenuesFeb2010 = round(runif(6, 500, 3000)),
                       revenuesMar2010 = round(runif(6, 500, 3000)),
                       revenuesApr2010 = round(runif(6, 500, 3000)),
                       costJan2010 = round(runif(6, 500, 3000)),
                       costFeb2010 = round(runif(6, 500, 3000)),
                       costMar2010 = round(runif(6, 500, 3000)),
                       costApr2010 = round(runif(6, 500, 3000)))

## a function that parses the date from the column name
columnParse <- function(tab){
    colNm   <- names(tab)[3]
    names(tab)[3] <- "value"
    colDate  <- strsplit(colNm, "revenues|cost")[[1]][2]
    colDate  <- gsub("([A-Za-z]+)", "\\1-", colDate)
    tab$date <- colDate
    tab$type <- gsub("(revenues|cost).*", "\\1", colNm)
    return(tab)
}

## running that function against sub tables of your data, then binding
yourDataLong <- do.call(rbind,
                        lapply(3:ncol(yourData),
                               function(x) columnParse(yourData[c(1:2, x)])))

## casting your data on cost/revenue
yourDataCast <- cast(yourDataLong, company+product+date~type, value = "value")

Here is another option using tidyverse and stringr:

yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                   product = paste("Product", 1:6),
                   REVENUESJan2010 = round(runif(6, 500, 3000)),
                   REVENUESFeb2010 = round(runif(6, 500, 3000)),
                   REVENUESMar2010 = round(runif(6, 500, 3000)),
                   REVENUESApr2010 = round(runif(6, 500, 3000)),
                   COSTSJan2010 = round(runif(6, 500, 3000)),
                   COSTSFeb2010 = round(runif(6, 500, 3000)),
                   COSTSMar2010 = round(runif(6, 500, 3000)),
                   COSTSApr2010 = round(runif(6, 500, 3000)))

Solution using tidyverse and stringr:

library(tidyverse)
library(stringr)

newData <- yourData %>%
   gather(key = rev.cost.date, value, -company, -product) %>%
   mutate(finance.type = ifelse(str_detect(rev.cost.date, fixed("REVENUES")), "REVENUES", "COSTS")) %>%
   mutate(date = str_replace(rev.cost.date, "REVENUES|COSTS", "")) %>%
   select(-rev.cost.date) %>%
   spread(value = value, key = finance.type) %>%
   mutate(date = paste0(str_sub(date, 1, 3), "-", str_sub(date, 4, 8)))

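As of version 1.9.6 (on CRAN 19 Sep 2015), data.table can melt multiple columns simultaneously (using the patterns() function). So, the columns starting with REVENUES and COSTS can be gathered into two separate columns.

In addition, the dates (months) are packed into the column names without a separator. They are extracted from the column names using a regular expression with a look-behind and are used to replace the factor levels of the DATE column.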

library(data.table)
library(magrittr)
cols <- c("REVENUES", "COSTS")
long <- melt(wide, measure.vars = patterns(cols), value.name = cols, variable.name = "DATE")
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() 
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT      DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1   JAN2010     6400  8500
 2: COMPANY A PRODUCT 2   JAN2010     2700  2850
 3: COMPANY B PRODUCT 3   JAN2010     5900  4200
 4: COMPANY B PRODUCT 4   JAN2010      550   200
 5: COMPANY B PRODUCT 5   JAN2010     1500  1850
 6: COMPANY C PRODUCT 6   JAN2010    19300 18200
 7: COMPANY A PRODUCT 1   FEB2010    11050 10400
 8: COMPANY A PRODUCT 2   FEB2010     3000  2400
 9: COMPANY B PRODUCT 3   FEB2010     4150  6100
10: COMPANY B PRODUCT 4   FEB2010      600   700
11: COMPANY B PRODUCT 5   FEB2010     3750  1700
12: COMPANY C PRODUCT 6   FEB2010    17250 26950
13: COMPANY A PRODUCT 1 MARCH2010     6550  9100
14: COMPANY A PRODUCT 2 MARCH2010     2800  3100
15: COMPANY B PRODUCT 3 MARCH2010     5750  2950
16: COMPANY B PRODUCT 4 MARCH2010        0   100
17: COMPANY B PRODUCT 5 MARCH2010      550  3150
18: COMPANY C PRODUCT 6 MARCH2010    23600 18200
19: COMPANY A PRODUCT 1   DEC2016    10600  9850
20: COMPANY A PRODUCT 2   DEC2016     3800  3250
21: COMPANY B PRODUCT 3   DEC2016     3750  4600
22: COMPANY B PRODUCT 4   DEC2016      650   500
23: COMPANY B PRODUCT 5   DEC2016     2100   450
24: COMPANY C PRODUCT 6   DEC2016    21250 23900
      COMPANY   PRODUCT      DATE REVENUES COSTS
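The look-behind step can also be tried in isolation with base R, which may help when checking the pattern against your own column names. The following is a sketch under the assumption that the names follow the question's layout; col_names and month_labels are names chosen for this example only.

```r
# A few column names in the style of the question
col_names <- c("COMPANY", "PRODUCT", "REVENUESJAN2010",
               "REVENUESMARCH2010", "COSTSJAN2010")

# perl = TRUE is needed for the (?<=...) look-behind; the pattern matches
# whatever follows "REVENUES" up to the end of the name
hits <- regexpr("(?<=REVENUES)\\w*$", col_names, perl = TRUE)

# regmatches() keeps only the names where the pattern matched
month_labels <- regmatches(col_names, hits)
month_labels
# [1] "JAN2010"   "MARCH2010"
```

Only the REVENUES columns are used to derive the labels, so this relies on the COSTS columns covering the same months in the same order.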

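Edit: Using the ISO month naming scheme for proper ordering

A naming scheme that uses alphabetical month names followed by the year does not allow the data to be sorted properly by DATE: DEC2016 comes before FEB2010, and FEB2010 comes before JAN2010. The ISO 8601 naming convention puts the year first, followed by the number of the month.

We can use this naming scheme as follows: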

months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() %>%
  paste0("01", .) %>% lubridate::dmy() %>% format("%Y-%m")
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT    DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1 2010-01     6400  8500
 2: COMPANY A PRODUCT 2 2010-01     2700  2850
 3: COMPANY B PRODUCT 3 2010-01     5900  4200
 4: COMPANY B PRODUCT 4 2010-01      550   200
 5: COMPANY B PRODUCT 5 2010-01     1500  1850
 6: COMPANY C PRODUCT 6 2010-01    19300 18200
 7: COMPANY A PRODUCT 1 2010-02    11050 10400
 8: COMPANY A PRODUCT 2 2010-02     3000  2400
 9: COMPANY B PRODUCT 3 2010-02     4150  6100
10: COMPANY B PRODUCT 4 2010-02      600   700
11: COMPANY B PRODUCT 5 2010-02     3750  1700
12: COMPANY C PRODUCT 6 2010-02    17250 26950
13: COMPANY A PRODUCT 1 2010-03     6550  9100
14: COMPANY A PRODUCT 2 2010-03     2800  3100
15: COMPANY B PRODUCT 3 2010-03     5750  2950
16: COMPANY B PRODUCT 4 2010-03        0   100
17: COMPANY B PRODUCT 5 2010-03      550  3150
18: COMPANY C PRODUCT 6 2010-03    23600 18200
19: COMPANY A PRODUCT 1 2016-12    10600  9850
20: COMPANY A PRODUCT 2 2016-12     3800  3250
21: COMPANY B PRODUCT 3 2016-12     3750  4600
22: COMPANY B PRODUCT 4 2016-12      650   500
23: COMPANY B PRODUCT 5 2016-12     2100   450
24: COMPANY C PRODUCT 6 2016-12    21250 23900
      COMPANY   PRODUCT    DATE REVENUES COSTS
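The ordering issue is easy to check on the four labels from the question. The snippet below is a base R sketch that avoids extra packages; it assumes upper-case English month names and uses the built-in month.abb constant to look up the month number (labels, mon, yr and iso are names invented for the example).

```r
labels <- c("JAN2010", "FEB2010", "MARCH2010", "DEC2016")

# Alphabetical sorting puts December 2016 first
sort(labels)
# [1] "DEC2016"   "FEB2010"   "JAN2010"   "MARCH2010"

# Month number from the first three letters, year from the trailing digits
mon <- match(substr(labels, 1, 3), toupper(month.abb))
yr  <- sub("^[A-Z]+", "", labels)
iso <- sprintf("%s-%02d", yr, mon)

# Year-month labels sort chronologically as plain text
sort(iso)
# [1] "2010-01" "2010-02" "2010-03" "2016-12"
```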

Data

library(data.table)
wide <- data.table(
readr::read_table(
"  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010     REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010     COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900"
))

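I think the most explicit way to go from wide to long in R (i.e., one that does not require renaming variables) is to use the base R reshape() function and to specify the different columns to "stack" as a list. See the blog post.

I will use the data from JMT2080AD's answer and set the seed with set.seed(789)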

### Create a list of the variables you want to reshape/stack
reshape.vars <- list(c("revenuesJan2010",   "revenuesFeb2010",  "revenuesMar2010",  "revenuesApr2010"), # revenues
                     c("costJan2010",   "costFeb2010",  "costMar2010",  "costApr2010")) # cost 
### reshape wide to long
reshape(yourData,                      #dataframe
        direction="long",             #wide to long
        varying=reshape.vars, #repeated measures list of indexes for vars to stack/reshape
        timevar="date",              #the repeated measures times
        v.names=c("revenues", "cost")) #the repeated measures names

#     company   product date   revenues cost id
# 1.1 Company A Product 1    1     2250 1574  1
# 2.1 Company A Product 2    1      734 1793  2
# 3.1 Company B Product 3    1      530 1282  3
# 4.1 Company B Product 4    1     1979 1741  4
# 5.1 Company B Product 5    1     1730 2558  5
# 6.1 Company C Product 6    1      550 1757  6
# 1.2 Company A Product 1    2     1932 1048  1
#...
# 5.3 Company B Product 5    3      890 1103  5
# 6.3 Company C Product 6    3     2113 2469  6
# 1.4 Company A Product 1    4     2426 2382  1
# 2.4 Company A Product 2    4      778 2995  2
# 3.4 Company B Product 3    4     1359  989  3
# 4.4 Company B Product 4    4     1618  912  4
# 5.4 Company B Product 5    4      895 2109  5
# 6.4 Company C Product 6    4     1258 2803  6

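With the list approach

  • You do not have to rename variables
  • Because the variables you want to create are explicitly defined in the list, there is no risk of reshape() wrongly inferring which variables should be stacked

I have found that even with 100+ variables to reshape, where renaming them could be cumbersome, it does not take that long to create the list of varying variables using copy/paste.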
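As an alternative to copy/paste, the list could also be built from the column names. The following is a minimal sketch on a made-up four-month-free toy table (toy, stubs and the values in it are assumptions for illustration, not the data used above); it assumes each stub's columns appear in the same time order.

```r
# Small stand-in for the wide data (values are invented)
toy <- data.frame(
  company = c("Company A", "Company B"),
  product = c("Product 1", "Product 2"),
  revenuesJan2010 = c(100, 200),
  revenuesFeb2010 = c(110, 210),
  costJan2010 = c(50, 60),
  costFeb2010 = c(55, 65)
)

stubs <- c("revenues", "cost")

# One character vector of column names per stub, in column order
reshape.vars <- lapply(stubs, function(s) {
  grep(paste0("^", s), names(toy), value = TRUE)
})

# Date labels taken from the first stub's column names
times <- sub("^revenues", "", reshape.vars[[1]])

long <- reshape(toy, direction = "long", varying = reshape.vars,
                v.names = stubs, timevar = "date", times = times)
long
```

Here long has four rows (two products times two months), with the revenues and cost columns filled from the matching wide columns and date holding "Jan2010" or "Feb2010".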

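As a Stata-to-R convert who loved reshape in Stata, I find tidyr::gather and tidyr::spread very intuitive. Gather is basically reshape long, and spread is reshape wide.

Here is the code to change your data into the form you want: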

library(tidyr)

new_data <- 
gather(data = yourData,   # replace with the name of your data frame
       key = var_holder,
       value = val_holder,
       -company,
       -product) 

# put "_" between the measure name and the date so separate() can split on it
new_data$var_holder <- sub("^REVENUES", "REVENUES_", new_data$var_holder)
new_data$var_holder <- sub("^COSTS", "COSTS_", new_data$var_holder)

new_data <- 
    separate(data = new_data,
             col = var_holder,
             into = c("var", "date"),
             sep = "_") %>%
    spread(key = var,
           value = val_holder)

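And done!

Gather works by taking all of the variable names you specify (or, as here, do not specify; note the two variables preceded by a "-" sign) and putting them under a new variable whose name is given by "key = ..." (creating new rows as it goes). It then takes the values that belong to those variables and puts them under a single variable whose name is given by "value = ...".

Spread works in the opposite direction. Hope this helps!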
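The round trip can be seen on a tiny table. This is an illustrative sketch with invented data (toy and its columns are assumptions), using only gather and spread from tidyr as described above.

```r
library(tidyr)

# Invented two-row table with one id column and two measured columns
toy <- data.frame(id = 1:2, a = c(10, 20), b = c(30, 40))

# gather: column names go into "variable", their values into "value"
long <- gather(toy, key = "variable", value = "value", -id)
long
#   id variable value
# 1  1        a    10
# 2  2        a    20
# 3  1        b    30
# 4  2        b    40

# spread: the reverse step, one column per distinct value of "variable"
wide <- spread(long, key = variable, value = value)
wide
#   id  a  b
# 1  1 10 30
# 2  2 20 40
```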

Using the development version of tidyr (version '0.8.3.9000')

library(dplyr)
library(tidyr)
library(stringr)
library(zoo)
library(readr)

df1 %>% 
   rename_at(3:ncol(.), ~ str_replace(., "^(REVENUES|COSTS)", "\\1_")) %>%
   pivot_longer(c(-COMPANY, -PRODUCT), names_to = c(".value", "DATE"), names_sep = "_") %>% 
   mutate(DATE = format(as.yearmon(DATE), "%b-%Y"))
# A tibble: 24 x 5
#   COMPANY   PRODUCT   DATE     REVENUES COSTS
#   <chr>     <chr>     <chr>       <dbl> <dbl>
# 1 COMPANY A PRODUCT 1 Jan-2010     6400  8500
# 2 COMPANY A PRODUCT 1 Feb-2010    11050 10400
# 3 COMPANY A PRODUCT 1 Mar-2010     6550  9100
# 4 COMPANY A PRODUCT 1 Dec-2016    10600  9850
# 5 COMPANY A PRODUCT 2 Jan-2010     2700  2850
# 6 COMPANY A PRODUCT 2 Feb-2010     3000  2400
# 7 COMPANY A PRODUCT 2 Mar-2010     2800  3100
# 8 COMPANY A PRODUCT 2 Dec-2016     3800  3250
# 9 COMPANY B PRODUCT 3 Jan-2010     5900  4200
#10 COMPANY B PRODUCT 3 Feb-2010     4150  6100
# … with 14 more rows

Data

df1 <- structure(list(COMPANY = c("COMPANY A", "COMPANY A", "COMPANY B", 
"COMPANY B", "COMPANY B", "COMPANY C"), PRODUCT = c("PRODUCT 1", 
"PRODUCT 2", "PRODUCT 3", "PRODUCT 4", "PRODUCT 5", "PRODUCT 6"
), REVENUESJAN2010 = c(6400, 2700, 5900, 550, 1500, 19300), REVENUESFEB2010 = c(11050, 
3000, 4150, 600, 3750, 17250), REVENUESMARCH2010 = c(6550, 2800, 
5750, 0, 550, 23600), REVENUESDEC2016 = c(10600, 3800, 3750, 
650, 2100, 21250), COSTSJAN2010 = c(8500, 2850, 4200, 200, 1850, 
18200), COSTSFEB2010 = c(10400, 2400, 6100, 700, 1700, 26950), 
    COSTSMARCH2010 = c(9100, 3100, 2950, 100, 3150, 18200), COSTSDEC2016 = c(9850, 
    3250, 4600, 500, 450, 23900)), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
    cols = list(COMPANY = structure(list(), class = c("collector_character", 
    "collector")), PRODUCT = structure(list(), class = c("collector_character", 
    "collector")), REVENUESJAN2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESFEB2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESMARCH2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESDEC2016 = structure(list(), class = c("collector_double", 
    "collector")), COSTSJAN2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSFEB2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSMARCH2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSDEC2016 = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

