简体   繁体   中英

Reshape large dataset with multiple columns from wide to long

I have a very large dataset that I need to reshape from wide to long.

My datset looks something like:

  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900

And I want them to look like:

  COMPANY    PRODUCT    DATE  REVENUES  COSTS
COMPANY A  PRODUCT 1  Dec-16     10600   9850
COMPANY A  PRODUCT 1  Feb-10     11050  10400
COMPANY A  PRODUCT 1  Jan-10      6400   8500
COMPANY A  PRODUCT 1  Mar-10      6550   9100
COMPANY A  PRODUCT 2  Dec-16      3800   3250
COMPANY A  PRODUCT 2  Feb-10      3000   2400
COMPANY A  PRODUCT 2  Jan-10      2700   2850
COMPANY A  PRODUCT 2  Mar-10      2800   3100
COMPANY B  PRODUCT 3  Dec-16      3750   4600
COMPANY B  PRODUCT 3  Feb-10      4150   6100
COMPANY B  PRODUCT 3  Jan-10      5900   4200
COMPANY B  PRODUCT 3  Mar-10      5750   2950
COMPANY B  PRODUCT 4  Dec-16       650    500
COMPANY B  PRODUCT 4  Feb-10       600    700
COMPANY B  PRODUCT 4  Jan-10       550    200
COMPANY B  PRODUCT 4  Mar-10         0    100
COMPANY B  PRODUCT 5  Dec-16      2100    450
COMPANY B  PRODUCT 5  Feb-10      3750   1700
COMPANY B  PRODUCT 5  Jan-10      1500   1850
COMPANY B  PRODUCT 5  Mar-10       550   3150
COMPANY C  PRODUCT 6  Dec-16     21250  23900
COMPANY C  PRODUCT 6  Feb-10     17250  26950
COMPANY C  PRODUCT 6  Jan-10     19300  18200
COMPANY C  PRODUCT 6  Mar-10     23600  18200

In Stata I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string

How do I do this in R?

There are several other ways to approach this that would be a bit more streamlined than the "tidyverse" options already suggested.

All of the following examples use the sample data from JMT2080AD's answer with set.seed(1) (for reproducibility).

Option 1: Base R's reshape

It's not always the easist function to use, but the reshape function is pretty powerful once you figre it out. In this case, you don't have a sep , which makes things a little bit trickier in the sense that you have to be more specific about things like your resulting variable names and the values that should show up as "times" (by default, they would just sequential numbers).

times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
reshape(yourData, direction = "long", 
        varying = grep("revenues|cost", names(yourData)), sep = "", 
        v.names = c("revenues", "cost"), timevar = "date", times = times)
#             company   product    date revenues cost id
# 1.Jan2010 Company A Product 1 Jan2010     2862 1164  1
# 2.Jan2010 Company A Product 2 Jan2010     2152 1430  2
# 3.Jan2010 Company B Product 3 Jan2010     2073 1932  3
# 4.Jan2010 Company B Product 4 Jan2010      654 2771  4
# 5.Jan2010 Company B Product 5 Jan2010     1015 1004  5
# 6.Jan2010 Company C Product 6 Jan2010      941 2746  6
# ....

That's pretty much what you're looking for, maybe with a bit of difference in the date formatting.

Option 2: data.table

If performance is what you're after, you can look at melt from "data.table", with which you should be able to do something like the following. As with the reshape approach, you'll need to store the "times" to reintroduce the dates after melt ing the data.

(Note: I know this is very similar to @Uwe's approach .)

library(data.table)
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
melt(as.data.table(yourData), measure.vars = patterns("revenues", "cost"),
     value.name = c("revenues", "cost"))[
       , variable := factor(variable, labels = times)][]
#       company   product variable revenues cost
#  1: Company A Product 1  Jan2010     1164 1168
#  2: Company A Product 2  Jan2010     1430 1465
#  3: Company B Product 3  Jan2010     1932  533
#  4: Company B Product 4  Jan2010     2771 1456
#  5: Company B Product 5  Jan2010     1004 2674
# ---                                           
# 20: Company A Product 2  Apr2010     2444 1883
# 21: Company B Product 3  Apr2010     2837 1824
# 22: Company B Product 4  Apr2010     1030 2473
# 23: Company B Product 5  Apr2010     2129  558
# 24: Company C Product 6  Apr2010      814 1693

Option 3: merged.stack

My "splitstackshape" pacakge has a function called merged.stack that tries to make this particular kind of reshaping easier to do. With it, you could try:

library(splitstackshape)
merged.stack(yourData, var.stubs = c("revenues", "cost"), sep = "var.stubs")
#       company   product .time_1 revenues cost
#  1: Company A Product 1 Apr2010     1450 2457
#  2: Company A Product 1 Feb2010     2862 1705
#  3: Company A Product 1 Jan2010     1164 1168
#  4: Company A Product 1 Mar2010     2218 2486
#  5: Company A Product 2 Apr2010     2444 1883
#  6: Company A Product 2 Feb2010     2152 1999
#  7: Company A Product 2 Jan2010     1430 1465
#  8: Company A Product 2 Mar2010     1460  770
#  9: Company B Product 3 Apr2010     2837 1824
# 10: Company B Product 3 Feb2010     2073 1734
# ... 

One day, I'll get around to updating the function, which was written before melt in "data.table" could handle a semi-wide output format. I've already come up with a partial solution , but then I stopped fiddling with it.

In fact, using the above linked function, the solution would be a simple:

ReshapeLong_(yourData, c("revenues", "cost"))

Option 4: extract from the "tidyverse"

The other solutions using the tidyverse seem to be going about things in a very strange manner. A better solution would be to use extract to get the data you need into new columns. You'll have to first gather the data into a very long format and then spread the data into a wide format.

Here's the approach I would use:

library(tidyverse)
yourData %>% 
  gather(var, val, -company, -product) %>%
  extract(var, into = c("type", "month", "year"), 
          regex = ("(revenues|cost)(...)(.*)")) %>%
  spread(type, val)
#      company   product month year cost revenues
# 1  Company A Product 1   Apr 2010 2457     1450
# 2  Company A Product 1   Feb 2010 1705     2862
# 3  Company A Product 1   Jan 2010 1168     1164
# 4  Company A Product 1   Mar 2010 2486     2218
# 5  Company A Product 2   Apr 2010 1883     2444
# 6  Company A Product 2   Feb 2010 1999     2152
# ...

The tricky thing here is that you have your dates packed into the column names. Those have to be parsed out before you can make the table as you would like it. I have iterated through each column, parsing each sub table column name for the date and type of observation, binding each sub table, then casting on cost/revenue. I'm sure there is a more elegant solution out there.

library(reshape)

## making a table similar to yours here
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                       product = paste("Product", 1:6),
                       revenuesJan2010 = round(runif(6, 500, 3000)),
                       revenuesFeb2010 = round(runif(6, 500, 3000)),
                       revenuesMar2010 = round(runif(6, 500, 3000)),
                       revenuesApr2010 = round(runif(6, 500, 3000)),
                       costJan2010 = round(runif(6, 500, 3000)),
                       costFeb2010 = round(runif(6, 500, 3000)),
                       costMar2010 = round(runif(6, 500, 3000)),
                       costApr2010 = round(runif(6, 500, 3000)))

## a function that parses the date from the column name
columnParse <- function(tab){
    colNm   <- names(tab)[3]
    names(tab)[3] <- "value"
    colDate  <- strsplit(colNm, "revenues|cost")[[1]][2]
    colDate  <- gsub("([A-Za-z]+)", "\\1-", colDate)
    tab$date <- colDate
    tab$type <- gsub("(revenues|cost).*", "\\1", colNm)
    return(tab)
}

## running that function against sub tables of your data, then binding
yourDataLong <- do.call(rbind,
                        lapply(3:ncol(yourData),
                               function(x) columnParse(yourData[c(1:2, x)])))

## casting your data on cost/revenue
yourDataCast <- cast(yourDataLong, company+product+date~type, value = "value")

Here's another option using tidyverse and stringr :

yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                   product = paste("Product", 1:6),
                   REVENUESJan2010 = round(runif(6, 500, 3000)),
                   REVENUESFeb2010 = round(runif(6, 500, 3000)),
                   REVENUESMar2010 = round(runif(6, 500, 3000)),
                   REVENUESApr2010 = round(runif(6, 500, 3000)),
                   COSTSJan2010 = round(runif(6, 500, 3000)),
                   COSTSFeb2010 = round(runif(6, 500, 3000)),
                   COSTSMar2010 = round(runif(6, 500, 3000)),
                   COSTSApr2010 = round(runif(6, 500, 3000)))

The solution using tidyverse and stringr :

library(tidyverse)
library(stringr)

newData <- yourData %>%
   gather(key = rev.cost.date, value, -company, -product) %>%
   mutate(finance.type = ifelse(str_detect(rev.cost.date, fixed("REVENUES")), "REVENUES", "COSTS")) %>%
   mutate(date = str_replace(rev.cost.date, "REVENUES|COSTS", "")) %>%
   select(-rev.cost.date) %>%
   spread(value = value, key = finance.type) %>%
   mutate(date = paste0(str_sub(date, 0, 3), "-", str_sub(date, 4,8))

As of version 1.9.6 (on CRAN 19 Sep 2015), data.table can melt multiple columns simultaneously (using the patterns() function). So, the columns starting with REVENUES and COSTS can be gathered into two separate columns.

In addition, the dates (months) are packed into the columns names without separator. These are extracted from the column names using a regular expression with look-behind and are used to replace the factor levels of the DATE column.

library(data.table)
library(magrittr)
cols <- c("REVENUES", "COSTS")
long <- melt(wide, measure.vars = patterns(cols), value.name = cols, variable.name = "DATE")
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() 
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
 COMPANY PRODUCT DATE REVENUES COSTS 1: COMPANY A PRODUCT 1 JAN2010 6400 8500 2: COMPANY A PRODUCT 2 JAN2010 2700 2850 3: COMPANY B PRODUCT 3 JAN2010 5900 4200 4: COMPANY B PRODUCT 4 JAN2010 550 200 5: COMPANY B PRODUCT 5 JAN2010 1500 1850 6: COMPANY C PRODUCT 6 JAN2010 19300 18200 7: COMPANY A PRODUCT 1 FEB2010 11050 10400 8: COMPANY A PRODUCT 2 FEB2010 3000 2400 9: COMPANY B PRODUCT 3 FEB2010 4150 6100 10: COMPANY B PRODUCT 4 FEB2010 600 700 11: COMPANY B PRODUCT 5 FEB2010 3750 1700 12: COMPANY C PRODUCT 6 FEB2010 17250 26950 13: COMPANY A PRODUCT 1 MARCH2010 6550 9100 14: COMPANY A PRODUCT 2 MARCH2010 2800 3100 15: COMPANY B PRODUCT 3 MARCH2010 5750 2950 16: COMPANY B PRODUCT 4 MARCH2010 0 100 17: COMPANY B PRODUCT 5 MARCH2010 550 3150 18: COMPANY C PRODUCT 6 MARCH2010 23600 18200 19: COMPANY A PRODUCT 1 DEC2016 10600 9850 20: COMPANY A PRODUCT 2 DEC2016 3800 3250 21: COMPANY B PRODUCT 3 DEC2016 3750 4600 22: COMPANY B PRODUCT 4 DEC2016 650 500 23: COMPANY B PRODUCT 5 DEC2016 2100 450 24: COMPANY C PRODUCT 6 DEC2016 21250 23900 COMPANY PRODUCT DATE REVENUES COSTS

Edit: Using ISO months naming scheme for proper ordering

Using the naming scheme of alphabetic month name and year does not allow to sort the data by DATE properly. DEC2016 come before FEB2010 , and FEB2010 before JAN2010 . The ISO 8601 nameing convention puts the year first followed by the number of the months.

We can use this naming scheme as follows:

months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() %>%
  paste0("01", .) %>% lubridate::dmy() %>% format("%Y-%m")
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
 COMPANY PRODUCT DATE REVENUES COSTS 1: COMPANY A PRODUCT 1 2010-01 6400 8500 2: COMPANY A PRODUCT 2 2010-01 2700 2850 3: COMPANY B PRODUCT 3 2010-01 5900 4200 4: COMPANY B PRODUCT 4 2010-01 550 200 5: COMPANY B PRODUCT 5 2010-01 1500 1850 6: COMPANY C PRODUCT 6 2010-01 19300 18200 7: COMPANY A PRODUCT 1 2010-02 11050 10400 8: COMPANY A PRODUCT 2 2010-02 3000 2400 9: COMPANY B PRODUCT 3 2010-02 4150 6100 10: COMPANY B PRODUCT 4 2010-02 600 700 11: COMPANY B PRODUCT 5 2010-02 3750 1700 12: COMPANY C PRODUCT 6 2010-02 17250 26950 13: COMPANY A PRODUCT 1 2010-03 6550 9100 14: COMPANY A PRODUCT 2 2010-03 2800 3100 15: COMPANY B PRODUCT 3 2010-03 5750 2950 16: COMPANY B PRODUCT 4 2010-03 0 100 17: COMPANY B PRODUCT 5 2010-03 550 3150 18: COMPANY C PRODUCT 6 2010-03 23600 18200 19: COMPANY A PRODUCT 1 2016-12 10600 9850 20: COMPANY A PRODUCT 2 2016-12 3800 3250 21: COMPANY B PRODUCT 3 2016-12 3750 4600 22: COMPANY B PRODUCT 4 2016-12 650 500 23: COMPANY B PRODUCT 5 2016-12 2100 450 24: COMPANY C PRODUCT 6 2016-12 21250 23900 COMPANY PRODUCT DATE REVENUES COSTS

Data

library(data.table)
wide <- data.table(
readr::read_table(
"  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010     REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010     COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900"
))

I think the most explicit (ie no need to rename variables) way to reshape wide to long in R is to use the base R reshape() function and and specify the varying columns to be "stacked" as a list . See this blog post.

I'll use the data from JMT2080AD's answer and set the seed to set.seed(789) .

### Create a list of the variables you want to reshape/stack
reshape.vars <- list(c("revenuesJan2010",   "revenuesFeb2010",  "revenuesMar2010",  "revenuesApr2010"), # revenues
                     c("costJan2010",   "costFeb2010",  "costMar2010",  "costApr2010")) # cost 
### reshape wide to long
reshape(yourData,                      #dataframe
        direction="long",             #wide to long
        varying=reshape.vars, #repeated measures list of indexes for vars to stack/reshape
        timevar="date",              #the repeated measures times
        v.names=c("revenues", "cost")) #the repeated measures names

#     company   product date   revenues cost id
# 1.1 Company A Product 1    1     2250 1574  1
# 2.1 Company A Product 2    1      734 1793  2
# 3.1 Company B Product 3    1      530 1282  3
# 4.1 Company B Product 4    1     1979 1741  4
# 5.1 Company B Product 5    1     1730 2558  5
# 6.1 Company C Product 6    1      550 1757  6
# 1.2 Company A Product 1    2     1932 1048  1
#...
# 5.3 Company B Product 5    3      890 1103  5
# 6.3 Company C Product 6    3     2113 2469  6
# 1.4 Company A Product 1    4     2426 2382  1
# 2.4 Company A Product 2    4      778 2995  2
# 3.4 Company B Product 3    4     1359  989  3
# 4.4 Company B Product 4    4     1618  912  4
# 5.4 Company B Product 5    4      895 2109  5
# 6.4 Company C Product 6    4     1258 2803  6

Using the list method

  • you do not have to rename variables
  • since the variables you want to create are explicitly defined in the list, there are not errors having to do with reshape() inferring which variables should be stacked

I've found that even when there are a 100+ variables to reshape, if renaming them can be cumbersome, then using copy/paste to create the varying variables list does not take that long.

As a stata to r convert who loved reshape in stata, I found tidyr::gather and tidyr::spread to be very intuitive. Gather is basically reshape long and spread is reshape wide.

Here is the code that would change your data to the way you want it:

new_data <- 
gather(data = your-data-frame, 
       key = var_holder,
       value = val_holder,
       -company,
       -product) 

new_data$var_holder <- sub("REVENUE", "cost_", new_data$var_holder)                                     
new_data$var_holder <- sub("COST", "cost_", new_data$var_holder)

new_data <- 
    separate(data = new_data,
             col = var_holder,
             into = c("var", "date")) %>%
    spread(key = var,
           value = val_holder)

And done!

gather works by taking all the variable names specified (or in this, NOT specified, note the two variables preceded by the '-' sign), and puts them under a new variable whose name is specified by "key = ..." (creating new rows as it goes). It then takes the values that fell under those variables and puts them under a single variable whose name is specified by "value = ...".

spread works in the opposite direction. Hope this helps!

An option using the devel version of tidyr (version - '0.8.3.9000')

library(dplyr)
library(tidyr)
library(stringr)
library(zoo)
library(readr)

df1 %>% 
   rename_at(3:ncol(.), ~ str_replace(., "^(REVENUES|COSTS)", "\\1_")) %>%
   pivot_longer(c(-COMPANY, -PRODUCT), names_to = c(".value", "DATE"), names_sep = "_") %>% 
   mutate(DATE = format(as.yearmon(DATE), "%b-%Y"))
# A tibble: 24 x 5
#   COMPANY   PRODUCT   DATE     REVENUES COSTS
#   <chr>     <chr>     <chr>       <dbl> <dbl>
# 1 COMPANY A PRODUCT 1 Jan-2010     6400  8500
# 2 COMPANY A PRODUCT 1 Feb-2010    11050 10400
# 3 COMPANY A PRODUCT 1 Mar-2010     6550  9100
# 4 COMPANY A PRODUCT 1 Dec-2016    10600  9850
# 5 COMPANY A PRODUCT 2 Jan-2010     2700  2850
# 6 COMPANY A PRODUCT 2 Feb-2010     3000  2400
# 7 COMPANY A PRODUCT 2 Mar-2010     2800  3100
# 8 COMPANY A PRODUCT 2 Dec-2016     3800  3250
# 9 COMPANY B PRODUCT 3 Jan-2010     5900  4200
#10 COMPANY B PRODUCT 3 Feb-2010     4150  6100
# … with 14 more rows

data

df1 <- structure(list(COMPANY = c("COMPANY A", "COMPANY A", "COMPANY B", 
"COMPANY B", "COMPANY B", "COMPANY C"), PRODUCT = c("PRODUCT 1", 
"PRODUCT 2", "PRODUCT 3", "PRODUCT 4", "PRODUCT 5", "PRODUCT 6"
), REVENUESJAN2010 = c(6400, 2700, 5900, 550, 1500, 19300), REVENUESFEB2010 = c(11050, 
3000, 4150, 600, 3750, 17250), REVENUESMARCH2010 = c(6550, 2800, 
5750, 0, 550, 23600), REVENUESDEC2016 = c(10600, 3800, 3750, 
650, 2100, 21250), COSTSJAN2010 = c(8500, 2850, 4200, 200, 1850, 
18200), COSTSFEB2010 = c(10400, 2400, 6100, 700, 1700, 26950), 
    COSTSMARCH2010 = c(9100, 3100, 2950, 100, 3150, 18200), COSTSDEC2016 = c(9850, 
    3250, 4600, 500, 450, 23900)), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
    cols = list(COMPANY = structure(list(), class = c("collector_character", 
    "collector")), PRODUCT = structure(list(), class = c("collector_character", 
    "collector")), REVENUESJAN2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESFEB2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESMARCH2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESDEC2016 = structure(list(), class = c("collector_double", 
    "collector")), COSTSJAN2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSFEB2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSMARCH2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSDEC2016 = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM