简体   繁体   中英

Reshape and combine data from long to wide format with R

I have data which looks like this:-

Customer.Name  Posting.Date  SKU Sales.Amount($)    
A         20150124    SKU1     12345  
B         20150125    SKU2     22345 
A         20150124    SKU3     52345  
A         20150125    SKU4     55345  
C         20150126    SKU6     62345  
B         20150127    SKU7     72345  
C         20150126    SKU8     72945

I'm trying to convert it into the following structures without column names:- Interim Data Structure:-

A         20150124    SKU1     SKU 3  
A         20150125    SKU4     
B         20150125    SKU2 
C         20150126    SKU6     SKU8  
B         20150127    SKU7

The interim output is sorted by posting date and each SKU purchased as part of a customer level transaction (identified by customer name + posting date) needs to get converted to a column.

Final data structure (below) needs to just contains SKUs purchased in transactions with the customer name and posting date columns removed from the interim output above.

SKU1     SKU3 
SKU4
SKU2   
SKU6     SKU8 
SKU7

I have tried the following code to get the final structure but am getting an error.

library(dplyr)
library(tidyr)
Dataset2 <- Data %>% 
group_by(Customer.Name, Posting.Date) %>%
summarise(SKU=toString(unique(SKU))) %>% 
select(Customer.Name, Posting.Date,SKU) %>%
spread(Customer.Name,SKU,drop=FALSE)

The error I get is

Error: Each variable must be named. Problem variables: 2

Could you please guide on how to solve this problem?

Here's one attempt. It has column names, though they can be discarded later as needed.

df <- data.frame(
  V1=c(     "A",      "B",      "A",      "A",      "C",      "B",      "C" ),
  V2=c(20150124, 20150125, 20150124, 20150125, 20150126, 20150127, 20150126 ),
  V3=c(  "SKU1",   "SKU2",   "SKU3",   "SKU4",   "SKU6",   "SKU7",   "SKU8" ),
  V4=c(   12345,    22345,    52345,    55345,    62345,    72345,    72945 ),
  stringsAsFactors = FALSE
)

I'm inferring from your question that the fourth column is unnecessary, so I remove it.

library(dplyr)
library(tidyr)

out1 <- df %>%
  select(-V4) %>%
  group_by(V1, V2) %>%
  mutate(row = row_number()) %>%
  spread(row, V3) %>%
  ungroup()
out1
# # A tibble: 5 × 4
#      V1       V2   `1`   `2`
# * <chr>    <dbl> <chr> <chr>
# 1     A 20150124  SKU1  SKU3
# 2     A 20150125  SKU4  <NA>
# 3     B 20150125  SKU2  <NA>
# 4     B 20150127  SKU7  <NA>
# 5     C 20150126  SKU6  SKU8

out1 %>% select(-V1, -V2)
# # A tibble: 5 × 2
#     `1`   `2`
# * <chr> <chr>
# 1  SKU1  SKU3
# 2  SKU4  <NA>
# 3  SKU2  <NA>
# 4  SKU7  <NA>
# 5  SKU6  SKU8

It is better to have column names without special characters. If we remove it

colnames(Data)[4] <- "Sales.Amount"

The OP's code works fine

Dataset2
# A tibble: 4 × 4
#  Posting.Date          A     B          C
#*         <int>      <chr> <chr>      <chr>
#1     20150124 SKU1, SKU3  <NA>       <NA>
#2     20150125       SKU4  SKU2       <NA>
#3     20150126       <NA>  <NA> SKU6, SKU8 
#4     20150127       <NA>  SKU7       <NA>

Or even with it, it works fine on tidyr_0.6.0 and dplyr_0.5.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM