简体   繁体   English

R将BigCartel csv文件中的列拆分为dataframe或data.table中的长格式

[英]R split column in BigCartel csv file into long format in dataframe or data.table

Big Cartel has an option that exports orders into a csv file. 大卡特尔有一个选项可以将订单导出到一个csv文件中。 However the structure is not very good for the analysis I need to do. 但是,该结构对于我需要进行的分析不是很好。

Here is a subset of the columns and rows from a Big cartel csv order download (there are other columns which are not significant to the problem at hand). 这是大型卡特尔csv订单下载中的列和行的子集(还有其他列对于手头的问题并不重要)。

Number, Buyer name,Items,Item count,Item total,Total price,Total shipping,Total tax,Total discount
1,jim,product_name:Plate|product_option_name:Red|quantity:1|price:9.99|total:9.99,1,9.99,11.98,1.99,0,0
2,bill,product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99,2,19.98,22.98,3,0,0
3,jane,product_name:Plate|product_option_name:Red|quantity:1|price:6.99|total:6.99;product_name:Thingy|product_option_name:|quantity:1|price:9.99|total:9.99;product_name:Mug|product_option_name:Grey|quantity:1|price:10.99|total:10.99;product_name:Cup|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Saucer|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Stopper|product_option_name:|quantity:1|price:9.99|total:9.99,6,57.94,64.94,7,0,0
4,dale,product_name:Plate|product_option_name:Green|quantity:1|price:10.99|total:10.99,1,10.99,13.99,4.99,0,1.99

The items column can have multiple "line-items" with a semicolon (;) as the separator. items列可以包含多个以分号(;)为分隔符的“行项目”。 Each "line-item" has five attributes separated with a pipe (|), ie product_name,product_option_name, quantity, price and total (ie for the line). 每个“行项目”都有五个用竖线(|)分隔的属性,即product_name,product_option_name,数量,价格和总计(即该行)。 There is a column, "Item count", which gives the number of "line-items" plus columns for (order) Total Price, shipping, Tax and Discount. 有一个“项目计数”列,其中提供了“行项目”数以及(订单)总价,运费,税金和折扣的列。 For analysis I'd like the data in the following long format where shipping, tax and discount are also treated as 'product items'. 为了进行分析,我希望采用以下长格式的数据,其中运输,税收和折扣也被视为“产品”。

Number Buyer name line-item    product_option_name quantity price total
1      jim        Plate        Red                 1        9.99  9.99
1      jim        shipping                         1        1.99  1.99
1      jim        tax                              0        0     0
1      jim        discount                         0        0     0
2      bill       Plate        Green               1        9.99  9.99
2      bill       Plate        Blue                1        9.99  9.99
2      bill       shipping                         1        3     3
2      bill       tax                              0        0     0
2      bill       discount                         0        0     0
3      jane       Plate        Red                 1        6.99  6.99
3      jane       Thingy                           1        9.99  9.99
3      jane       Mug          Grey                1        10.99 10.99
3      jane       Cup          Grey                1        9.99  9.99
3      jane       Saucer       Grey                1        9.99  9.99
3      jane       Stopper                          1        9.99  9.99
3      jane       shipping                         1        7     7
3      jane       tax                              0        0     0
3      jane       discount                         0        0     0
4      dale       Plate        Green               1        10.99 10.99
4      dale       shipping                         1        4.99  4.99
4      dale       tax                              0        0     
4      dale       discount                         0        -1.99 -1.99

Using tstrsplit() from r:data.table and cSplit() from r:splitstackshape seems to be the solution but I can't get the syntax right. 解决方案:使用r:data.table中的tstrsplit()和r:splitstackshape中的cSplit()似乎是解决方案,但我无法正确理解语法。 I've also tried the tidyverse/dplyr functions separate/spread etc but I just cannot get the output I need whatever I do. 我也尝试过tidyverse / dplyr函数的split / spread等,但是我无法获得我需要做的输出。

I've been googling and searching through all the SO questions - there are some solutions (this one R: Split Variable Column into multiple (unbalanced) columns by comma ) that are close but none quite get me there as most assume a wide' format rather than 'long'. 我一直在谷歌搜索和搜索所有的SO问题-有一些解决方案(这个R:用逗号将变量列拆分为多个(不平衡的)列 )接近,但没有一个能使我到达那里,因为大多数都假设我使用宽格式而不是“长”。

Something like this may get you what you're looking for. 这样的事情可能会为您提供所需的东西。

library(dplyr)
library(tidyr)
library(stringr)

filepath <- # Path to datafile here

df <- read.csv(filepath, stringsAsFactors = FALSE)

cols <- paste0("col", 1:(max(str_count(df$Items, ";")) + 1))

df <- df %>%
      separate(col = Items, into = cols, sep = ";", fill = "right") %>%
      gather_("column", "details", cols, na.rm = TRUE) %>%
      select(-column) %>%
      separate(col = details, into = c("product_name", "product_option_name","quantity","price","total"), sep = "\\|", fill = "right") %>%
      mutate(product_name = sub("^.*\\:", "", product_name),
             product_option_name = sub("^.*\\:", "", product_option_name),
             quantity = sub("^.*\\:", "", quantity),
             price = sub("^.*\\:", "", price),
             total = sub("^.*\\:", "", total)) %>%
      gather("line", "item", c(Total.shipping, Total.discount, Total.tax, product_name)) %>%
      mutate(product_option_name = ifelse(line == "product_name" & product_option_name != "", product_option_name, NA),
             line_item = ifelse(line == "product_name", item, sub("^.*\\.","", line)),
             price = ifelse(line == "product_name", price, item),
             price = ifelse(line_item == "discount", as.numeric(price) * (-1), price),
             quantity = ifelse(line_item %in% c("shipping","discount","tax") & price == "0", 0, quantity),
             total = as.numeric(price) * as.numeric(quantity)) %>%
      distinct() %>%
      select(Number, Buyer.name, line_item, product_option_name, quantity, price, total) %>%
      arrange(Number)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM