
Compress Large Data in R into csv without NULLS or LIST

FIRST TIME POSTING:

I'm preparing data for read.transactions() from the arules package and need to collapse the invoice data (500k+ cases) so that each unique Invoice and its associated info fits on a single line, like this:

Invoice001,CustomerID,Country,StockCodeXYZ,StockCode123

Invoice002...etc

However, the data as read in repeats the Invoice on a separate row for each StockCode, like this:

Invoice001,CustomerID,Country,StockCodeXYZ

Invoice001,CustomerID,Country,StockCode123

Invoice002....etc

I've been trying pivot_wider() and then unite(), but that generates 285M+ mostly NULL cells in a list, which I'm having a hard time resolving and which I cannot write to csv or read into arules. I've also tried keep(~.is.null(,)), discard(is,null) and compact() without success, and am open to any method that achieves the desired outcome above.

However, I feel like I should be able to solve it using the read.transactions() function built into arules, but I'm getting various errors as I try different things there too.

The data is open source from the University of California, Irvine and can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx

Any help would be greatly appreciated.

library(readxl)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
destfile <- "Online_20Retail.xlsx"
curl::curl_download(url, destfile)
Online_20Retail <- read_excel(destfile)

trans <- read.transactions(????????????)

One invoice, "573585", has over 1,000 items, so a wide layout needs a matching number of columns: even if you keep only the stock code of each invoice item, you still end up with a bit over 1,000 columns.
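A quick way to check how wide such a layout would get is to count the rows per invoice. The following is only a sketch on invented toy data (the `toy` data frame and its values are hypothetical, not taken from the Online Retail file); the same two calls should apply to the real InvoiceNo column.

```r
# Toy data: invoice "A" has three items, invoice "B" has one (invented values)
toy <- data.frame(
  InvoiceNo = c("A", "A", "A", "B"),
  StockCode = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

# Number of rows (items) per invoice
items_per_invoice <- table(toy$InvoiceNo)

# A wide layout needs as many item columns as the largest invoice has items
max_items <- max(items_per_invoice)
widest_invoice <- names(which.max(items_per_invoice))
```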

library(dplyr)


Online_20Retail %>% 
    dplyr::transmute(new = paste0(InvoiceNo, ", ", 
                                  CustomerID, ", ", 
                                  Country, ", "), 
                     StockCode) %>% 
    dplyr::group_by(new) %>% 
    dplyr::summarise(output = paste(StockCode, collapse = ", ")) %>%
    dplyr::transmute(mystring = paste0(new, output)) 
    # append "%>% dplyr::pull(mystring)" to the line above to get a character vector instead of a tibble/data frame


# A tibble: 25,900 x 1
   mystring                                                                                                                                         
   <chr>                                                                                                                                            
 1 536365, 17850, United Kingdom, 85123A, 71053, 84406B, 84029G, 84029E, 22752, 21730                                                               
 2 536366, 17850, United Kingdom, 22633, 22632                                                                                                      
 3 536367, 13047, United Kingdom, 84879, 22745, 22748, 22749, 22310, 84969, 22623, 22622, 21754, 21755, 21777, 48187                                
 4 536368, 13047, United Kingdom, 22960, 22913, 22912, 22914                                                                                        
 5 536369, 13047, United Kingdom, 21756                                                                                                             
 6 536370, 12583, France, 22728, 22727, 22726, 21724, 21883, 10002, 21791, 21035, 22326, 22629, 22659, 22631, 22661, 21731, 22900, 21913, 22540, 22~
 7 536371, 13748, United Kingdom, 22086                                                                                                             
 8 536372, 17850, United Kingdom, 22632, 22633                                                                                                      
 9 536373, 17850, United Kingdom, 85123A, 71053, 84406B, 20679, 37370, 21871, 21071, 21068, 82483, 82486, 82482, 82494L, 84029G, 84029E, 22752, 217~
10 536374, 15100, United Kingdom, 21258                                                                                                             
# ... with 25,890 more rows
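For the arules step itself, here is a minimal sketch of one possible route, not necessarily what the asker ended up using. It assumes that only the stock codes should become items: with format = "basket", read.transactions() treats every field on a line as an item, so an invoice number, customer ID or country left on the line would be counted as items too. The `retail` data frame is a made-up stand-in for Online_20Retail, and `baskets`, `basket_lines` and `basket_file` are hypothetical names. The arules call is guarded so the rest runs without the package installed.

```r
# Toy stand-in for the Online Retail sheet (values invented for illustration)
retail <- data.frame(
  InvoiceNo  = c("536365", "536365", "536366", "536366", "536367"),
  CustomerID = c(17850, 17850, 17850, 17850, 13047),
  Country    = "United Kingdom",
  StockCode  = c("85123A", "71053", "22633", "22632", "84879"),
  stringsAsFactors = FALSE
)

# Collect the stock codes of each invoice: one list element per invoice
baskets <- split(retail$StockCode, retail$InvoiceNo)

# One comma-separated line per invoice, items only
basket_lines <- vapply(baskets, paste, character(1), collapse = ",")

# Write the lines to a file in "basket" layout (one transaction per line)
basket_file <- tempfile(fileext = ".csv")
writeLines(basket_lines, basket_file)

# Read the file back as transactions, if arules is available
if (requireNamespace("arules", quietly = TRUE)) {
  trans <- arules::read.transactions(basket_file, format = "basket", sep = ",")
  print(summary(trans))
}
```

Because the lines are built with split() and paste(), no wide table is ever created, so there are no NULL or list cells to clean up before writing the file.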
