
Market basket analysis on a large data set

I am creating a market basket analysis on a large data set containing two columns (OrderID and Product). There are over a million rows in the set, and using the apriori function from the arules package I was able to create an effective rules list from a smaller subset of the data. However, when attempting to use the full set, I am not able to use the split function to aggregate the data by OrderID. Is there another function with functionality similar to split that can handle this much data? Code listed below:

MyData <- read.csv("C:/Market Basket Analysis/BOD16-Data.csv") #Abbreviated for proprietary reasons
View(MyData)

library(arules)
summary(MyData)

#Using the split function, we are able to aggregate the transactions, so that each
#product on the transaction is grouped into its respective, singular, transID

start.time <- Sys.time() #Timer used to measure run time on the split function
aggregateData <- split(MyData$Product, MyData$OrderID)
end.time <- Sys.time()

time.taken <- end.time - start.time
time.taken


head(aggregateData)

#Need to convert the aggregated data into a form that the arules package
#can accept
txns <- as(aggregateData, "transactions")
#txns <- read.transactions("Trans", format = "basket", sep=",", rm.duplicates=TRUE)
summary(txns)
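
As a side note, the commented-out read.transactions call above suggests another route: with format = "single", arules can build the transactions object directly from the two-column file and skip split entirely. This is a sketch, assuming a recent arules version (for the header argument) and a CSV header row with the column names used above:

#Build transactions straight from the (OrderID, Product) file; with
#format = "single", each row is one item, grouped by the transaction-id
#column named in cols
txns <- read.transactions("C:/Market Basket Analysis/BOD16-Data.csv",
                          format = "single", sep = ",", header = TRUE,
                          cols = c("OrderID", "Product"),
                          rm.duplicates = TRUE)
summary(txns)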


#The apriori algorithm generates the rules
Rules <- apriori(txns, parameter = list(supp = 0.0025, conf = 0.4, target = "rules", minlen = 2))
inspect(Rules)
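
On the full data set the rule list can be long; sorting before inspecting keeps the output manageable (standard arules calls, not part of the original script):

#Show only the ten highest-lift rules
inspect(head(sort(Rules, by = "lift"), 10))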

EDIT: My data is as follows:

OrderId     Product
1       1234
1       1357
1       2468
1       1324
2       1234
2       2468
3       4321
4       5432
5       1357

aggregateData should be:

[1] 1234, 1357, 2468, 1324
[2] 1234, 2468
[3] 4321
[4] 5432
[5] 1357

Currently I am using the split function to achieve these results, but when applying it to the larger set, the runtime exceeded 30 minutes before I stopped the script.

Is this any faster for you?

library(dplyr) 

df <- tribble(
  ~OrderId, ~Product,
  1,        1234,
  1,        1357,
  1,        2468,
  1,        1324,
  2,        1234,
  2,        2468,
  3,        4321,
  4,        5432,
  5,        1357
)

df %>%
  group_by(OrderId) %>%
  summarize(Product = list(Product)) %>%
  mutate(Product = purrr::set_names(Product, OrderId)) %>%
  pull(Product)
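
On the toy data, that pipeline should print a named list matching the expected output above, along these lines:

$`1`
[1] 1234 1357 2468 1324

$`2`
[1] 1234 2468

$`3`
[1] 4321

$`4`
[1] 5432

$`5`
[1] 1357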

So for your code you should be able to do:

library(dplyr)

MyData <- read.csv("C:/Market Basket Analysis/BOD16-Data.csv")


aggregateData <- MyData %>%
  group_by(OrderID) %>%
  summarize(Product = list(Product)) %>%
  mutate(Product = purrr::set_names(Product, OrderID)) %>%
  pull(Product)
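
If you want to time it the same way the original script timed split, base R's system.time() wraps the whole pipeline in one call; a sketch reusing the question's column name OrderID:

system.time({
  aggregateData <- MyData %>%
    group_by(OrderID) %>%
    summarize(Product = list(Product)) %>%
    mutate(Product = purrr::set_names(Product, OrderID)) %>%
    pull(Product)
})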

And that should give the same result as (and hopefully be faster than) doing:

MyData <- read.csv("C:/Market Basket Analysis/BOD16-Data.csv")

aggregateData <- split(MyData$Product, MyData$OrderID)
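
If dplyr is still slow at this scale, data.table is another option worth trying. This is a minimal sketch, not part of the original answer, assuming the same two-column MyData:

library(data.table)

#Group Product into one list element per OrderID, then name the list by
#OrderID so it has the same shape as split()'s output
DT <- as.data.table(MyData)
byOrder <- DT[, .(Products = list(Product)), by = OrderID]
aggregateData <- setNames(byOrder$Products, byOrder$OrderID)
head(aggregateData)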
