Market basket analysis on large data set
I am creating a market basket analysis on a large data set containing 2 columns (OrderID and Product). There are over a million rows in the set, and using the apriori package I was able to create an effective rules list using a smaller subset of the data. However, when attempting to use the full set, I am not able to use the split function to aggregate the data by OrderID.
Is there another function with functionality similar to split that can handle this much data? Code listed below:
MyData <- read.csv("C:/Market Basket Analysis/BOD16-Data.csv") #Abbreviated for proprietary reasons
View(MyData)
library(arules)
summary(MyData)
#Using the split function, we are able to aggregate the transactions, so that each
#product on the transaction is grouped into its respective, singular, transID
start.time <- Sys.time() #Timer used to measure run time on the split function
aggregateData <- split(MyData$Product, MyData$OrderID)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
head(aggregateData)
#Need to convert the aggregated data into a form that 'Arules' package
#can accept
txns <- as(aggregateData, "transactions")
#txns <- read.transactions("Trans", format = "basket", sep=",", rm.duplicates=TRUE)
summary(txns)
#Apriori algorithm generates the rules
Rules <- apriori(txns, parameter = list(supp = 0.0025, conf = 0.4, target = "Rules", minlen = 2))
inspect(Rules)
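One way to avoid the split() step entirely is hinted at by the commented-out read.transactions line above: arules can read single-format data (one OrderID/Product pair per row) directly. A minimal sketch, using a small stand-in file since the real CSV is proprietary:

```r
library(arules)

# Stand-in for the real CSV; the actual file would be read the same way
csv <- tempfile(fileext = ".csv")
writeLines(c("OrderID,Product",
             "1,1234", "1,1357", "1,2468", "1,1324",
             "2,1234", "2,2468",
             "3,4321", "4,5432", "5,1357"), csv)

# format = "single" treats each row as one (transaction ID, item) pair,
# so no manual split()/as(..., "transactions") step is needed.
# cols names the transaction-ID and item columns; rm.duplicates drops
# repeated items within an order.
txns <- read.transactions(csv, format = "single", sep = ",",
                          header = TRUE, cols = c("OrderID", "Product"),
                          rm.duplicates = TRUE)
summary(txns)
```

This hands the grouping work to arules' own C code rather than doing it in R first.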
EDIT: My data would be as follows:
OrderId Product
1 1234
1 1357
1 2468
1 1324
2 1234
2 2468
3 4321
4 5432
5 1357
AggregateData should be:
[1]
1234,1357,2468,1324
[2]
1234, 2468
[3]
4321
[4]
5432
[5]
1357
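For reference, the sample data and the expected aggregation above can be reproduced in base R:

```r
# Rebuild the sample data from the table above
OrderId <- c(1, 1, 1, 1, 2, 2, 3, 4, 5)
Product <- c(1234, 1357, 2468, 1324, 1234, 2468, 4321, 5432, 1357)

# split() groups the Product values by OrderId, returning a named list
# with one element per order
aggregateData <- split(Product, OrderId)

aggregateData[["1"]]  # 1234 1357 2468 1324
length(aggregateData) # 5
```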
Currently I am using the split function to achieve these results, but when applying it to a larger set the runtime exceeded 30 minutes before I stopped the script.
Is this any faster for you?
library(dplyr)
df <- tribble(
  ~OrderId, ~Product,
  1, 1234,
  1, 1357,
  1, 2468,
  1, 1324,
  2, 1234,
  2, 2468,
  3, 4321,
  4, 5432,
  5, 1357
)
df %>%
  group_by(OrderId) %>%
  summarize(Product = list(Product)) %>%
  mutate(Product = purrr::set_names(Product, OrderId)) %>%
  pull(Product)
So for your code you should be able to do:
library(dplyr)
MyData <- read.csv("C:/Market Basket Analysis/BOD16-Data.csv")
aggregateData <- MyData %>%
  group_by(OrderId) %>%
  summarize(Product = list(Product)) %>%
  mutate(Product = purrr::set_names(Product, OrderId)) %>%
  pull(Product)
And that should be the same (and hopefully faster) as doing:
MyData <- read.csv("C:/Market Basket Analysis/BOD16-Data.csv")
aggregateData <- split(MyData$Product, MyData$OrderID)
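If the dplyr version is still slow at a million rows, a data.table sketch is another option (assuming data.table is installed; column names as in the sample data, with fread() standing in for read.csv() on the real file):

```r
library(data.table)

# Sample rows matching the EDIT above; for the real file you would use
# DT <- fread("C:/Market Basket Analysis/BOD16-Data.csv")
DT <- data.table(OrderId = c(1, 1, 1, 1, 2, 2, 3, 4, 5),
                 Product = c(1234, 1357, 2468, 1324, 1234, 2468,
                             4321, 5432, 1357))

# Grouped list-column aggregation: one list element of Products per OrderId
agg <- DT[, .(Products = list(Product)), by = OrderId]

# Name the list by order ID to mirror split()'s output
aggregateData <- setNames(agg$Products, agg$OrderId)
```

data.table's grouping is done in optimized C, which is typically what makes the difference at this row count.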