I want to perform Market Basket Analysis (or Association Analysis) on a retail e-commerce dataset.
The problem I am facing is the size of the data: 3.3 million transactions in a single month. I cannot cut down the transactions, as I might miss some products. The structure of the data is given below:
Order_ID = Unique transaction identifier
Customer_ID = Identifier of the customer who placed the order
Product_ID = List of all the products the customer has purchased
Date = Date on which the sale has happened
When I feed this data to the apriori algorithm in Python, my system cannot handle the memory requirements. It runs with only 100K transactions. I have 16 GB of RAM.
Any help in suggesting a better (and faster) algorithm is much appreciated.
I can also use SQL to work around the data size issue, but then I only get rules with 1 antecedent --> 1 consequent. Is there a way to get multi-item rules such as {A,B,C} --> {D,E}, i.e., if a customer purchases products A, B and C, then there is a high chance they will also purchase products D and E?
For a dataset this large, try FP-Growth, which is an improvement on the Apriori method. It also scans the data only twice, whereas Apriori rescans it once for every candidate itemset size.
from mlxtend.frequent_patterns import fpgrowth
Then just change:
apriori(df, min_support=0.6)
To
fpgrowth(df, min_support=0.6)
There is also research that compares these algorithms. For the memory issue I recommend "Evaluation of Apriori, FP growth and Eclat association rule mining algorithms" or "Comparing the Performance of Frequent Pattern Mining Algorithms".
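On the second part of the question: rules with several items on each side, such as {A,B,C} --> {D,E}, come out of the frequent itemsets themselves, because any frequent itemset can be split into an antecedent and a consequent. The sketch below uses only the standard library to show how such a rule is scored. The helper name rule_metrics and the toy baskets are hypothetical, made up for illustration, and it is meant to explain the numbers rather than to run on 3.3 million orders.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent --> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(baskets)
    # Count baskets that contain every item of each side, and of both sides together.
    n_ante = sum(1 for b in baskets if antecedent <= b)
    n_cons = sum(1 for b in baskets if consequent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / n                                  # P(antecedent and consequent)
    confidence = n_both / n_ante if n_ante else 0.0       # P(consequent | antecedent)
    lift = confidence / (n_cons / n) if n_cons else 0.0   # confidence / P(consequent)
    return support, confidence, lift


# Toy baskets: each one is the set of products in a single order.
baskets = [
    frozenset("ABCDE"),
    frozenset("ABCDE"),
    frozenset("ABC"),
    frozenset("ABCD"),
    frozenset("DE"),
    frozenset("AB"),
    frozenset("CE"),
    frozenset("ABCDE"),
]

support, confidence, lift = rule_metrics(baskets, {"A", "B", "C"}, {"D", "E"})
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift above 1 means D and E show up together with A, B and C more often than they would if the two sides were independent. In mlxtend, the association_rules function in mlxtend.frequent_patterns derives rules of this kind from the output of fpgrowth.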