
Efficient algorithms to perform Market Basket Analysis

I want to perform Market Basket Analysis (or Association Analysis) on a retail e-commerce dataset.

The problem I am facing is the data size: 3.3 million transactions in a single month. I cannot cut down the transactions, as I may miss some products. The structure of the data is provided below:

Order_ID = Unique transaction identifier

Customer_ID = Identifier of the customer who placed the order

Product_ID = List of all the products the customer has purchased

Date = Date on which the sale happened

When I feed this data to the apriori algorithm in Python, my system cannot handle the huge memory requirements. It can run with just 100K transactions. I have 16 GB of RAM.

Any help in suggesting a better (and faster) algorithm is much appreciated.

I can use SQL as well to sort out the data size issues, but then I will get only 1 Antecedent --> 1 Consequent rules. Is there a way to get multi-item rules such as {A,B,C} --> {D,E}, i.e., if a customer purchases products A, B and C, then there is a high chance that they also purchase products D and E?

For a huge data size, try FP-Growth, as it is an improvement on the Apriori method. It also scans the data only twice, whereas Apriori rescans it for every candidate itemset length.

from mlxtend.frequent_patterns import fpgrowth

Then just change:

apriori(df, min_support=0.6)

to:

fpgrowth(df, min_support=0.6)
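The question also asks for rules with several items on each side, such as {A,B,C} --> {D,E}. Once the frequent itemsets and their supports are known, such rules can be derived by splitting each itemset into an antecedent and a consequent and checking the confidence. mlxtend has an association_rules function in the same module for this; the plain-Python sketch below only illustrates the idea. The function names, the toy transactions and the thresholds are made up for illustration, and toy_frequent_itemsets is a brute-force stand-in for fpgrowth that is only viable on tiny inputs.

```python
from itertools import combinations


def toy_frequent_itemsets(transactions, min_support):
    """Brute-force stand-in for fpgrowth (illustrative, tiny inputs only).

    Returns a dict mapping frozenset(items) -> support as a fraction.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(candidate <= t for t in transactions) / n
            if support >= min_support:
                result[candidate] = support
    return result


def generate_rules(itemset_support, min_confidence):
    """Split every frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, support in itemset_support.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                # Every subset of a frequent itemset is itself frequent,
                # so the antecedent's support is present in the dict.
                confidence = support / itemset_support[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, support, confidence))
    return rules


# Hypothetical toy orders; each set is the Product_ID list of one Order_ID.
transactions = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"C", "E"},
]

freq = toy_frequent_itemsets(transactions, min_support=0.4)
rules = generate_rules(freq, min_confidence=0.6)

# Show only rules with more than one item on both sides.
for antecedent, consequent, support, confidence in rules:
    if len(antecedent) > 1 and len(consequent) > 1:
        print(sorted(antecedent), "-->", sorted(consequent),
              f"support={support:.2f} confidence={confidence:.2f}")
```

The number of possible splits grows exponentially with itemset length, so on real data it is probably worth capping the itemset length (fpgrowth accepts a max_len argument) before generating rules.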

There is also research comparing these algorithms. For the memory issue I recommend: Evaluation of Apriori, FP growth and Eclat association rule mining algorithms, or Comparing the Performance of Frequent Pattern Mining Algorithms.
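To make the "two scans" point concrete, here is a rough sketch of how an FP-tree is built: the first pass counts items, the second pass inserts each transaction into a prefix tree so that orders sharing frequent items share nodes. This covers tree construction only; the mining step and the header table with node links that a real FP-Growth implementation uses are left out. FPNode, build_fp_tree and count_nodes are illustrative names, not part of mlxtend.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item plus how many orders pass through it."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Pass 1: count in how many transactions each item occurs.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in item_counts.items() if c >= min_count}

    # Pass 2: insert every transaction, keeping only frequent items and
    # ordering them by descending count (ties broken by name) so that
    # common prefixes end up on the same path.
    root = FPNode(None)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
            child.count += 1
            node = child
    return root, frequent


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical toy orders, same shape as the Product_ID lists in the question.
transactions = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"C", "E"},
]

root, frequent = build_fp_tree(transactions, min_count=2)
total_items = sum(len(t) for t in transactions)
print("item occurrences:", total_items, "tree nodes:", count_nodes(root))
```

How much the tree compresses the data depends on how many prefixes the transactions share, so the memory saving on a real catalogue may be smaller than on a toy example like this.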
