Efficient algorithm for performing market basket analysis

Question

I want to perform Market Basket Analysis (or Association Analysis) on a retail e-commerce dataset.

The problem I am facing is the sheer size of the data: 3.3 million transactions in a single month. I cannot cut down the transactions, as I might miss some products. The structure of the data is given below:

Order_ID = Unique transaction identifier

Customer_ID = Identifier of the customer who placed the order

Product_ID = List of all the products the customer has purchased

Date = Date on which the sale happened

When I feed this data to the Apriori algorithm in Python, my system cannot handle the huge memory requirements. It only runs with about 100K transactions. I have 16 GB of RAM.

Any help in suggesting a better (and faster) algorithm is much appreciated.

I can also use SQL to work around the data size issue, but then I only get rules with 1 antecedent --> 1 consequent. Is there a way to get multi-item rules such as {A,B,C} --> {D,E}, i.e., if a customer purchases products A, B and C, then there is a high chance they will also purchase products D and E?

Answer 1

For data this large, try FP-Growth, which is an improvement on the Apriori method. It also scans the data only twice, whereas Apriori makes a new pass over the data for each itemset length.

from mlxtend.frequent_patterns import fpgrowth

Then just change:

apriori(df, min_support=0.6)

to

fpgrowth(df, min_support=0.6)

There is also research comparing these algorithms. For the memory issue I recommend "Evaluation of Apriori, FP growth and Eclat association rule mining algorithms" or "Comparing the Performance of Frequent Pattern Mining Algorithms".
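On the multi-item rules the question asks about: once frequent itemsets are known, a rule such as {A,B,C} --> {D,E} is just one way of splitting an itemset, with confidence = support(A,B,C,D,E) / support(A,B,C). The plain-Python sketch below illustrates that step on made-up toy orders. The names orders, support and rules_from are hypothetical helpers, not part of any library. mlxtend also provides an association_rules function in mlxtend.frequent_patterns for the same purpose.

```python
from itertools import combinations

# Hypothetical toy orders; each set holds the Product_IDs of one Order_ID.
orders = [set("ABCDE"), set("ABCDE"), set("ABC"), set("BD"), set("ACE")]

def support(itemset):
    """Fraction of orders that contain every item in `itemset`."""
    return sum(itemset <= order for order in orders) / len(orders)

def rules_from(itemset, min_confidence):
    """Split one frequent itemset into every antecedent --> consequent rule."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for left in combinations(sorted(itemset), size):
            antecedent = frozenset(left)
            confidence = support(itemset) / support(antecedent)
            if confidence >= min_confidence:
                rules.append((antecedent, itemset - antecedent, confidence))
    return rules

rules = rules_from("ABCDE", min_confidence=0.6)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "-->", sorted(consequent), round(confidence, 3))
```

This brute-force version recounts support on every call, so it is only meant to show how the rules are formed. On real data the supports would come from the itemsets already mined by FP-Growth.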

Solution 1
0 2022-09-13 07:47:29
