
Market basket analysis using Python for a large data set with millions of rows

Original post: 2021-02-26 14:15:08 | Tags: python, data-mining, large-data, memory-efficient, market-basket-analysis

I'm trying to do a market basket analysis on a very large dataset: about 4,800 unique products and 2-3 million rows. I'm using pyodbc to get the data from a SQL Server database.

I will eventually be left with two columns to process: invoice no and product no. The product no column has, let's say, about 4,800 unique items, and that is 3 years of data for one store. I have to run the analysis for multiple stores, around 10-12 in total, with at most 5 stores in one set of analysis.
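One way to keep memory in check during the extraction described above is to stream the two columns in batches instead of fetching everything at once. The following is a minimal sketch, not a tested pyodbc setup: it relies only on the standard DB-API `cursor.fetchmany` call, which pyodbc cursors also provide. The table name `sales`, the column names, and the helper `stream_invoice_lines` are hypothetical, and `sqlite3` is used purely as a stand-in driver so the example runs without a SQL Server instance.

```python
import sqlite3  # stand-in driver for this demo only; the post itself uses pyodbc


def stream_invoice_lines(conn, query, batch_size=50_000):
    """Yield (invoice_no, product_no) pairs one batch at a time.

    `conn` is any DB-API 2.0 connection. Only `batch_size` rows are
    held in memory at once instead of the full result set.
    """
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break  # result set exhausted
            for invoice_no, product_no in rows:
                yield invoice_no, product_no
    finally:
        cursor.close()


# Demo with an in-memory database and a hypothetical `sales` table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE sales (invoice_no INTEGER, product_no TEXT)")
demo_conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, "A"), (1, "B"), (2, "A")],
)
demo_lines = list(
    stream_invoice_lines(
        demo_conn,
        "SELECT invoice_no, product_no FROM sales ORDER BY invoice_no, product_no",
        batch_size=2,
    )
)
```

With a real pyodbc connection, the same generator could be passed the connection object and a query against the actual invoice table. Restricting the query to one store or one year at a time would shrink each pass further.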

Even if I reduce the data to 1 year, it's a lot.

Does anyone know an efficient approach for handling this much data for market basket analysis using Python?

1 Answer

You have to cleanse some of the data. I am tackling the same problem. You will encounter one major issue: say the company you work for is 7-11, with customers coming in and purchasing only 1 item. This will mess up your data. You have to group by Invoice No. and keep only the invoices whose item count is != 1. I'm still working out how to do this, but it will clear up a lot for you.
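A minimal pandas sketch of that cleansing step (dropping invoices that contain only one item) might look like the following. It is one possible reading of the idea, not the answerer's finished solution. The column names `invoice_no` and `product_no` and the helper `drop_single_item_invoices` are assumptions for illustration.

```python
import pandas as pd


def drop_single_item_invoices(df, invoice_col="invoice_no", product_col="product_no"):
    """Keep only invoices that contain at least two distinct products."""
    # Collapse repeated (invoice, product) lines so the same product
    # scanned twice does not count as two different items.
    pairs = df[[invoice_col, product_col]].drop_duplicates()
    # Count the distinct products per invoice and broadcast that count
    # back onto every row of the invoice.
    items_per_invoice = pairs.groupby(invoice_col)[product_col].transform("size")
    return pairs[items_per_invoice > 1].reset_index(drop=True)


# Small hypothetical sample: invoices 2 and 4 hold a single product each.
sample = pd.DataFrame(
    {
        "invoice_no": [1, 1, 2, 3, 3, 3, 4, 4],
        "product_no": ["A", "B", "A", "B", "C", "B", "D", "D"],
    }
)
filtered = drop_single_item_invoices(sample)
```

Single-item invoices cannot contribute to any co-occurrence rule, so removing them early reduces the row count before any basket encoding or frequent-itemset step.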