
Market basket analysis using Python for a large dataset with millions of rows

I'm trying to do a market basket analysis on a very large dataset of about 4800 unique products and 2-3 million rows. I'm using pyodbc to pull the data from a SQL Server database.
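For context, here is a minimal sketch of how the two columns might be pulled from SQL Server with pyodbc into pandas. The connection string, the table name `SalesLine`, and the column names `InvoiceNo` / `ProductNo` are placeholders, not the asker's actual schema:

```python
import pandas as pd
import pyodbc

# Placeholder connection details; substitute your own server/database/auth.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;DATABASE=my_db;Trusted_Connection=yes;"
)

# Pull only the two columns needed for basket analysis.
query = """
    SELECT InvoiceNo, ProductNo
    FROM SalesLine
    WHERE InvoiceDate >= '2021-01-01'   -- restrict to the period you need
"""
df = pd.read_sql(query, conn)

# Categorical dtypes keep memory low with only a few thousand distinct products.
df["InvoiceNo"] = df["InvoiceNo"].astype("category")
df["ProductNo"] = df["ProductNo"].astype("category")
```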

I will eventually be left with two columns, invoice no and product no, to do the processing on. The number of unique items in the product no column is about 4800, and it's 3 years of data for one store. I have to do the analysis for multiple stores, around 10-12 in total, with at most 5 stores in one set of analysis.

Even if I reduce the data to 1 year, it's a lot.

Does anyone know an efficient approach for handling this much data for market basket analysis in Python?
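This is not the asker's own solution, but one common approach at this scale is to build a sparse invoice-by-product indicator matrix and mine it with FP-Growth (which avoids Apriori's candidate-generation blow-up), for example via mlxtend. The sketch below continues the placeholder column names from the earlier snippet and assumes an mlxtend version that accepts pandas sparse DataFrames:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from mlxtend.frequent_patterns import fpgrowth, association_rules

# df carries the placeholder columns InvoiceNo / ProductNo from the sketch above.
df = df.drop_duplicates()  # one row per (invoice, product) pair

# Build a sparse invoice x product indicator matrix directly; a dense pivot of
# roughly a million invoices x 4800 products would not fit comfortably in memory.
rows = df["InvoiceNo"].cat.codes.to_numpy()
cols = df["ProductNo"].cat.codes.to_numpy()
data = np.ones(len(df), dtype=bool)
mat = csr_matrix(
    (data, (rows, cols)),
    shape=(df["InvoiceNo"].cat.categories.size,
           df["ProductNo"].cat.categories.size),
)
basket = pd.DataFrame.sparse.from_spmatrix(
    mat, columns=df["ProductNo"].cat.categories
)

# FP-Growth scales better than Apriori on wide, sparse baskets.
frequent_itemsets = fpgrowth(basket, min_support=0.001, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
print(rules.sort_values("lift", ascending=False).head())
```

With 4800 products, the support threshold matters far more than the library choice: a very low `min_support` on millions of baskets can still produce an unmanageable number of itemsets, so it is worth starting high and lowering it gradually.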

You have to cleanse some of the data. I am tackling the same problem. You will encounter one major issue: if, say, the company you work for is 7-11, many customers come in and purchase only 1 item. This will mess up your data. You have to group by the Invoice No. and drop invoices with only one item (see the sketch below); I'm still working out the details myself, but this will clear up so much for you.
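A minimal sketch of the filtering step this answer describes, assuming the transactional DataFrame `df` with the placeholder columns `InvoiceNo` and `ProductNo` from above: keep only invoices containing more than one distinct product, since single-item baskets contribute nothing to co-occurrence rules.

```python
# Keep only invoices with more than one distinct product.
df_multi = df[df.groupby("InvoiceNo")["ProductNo"].transform("nunique") > 1]

print(f"invoices before: {df['InvoiceNo'].nunique()}, "
      f"after: {df_multi['InvoiceNo'].nunique()}")
```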
