I have a pandas dataframe of about 40k entries in the following format:
invoiceNo | item
import pandas as pd
df = pd.DataFrame({'invoiceNo': ['123', '123', '124', '124'],
                   'item': ['plant', 'grass', 'hammer', 'screwdriver']})
Let's say a customer can buy several items under one single invoice number.
Is there a way for me to check what items get bought together the most?
The first thing I tried was to get all unique invoice IDs to loop through:
unique_invoice_id = df.invoiceNo.unique().tolist()
Thanks!
Without loss of generality, I'm going to use lists instead of a dataframe. You can easily extract the lists required from the dataframe if necessary.
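For reference, here is a minimal sketch of how those lists could be pulled out of the dataframe from the question. It assumes the column names `invoiceNo` and `item` shown there; the `baskets` name is only illustrative and is not used by the rest of this answer.

```python
import pandas as pd

# Sample dataframe copied from the question.
df = pd.DataFrame({'invoiceNo': ['123', '123', '124', '124'],
                   'item': ['plant', 'grass', 'hammer', 'screwdriver']})

# Two parallel lists: one invoice number and one item per row.
x = df['invoiceNo'].tolist()
y = df['item'].tolist()

# Alternative: one set of items per invoice, built directly by pandas.
# 'baskets' is a hypothetical name chosen for this sketch.
baskets = df.groupby('invoiceNo')['item'].apply(set).to_dict()
print(baskets)
```

The code that follows uses small hand-written lists `x` and `y` in place of these.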
from itertools import combinations
from collections import defaultdict

x = [1, 1, 1, 2, 2, 2, 3, 3, 3]  # invoice number
y = ['a', 'b', 'c', 'a', 'c', 'e', 'a', 'c', 'd']  # item

# Group the items by invoice number: {invoice: {items}}
z = defaultdict(set)
for i, j in zip(x, y):
    z[i].add(j)
print(z)

# For every combination of 2 or more distinct items,
# count the invoices that contain all of them.
d = defaultdict(int)
for i in range(2, len(set(y))):
    combs = combinations(set(y), i)
    for comb in combs:
        for k, v in z.items():
            if set(comb).issubset(set(v)):
                d[tuple(comb)] += 1

# Sort by count, most frequent first
list(reversed(sorted([[v, k] for k, v in d.items()])))
# [[3, ('c', 'a')],
# [1, ('d', 'c', 'a')],
# [1, ('d', 'c')],
# [1, ('d', 'a')],
# [1, ('c', 'e')],
# [1, ('c', 'a', 'e')],
# [1, ('b', 'c', 'a')],
# [1, ('b', 'c')],
# [1, ('b', 'a')],
# [1, ('a', 'e')]]
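Interpretation: 'c' and 'a' were bought together on 3 invoices, and so on. The order of items inside each tuple comes from set iteration order, so it can differ between runs.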
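The loop above tries every combination of every distinct item, which grows very quickly with the number of distinct items and may be too slow for around 40k rows. A possible alternative, sketched here rather than taken from the code above, is to enumerate combinations only within each invoice and tally them with `collections.Counter`. The function name `count_item_combinations` and the `max_size` cap are assumptions made for this sketch.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_item_combinations(invoices, items, max_size=3):
    """Count how many invoices contain each item combination of size 2..max_size.

    Hypothetical helper for illustration; max_size is an assumed cap that
    keeps the number of combinations per invoice manageable.
    """
    # Group the items by invoice number.
    baskets = defaultdict(set)
    for invoice, item in zip(invoices, items):
        baskets[invoice].add(item)

    counts = Counter()
    for basket in baskets.values():
        # Sort so the same combination always produces the same tuple key.
        ordered = sorted(basket)
        for size in range(2, min(max_size, len(ordered)) + 1):
            counts.update(combinations(ordered, size))
    return counts


x = [1, 1, 1, 2, 2, 2, 3, 3, 3]  # invoice number
y = ['a', 'b', 'c', 'a', 'c', 'e', 'a', 'c', 'd']  # item

counts = count_item_combinations(x, y)
print(counts.most_common(1))  # [(('a', 'c'), 3)]
```

Because each basket is sorted before combining, ('a', 'c') and ('c', 'a') are counted under a single key, and `most_common()` returns the combinations ordered by how many invoices contain them.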