Create all possible permutations of a column partitioned by another column in a Pandas Dataframe

I have a dataframe that looks like this:

[Image: current state]

My aim is to arrive at:

[Image: desired state]

Explanation:

  1. Every customer has made 3 orders.
  2. A customer can buy from any number of Categories in each order.
  3. Desired state: get all possible permutations of the Categories a customer purchased, by order sequence. The second picture should make this clearer.
  4. Category1 in the desired state holds the Categories purchased in the first order, Category2 those purchased in the second order, and so on.

Code I'm using:

import time
import pandas as pd

start_time = time.time()

df = pd.DataFrame()
for CustomerName in base_df.CustomerName.unique():
    df1 = base_df[(base_df['CustomerName']== CustomerName)][['CustomerName','order_seq','Category']]
    df2 = pd.DataFrame(index=pd.MultiIndex.from_product([subdf['Category'] for p, subdf in df1.groupby(['order_seq'])], names = df1.order_seq.unique())).reset_index()
    df2['CustomerName'] = CustomerName
    df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0

print("--- %s seconds ---" %(time.time() - start_time))

This takes about 10 minutes to run on my dataset, so I'm looking for a faster method.

I am working in Pandas right now, but pointers for R or SQL are also welcome! Thank you!

Consider merging three dataframes, one per OrderSequence value, each joined on CustomerName to a dataframe of distinct customers:

import pandas as pd

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
                   'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
                   'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                                'Clothes','Toys','Food','Furniture','Toys','Food','Food']})

finaldf = pd.DataFrame(df['CustomerName'].drop_duplicates())

for i in range(1, 4):
    # Rows of the i-th order, with Category renamed to Category<i>
    seqdf = (df[df['OrderSequence'] == i][['CustomerName', 'Category']]
             .rename(columns={'Category': 'Category' + str(i)}))
    finaldf = pd.merge(finaldf, seqdf, on=['CustomerName'])

print(finaldf)

#     CustomerName  Category1  Category2 Category3
# 0              1       Food       Food   Clothes
# 1              1       Food       Food      Food
# 2              1       Food       Food      Toys
# 3              1       Food    Clothes   Clothes
# 4              1       Food    Clothes      Food
# 5              1       Food    Clothes      Toys
# 6              1       Food  Furniture   Clothes
# 7              1       Food  Furniture      Food
# 8              1       Food  Furniture      Toys
# 9              2    Clothes       Toys      Food
# 10             3  Furniture       Food      Food
# 11             3       Toys       Food      Food
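The loop above hard-codes three orders through range(1, 4). Below is a minimal sketch of the same merge idea for any number of orders, assuming the column names from the sample data. The helper name expand_orders is hypothetical, not part of pandas.

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
                   'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
                   'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                                'Clothes','Toys','Food','Furniture','Toys','Food','Food']})


def expand_orders(frame):
    """Hypothetical helper: one Category<i> column per OrderSequence value."""
    # One slim dataframe per order, with Category renamed to Category<i>
    parts = [
        frame.loc[frame['OrderSequence'] == seq, ['CustomerName', 'Category']]
             .rename(columns={'Category': f'Category{seq}'})
        for seq in sorted(frame['OrderSequence'].unique())
    ]
    # Chain inner merges on CustomerName; each merge multiplies a customer's
    # rows by the number of Categories bought in that order
    return reduce(lambda left, right: left.merge(right, on='CustomerName'), parts)


result = expand_orders(df)
print(result)
```

Because the merges are inner joins, a customer missing any one order would drop out of the result. That does not happen here, since every customer has made all three orders.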

Admittedly, the above setup was first thought out in SQL using self joins, then translated to pandas:

SELECT t1.CustomerName, t2.Category AS Category1, 
       t3.Category AS Category2, t4.Category AS Category3

FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1 
INNER JOIN DataFrame AS t2 
ON t1.CustomerName = t2.CustomerName 
INNER JOIN DataFrame AS t3
ON t1.CustomerName = t3.CustomerName 
INNER JOIN DataFrame AS t4
ON t1.CustomerName = t4.CustomerName

WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3);
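To try the query without a database server, here is a rough sketch that loads the sample data into an in-memory SQLite database. It assumes the table is named DataFrame, as in the query, and that SQLite is an acceptable stand-in for whatever SQL engine you actually use.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
                   'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
                   'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                                'Clothes','Toys','Food','Furniture','Toys','Food','Food']})

query = """
SELECT t1.CustomerName, t2.Category AS Category1,
       t3.Category AS Category2, t4.Category AS Category3
FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1
INNER JOIN DataFrame AS t2 ON t1.CustomerName = t2.CustomerName
INNER JOIN DataFrame AS t3 ON t1.CustomerName = t3.CustomerName
INNER JOIN DataFrame AS t4 ON t1.CustomerName = t4.CustomerName
WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3);
"""

conn = sqlite3.connect(':memory:')
try:
    # Write the sample data to a table called DataFrame (name assumed from the query)
    df.to_sql('DataFrame', conn, index=False)
    sql_result = pd.read_sql_query(query, conn)
finally:
    conn.close()

print(sql_result)
```

The query has no ORDER BY, so the row order may differ from the pandas output even though the set of rows is the same.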

Okay, this took some work, but I got it running. Hope it helps.

import pandas as pd
import numpy as np
from functools import reduce  # reduce is not a builtin in Python 3

df = pd.DataFrame([], columns=['CustomerName','Order Sequence','Category'])

df['CustomerName'] = [1,1,1,1,1,1,1,2,2,2,3,3,3,3]
df['Order Sequence'] = [1,2,2,2,3,3,3,1,2,3,1,1,2,3]
df['Category'] = ['Food','Food','Clothes','Furniture','Clothes','Food','Toys','Clothes','Toys','Food','Furniture','Toys','Food','Food']

df2 = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])

for CN in sorted(set(df['CustomerName'])):

    df_temp = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])

    list_OS_1 = []
    list_OS_2 = []
    list_OS_3 = []

    MMC = reduce(lambda x, y: x*y,df.loc[df['CustomerName']==CN, 'Order Sequence'].value_counts().values)

    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category'])):

        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category']:

            list_OS_1.append(CTG) 

    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category'])):

        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category']:

            list_OS_2.append(CTG) 

    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category'])):

        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category']:

            list_OS_3.append(CTG) 

    df_temp['Category1'] = list_OS_1
    df_temp['Category2'] = list_OS_2
    df_temp['Category3'] = list_OS_3
    df_temp['CustomerName'] = CN

    df2 = pd.concat([df2, df_temp], axis=0)  # axis is keyword-only in pandas 2.0+

print(df2)

Output:

   CustomerName  Category1  Category2 Category3
0           1.0       Food       Food   Clothes
1           1.0       Food    Clothes      Food
2           1.0       Food  Furniture      Toys
3           1.0       Food       Food   Clothes
4           1.0       Food    Clothes      Food
5           1.0       Food  Furniture      Toys
6           1.0       Food       Food   Clothes
7           1.0       Food    Clothes      Food
8           1.0       Food  Furniture      Toys
0           2.0    Clothes       Toys      Food
0           3.0  Furniture       Food      Food
1           3.0       Toys       Food      Food

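PS: it's not dynamic, so it will break if you add or remove categories. But as long as the data follows the initial layout you gave me, it should run. Note that for customer 1 the output above cycles through the same three rows three times rather than listing all nine combinations.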
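The loop above is tied to exactly three orders and walks the three lists in step. A minimal sketch of a dynamic alternative swaps in itertools.product, which yields every way of picking one Category from each order. It assumes every customer has the same orders, numbered 1 to n; the names rows, per_order and df_all are only for illustration.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
                   'Order Sequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
                   'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                                'Clothes','Toys','Food','Furniture','Toys','Food','Food']})

rows = []
for customer, orders in df.groupby('CustomerName'):
    # One list of Categories per order; groupby sorts by Order Sequence
    per_order = [grp['Category'].tolist() for _, grp in orders.groupby('Order Sequence')]
    # product() yields every way of picking one Category from each order
    for combo in product(*per_order):
        rows.append((customer, *combo))

# Assumption: all customers share the same orders, numbered 1..n
n_orders = df['Order Sequence'].nunique()
columns = ['CustomerName'] + [f'Category{i}' for i in range(1, n_orders + 1)]
df_all = pd.DataFrame(rows, columns=columns)
print(df_all)
```

With the sample data this gives nine distinct rows for customer 1, one for customer 2 and two for customer 3.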
