
Find identical groups in Python pandas data frames

I am trying to find identical orders in my dataframe, which looks similar to this:

Order_ID | SKU | Qty
123      | A   | 1
123      | B   | 2
345      | A   | 1
345      | B   | 2
678      | A   | 1
678      | C   | 3

There can be multiple SKUs in an order, i.e., one order can span multiple rows. Orders that contain exactly the same SKUs with the same quantities are identical; here those are 123 and 345. I need the identical orders along with their SKUs and quantities.

How can I achieve this in pandas using grouping?

Sample output would be something like:

Order_ID     | SKU      | Qty      | Unique_Orders
[123], [345] | [A], [B] | [1], [2] | 2
[678]        | [A], [C] | [1], [3] | 1

Thanks for your help.
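
For reference, and to make the snippets below runnable, the sample input can be reproduced with something like this (the exact dtypes are an assumption):

import pandas as pd

# Sample data from the question: three orders, two of which (123 and 345) are identical
df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU':      ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty':      [1, 2, 1, 2, 1, 3],
})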

Update

Based on an update in the question, here is an updated answer, without any Python-level loops:

skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()

Which gives:

      SKU     Qty    Order_ID
0  (A, B)  (1, 2)  [123, 345]
1  (A, C)  (1, 3)       [678]
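
For context, the first line collapses each order into hashable tuples of its SKUs and quantities, so the second groupby can use a whole order as the key. With the sample data, the intermediate skuqty frame looks roughly like this:

print(skuqty)
#    Order_ID     SKU     Qty
# 0       123  (A, B)  (1, 2)
# 1       345  (A, B)  (1, 2)
# 2       678  (A, C)  (1, 3)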

Or, if you want to match your specifications exactly, you can further do:

z = skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()
z = z.assign(SKU=z['SKU'].apply(list)).assign(Qty=z['Qty'].apply(list)).assign(Unique_Orders=z['Order_ID'].apply(len))
z = z[['Order_ID', 'SKU', 'Qty', 'Unique_Orders']]

Which gives:

>>> z
     Order_ID     SKU     Qty  Unique_Orders
0  [123, 345]  [A, B]  [1, 2]              2
1       [678]  [A, C]  [1, 3]              1

Speed

This is relatively fast:

import numpy as np

n = 1_000_000
df = pd.DataFrame({
    'Order_ID': np.random.randint(0, 999, n),
    'SKU': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), n),
    'Qty': np.random.randint(1, 100, n),
})

%timeit proc(df)  # which is the (first) code above
# 405 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
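
Here proc is just the first snippet above wrapped in a function; a minimal sketch:

def proc(df):
    # Collapse each order into hashable tuples, then group identical orders together
    skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
    return skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()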

Original answer

It depends on what you want to do with the groups. Here is an example that sums the Qty:

df.groupby('Order_ID')['Qty'].sum()

Gives:

Order_ID
123    3
345    3
678    4
Name: Qty, dtype: int64

Or, if you want to simultaneously see the Qty total and the distinct SKUs:

>>> df.groupby('Order_ID').agg({'Qty':sum, 'SKU':'unique'})
          Qty     SKU
Order_ID             
123         3  [A, B]
345         3  [A, B]
678         4  [A, C]

Finally, here is one that gives you a dict of {SKU: Qty} for each Order_ID:

>>> df.groupby('Order_ID').apply(lambda g: dict(g[['SKU', 'Qty']].values))
Order_ID
123    {'A': 1, 'B': 2}
345    {'A': 1, 'B': 2}
678    {'A': 1, 'C': 3}

We can use groupby + unique to get the unique orders per SKU and Qty:

df.groupby(['SKU', 'Qty'])['Order_ID'].unique()

If you also want to count the number of unique orders, we can additionally use nunique:

df.groupby(['SKU', 'Qty'])['Order_ID'].agg(['unique', 'nunique'])

                  unique  nunique
SKU Qty                          
A   1    [123, 345, 678]        3
B   2         [123, 345]        2
C   3              [678]        1

Or, to get plain Python lists instead of arrays:

df.groupby(['SKU', 'Qty'])['Order_ID'].apply(list)

Another version:

x = df.groupby("Order_ID")[["SKU", "Qty"]].apply(
    lambda x: frozenset(zip(x.SKU, x.Qty))
)

df_out = pd.DataFrame(
    [
        {
            "Order_ID": v.to_list(),
            "SKU": [sku for sku, _ in k],
            "Qty": [qty for _, qty in k],
            "Unique_Orders": len(v),
        }
        for k, v in x.index.groupby(x).items()
    ]
)
print(df_out)

Prints:

     Order_ID     SKU     Qty  Unique_Orders
0  [123, 345]  [A, B]  [1, 2]              2
1       [678]  [C, A]  [3, 1]              1
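
Note that keying on a frozenset makes the comparison insensitive to the order of rows within each order, which is also why 678 comes out as [C, A] / [3, 1] rather than in the original row order.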

You don't need to use groupby in this case. Just use the duplicated() function in pandas.

df.duplicated()

This returns a boolean Series in which the first occurrence of a row is marked False and every subsequent duplicate of it is marked True (with the default keep='first').

So if you want to retrieve the duplicated IDs, just use normal pandas boolean indexing:

df.loc[df.duplicated(), 'Order_ID'].unique()

This assumes Order_ID is a column in the DataFrame and the default integer index is still in place.
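
As an illustration of how duplicated() could be combined with the per-order aggregation from the first answer to flag identical orders (this combination is an assumption, not part of the original answer):

# Sketch only: collapse each order to hashable tuples first, then flag orders
# whose (SKU, Qty) contents appear more than once
skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
dup_mask = skuqty.duplicated(subset=['SKU', 'Qty'], keep=False)
print(skuqty.loc[dup_mask, 'Order_ID'].tolist())  # [123, 345] for the sample data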
