I am trying to find identical orders in my dataframe, which looks similar to this:
Order_ID | SKU | Qty
123      | A   | 1
123      | B   | 2
345      | A   | 1
345      | B   | 2
678      | A   | 1
678      | C   | 3
There can be multiple SKUs in an order, i.e., one order can span multiple rows. Orders that contain exactly the same SKUs with the same quantities are identical; here, 123 and 345. I need the identical orders along with their SKUs and quantities.
How can I achieve that in a pandas dataframe using grouping?
Sample output would be something like:
Order_ID     | SKU      | Qty      | Unique_Orders
[123], [345] | [A], [B] | [1], [2] | 2
[678]        | [A], [C] | [1], [3] | 1
Thanks for your help.
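For reference, the sample data in the question can be reproduced with the following setup (my construction, matching the table above; the snippets below assume a frame like this):

```python
import pandas as pd

# Sample data from the question: orders 123 and 345 contain
# identical (SKU, Qty) pairs, order 678 does not.
df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty': [1, 2, 1, 2, 1, 3],
})
```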
Update
Based on an update in the question, here is an updated answer, without any Python-level loops:
skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()
Which gives:
SKU Qty Order_ID
0 (A, B) (1, 2) [123, 345]
1 (A, C) (1, 3) [678]
Or, if you want to match your specifications exactly, you can further do:
z = skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()
z = z.assign(
    SKU=z['SKU'].apply(list),
    Qty=z['Qty'].apply(list),
    Unique_Orders=z['Order_ID'].apply(len),
)
z = z[['Order_ID', 'SKU', 'Qty', 'Unique_Orders']]
Which gives:
>>> z
Order_ID SKU Qty Unique_Orders
0 [123, 345] [A, B] [1, 2] 2
1 [678] [A, C] [1, 3] 1
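As a follow-up (my addition, not part of the original answer), the identical orders the question asks for are simply the rows where Unique_Orders is greater than 1:

```python
import pandas as pd

df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty': [1, 2, 1, 2, 1, 3],
})

# Aggregate each order's SKUs and quantities into hashable tuples,
# then collect the orders that share the same (SKU, Qty) signature.
skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
z = skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()
z = z.assign(Unique_Orders=z['Order_ID'].apply(len))

# Rows with more than one order are the identical ones.
dupes = z[z['Unique_Orders'] > 1]
print(dupes)
```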
Speed
This is relatively fast (here proc wraps the two lines of the first code block above):

import numpy as np
import pandas as pd

def proc(df):
    # the (first) code above
    skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
    return skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()

n = 1_000_000
df = pd.DataFrame({
    'Order_ID': np.random.randint(0, 999, n),
    'SKU': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), n),
    'Qty': np.random.randint(1, 100, n),
})
%timeit proc(df)
# 405 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original answer
It depends what you want to do with the groups. Here is an example that sums the Qty:
df.groupby('Order_ID')['Qty'].sum()
Gives:
Order_ID
123 3
345 3
678 4
Name: Qty, dtype: int64
Or, if you want to simultaneously see the Qty total and the distinct SKUs:
>>> df.groupby('Order_ID').agg({'Qty':sum, 'SKU':'unique'})
Qty SKU
Order_ID
123 3 [A, B]
345 3 [A, B]
678 4 [A, C]
Finally, there is one that gives you a dict of {SKU: Qty} for each Order_ID:
>>> df.groupby('Order_ID').apply(lambda g: dict(g[['SKU', 'Qty']].values))
Order_ID
123 {'A': 1, 'B': 2}
345 {'A': 1, 'B': 2}
678 {'A': 1, 'C': 3}
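Those per-order mappings can also be turned into hashable signatures and compared to group identical orders. A sketch of that idea (my adaptation, not from the original answer), using a frozenset of (SKU, Qty) pairs as the key:

```python
import pandas as pd

df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty': [1, 2, 1, 2, 1, 3],
})

# One hashable signature per order: the set of its (SKU, Qty) pairs.
sig = df.groupby('Order_ID')[['SKU', 'Qty']].apply(
    lambda g: frozenset(zip(g['SKU'], g['Qty']))
)

# Orders sharing a signature are identical.
groups = sig.groupby(sig, sort=False).apply(lambda s: list(s.index))
print(groups.tolist())
```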
We can use groupby + unique to get the unique orders per SKU and Qty:
df.groupby(['SKU', 'Qty'])['Order_ID'].unique()
If you also want to count the number of unique orders, we can additionally use nunique:
df.groupby(['SKU', 'Qty'])['Order_ID'].agg(['unique', 'nunique'])
unique nunique
SKU Qty
A 1 [123, 345, 678] 3
B 2 [123, 345] 2
C 3 [678] 1
Or, to get plain Python lists instead of arrays:
df.groupby(['SKU', 'Qty'])['Order_ID'].apply(list)
Another version:
x = df.groupby("Order_ID")[["SKU", "Qty"]].apply(
lambda x: frozenset(zip(x.SKU, x.Qty))
)
df_out = pd.DataFrame(
[
{
"Order_ID": v.to_list(),
"SKU": [sku for sku, _ in k],
"Qty": [qty for _, qty in k],
"Unique_Orders": len(v),
}
for k, v in x.index.groupby(x).items()
]
)
print(df_out)
Prints:
Order_ID SKU Qty Unique_Orders
0 [123, 345] [A, B] [1, 2] 2
1 [678] [C, A] [3, 1] 1
You don't need to use groupby in this case. Just use the duplicated() function in pandas.
df.duplicated()
This will return a boolean series in which the first occurrence of each row is marked False and subsequent duplicates are marked True.
So if you want to retrieve the duplicated IDs, just use normal pandas indexing:
df['Order_ID'].loc[df.duplicated()].unique()
This assumes Order_ID is a column in the DataFrame and the default integer index is still in place.
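Note that duplicated() on the raw rows only flags rows that repeat in full, including Order_ID, so by itself it will not match orders like 123 and 345. A sketch of combining it with a per-order aggregation (my adaptation, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty': [1, 2, 1, 2, 1, 3],
})

# Collapse each order to one row of (SKU, Qty) tuples, then mark
# every member of a duplicated signature with keep=False.
sig = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
mask = sig.duplicated(subset=['SKU', 'Qty'], keep=False)
print(sig.loc[mask, 'Order_ID'].tolist())  # [123, 345]
```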