Find identical groups in pandas DataFrames (Python)
I am trying to find the identical orders in my DataFrame, which looks like this:
Order_ID |SKU |Qty |
123 | A | 1 |
123 | B | 2 |
345 | A | 1 |
345 | B | 2 |
678 | A | 1 |
678 | C | 3 |
There can be multiple SKUs in an order, i.e. one order can span multiple rows. Orders that contain exactly the same SKUs with the same quantities are identical — here, 123 and 345. I need the identical orders along with their SKUs and quantities.
How can I achieve that in a pandas DataFrame using grouping?
Sample output would be something like:
Order_ID | SKU | Qty |Unique_Orders
[123] , [345]| [A],[B] | [1],[2] |2
[678] | [A],[C] | [1],[3] |1
Thanks for your help.
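For reproducibility, the sample frame from the question can be constructed like this (values copied from the table above):

```python
import pandas as pd

# Sample data: orders 123 and 345 are identical, 678 is distinct
df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty': [1, 2, 1, 2, 1, 3],
})
print(df)
```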
Update
Based on an update in the question, here is an updated answer, without any Python-level loops:
skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()
Which gives:
SKU Qty Order_ID
0 (A, B) (1, 2) [123, 345]
1 (A, C) (1, 3) [678]
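The agg(tuple) step is what makes the second groupby possible: group keys must be hashable, and tuples are, while lists are not. A quick illustration:

```python
# groupby keys must be hashable: tuples qualify, lists do not
print(hash(('A', 'B')))  # works
try:
    hash(['A', 'B'])
except TypeError:
    print('lists are unhashable')
```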
Or, if you want to match your specifications exactly, you can further do:
z = skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()
z = z.assign(SKU=z['SKU'].apply(list)).assign(Qty=z['Qty'].apply(list)).assign(Unique_Orders=z['Order_ID'].apply(len))
z = z[['Order_ID', 'SKU', 'Qty', 'Unique_Orders']]
Which gives:
>>> z
Order_ID SKU Qty Unique_Orders
0 [123, 345] [A, B] [1, 2] 2
1 [678] [A, C] [1, 3] 1
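For the timing below, the snippets above can be wrapped into a single function; a sketch (the name proc simply matches the benchmark call, it is not part of any library):

```python
import pandas as pd

def proc(df):
    # Collapse each order to hashable tuples of its SKUs and quantities,
    # then group orders whose (SKU, Qty) signatures match exactly
    skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
    z = skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()
    z = z.assign(SKU=z['SKU'].apply(list), Qty=z['Qty'].apply(list),
                 Unique_Orders=z['Order_ID'].apply(len))
    return z[['Order_ID', 'SKU', 'Qty', 'Unique_Orders']]

df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty': [1, 2, 1, 2, 1, 3],
})
result = proc(df)
print(result)
```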
Speed
This is relatively fast:
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'Order_ID': np.random.randint(0, 999, n),
    'SKU': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), n),
    'Qty': np.random.randint(1, 100, n),
})
%timeit proc(df) # which is the (first) code above
# 405 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Original answer
It depends what you want to do with the groups. Here is an example that sums the Qty:
df.groupby('Order_ID')['Qty'].sum()
Gives:
Order_ID
123 3
345 3
678 4
Name: Qty, dtype: int64
Or, if you want to simultaneously see the Qty total and the distinct SKUs:
>>> df.groupby('Order_ID').agg({'Qty': 'sum', 'SKU': 'unique'})
Qty SKU
Order_ID
123 3 [A, B]
345 3 [A, B]
678 4 [A, C]
Finally, there is one that gives you a dict of {SKU: Qty} for each Order_ID:
>>> df.groupby('Order_ID').apply(lambda g: dict(g[['SKU', 'Qty']].values))
Order_ID
123 {'A': 1, 'B': 2}
345 {'A': 1, 'B': 2}
678 {'A': 1, 'C': 3}
We can use groupby + unique to get the unique orders per SKU and Qty:
df.groupby(['SKU', 'Qty'])['Order_ID'].unique()
If you also want to count the number of unique orders, then we can additionally use nunique:
df.groupby(['SKU', 'Qty'])['Order_ID'].agg(['unique', 'nunique'])
unique nunique
SKU Qty
A 1 [123, 345, 678] 3
B 2 [123, 345] 2
C 3 [678] 1
Or, to get plain Python lists instead of arrays:
df.groupby(['SKU', 'Qty'])['Order_ID'].apply(list)
Another version:
x = df.groupby("Order_ID")[["SKU", "Qty"]].apply(
    lambda x: frozenset(zip(x.SKU, x.Qty))
)
df_out = pd.DataFrame(
    [
        {
            "Order_ID": v.to_list(),
            "SKU": [sku for sku, _ in k],
            "Qty": [qty for _, qty in k],
            "Unique_Orders": len(v),
        }
        for k, v in x.index.groupby(x).items()
    ]
)
print(df_out)
Prints:
Order_ID SKU Qty Unique_Orders
0 [123, 345] [A, B] [1, 2] 2
1 [678] [C, A] [3, 1] 1
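One caveat with frozenset keys: the SKU/Qty order inside each group is arbitrary (note [C, A] above). If a stable order matters, a sorted tuple can serve as the grouping key instead — a sketch of the same idea:

```python
import pandas as pd

df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty': [1, 2, 1, 2, 1, 3],
})

# Sorted tuple instead of frozenset -> deterministic SKU/Qty order per group
x = df.groupby('Order_ID')[['SKU', 'Qty']].apply(
    lambda g: tuple(sorted(zip(g['SKU'], g['Qty'])))
)
df_out = pd.DataFrame(
    {
        'Order_ID': v.to_list(),
        'SKU': [sku for sku, _ in k],
        'Qty': [qty for _, qty in k],
        'Unique_Orders': len(v),
    }
    for k, v in x.index.groupby(x).items()
)
print(df_out)
```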
You don't need to use groupby in this case. Just use the duplicated() function in pandas:
df.duplicated()
This returns a boolean Series in which the first occurrence of each duplicated row is marked False and every subsequent duplicate is marked True (the default keep='first'). To retrieve the duplicated IDs, use normal pandas boolean indexing:
df['Order_ID'].loc[df.duplicated()].unique()
Assuming Order_ID is a column in the DataFrame and the default integer index still exists.
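One thing to watch: on the sample data, df.duplicated() over all columns finds nothing, because Order_ID differs between the two identical orders. Restricting the check to the item columns via the subset parameter gets closer, but it flags row-level duplicates rather than whole-order duplicates — 678 shares its A/1 row with 123, so it gets flagged too:

```python
import pandas as pd

df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty': [1, 2, 1, 2, 1, 3],
})

print(df.duplicated().any())  # False: Order_ID makes every full row unique
mask = df.duplicated(subset=['SKU', 'Qty'])
print(df.loc[mask, 'Order_ID'].unique())  # includes 678, a false positive here
```

So for matching entire orders, the per-order aggregation shown in the earlier answers is still needed.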