
Find identical groups in python data frames pandas

I am trying to find the identical orders in my dataframe, which looks similar to this:

Order_ID | SKU | Qty |
123      | A   | 1   |
123      | B   | 2   |
345      | A   | 1   |
345      | B   | 2   |
678      | A   | 1   |
678      | C   | 3   |

There can be multiple SKUs in an order, i.e. one order can have multiple rows. Orders that contain exactly the same SKUs and quantities are identical: here, 123 and 345. I need the identical orders along with their SKUs and quantities.

How can I achieve that in a pandas dataframe using grouping?

Sample output would be something like:

Order_ID     |   SKU    | Qty        |Unique_Orders
[123] , [345]| [A],[B]  | [1],[2]    |2
[678]        | [A],[C]  | [1],[3]    |1
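
For reference, the sample input above can be reconstructed with:

import pandas as pd

df = pd.DataFrame({
    'Order_ID': [123, 123, 345, 345, 678, 678],
    'SKU': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Qty': [1, 2, 1, 2, 1, 3],
})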

Thanks for your help.

Update

Based on an update in the question, here is an updated answer, without any Python-level loops:

# Collapse each order's SKUs and quantities into hashable tuples
skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
# Group together the orders that share the same (SKU, Qty) tuples
skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()

Which gives:

      SKU     Qty    Order_ID
0  (A, B)  (1, 2)  [123, 345]
1  (A, C)  (1, 3)       [678]

Or, if you want to match your specifications exactly, you can further do:

z = skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()
z = z.assign(SKU=z['SKU'].apply(list)).assign(Qty=z['Qty'].apply(list)).assign(Unique_Orders=z['Order_ID'].apply(len))
z = z[['Order_ID', 'SKU', 'Qty', 'Unique_Orders']]

Which gives:

>>> z
     Order_ID     SKU     Qty  Unique_Orders
0  [123, 345]  [A, B]  [1, 2]              2
1       [678]  [A, C]  [1, 3]              1

Speed

This is relatively fast:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    'Order_ID': np.random.randint(0, 999, n),
    'SKU': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), n),
    'Qty': np.random.randint(1, 100, n),
})

%timeit proc(df)  # which is the (first) code above
# 405 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
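
The proc function is not shown in the answer; a plausible definition, simply wrapping the first snippet above into a function, would be:

def proc(df):
    # Collapse each order's SKUs/quantities into tuples, then group orders
    # that share identical (SKU, Qty) tuples (same as the first snippet above)
    skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
    return skuqty.groupby(['SKU', 'Qty'])['Order_ID'].unique().reset_index()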

Original answer

It depends on what you want to do with the groups. Here is an example that sums the Qty:

df.groupby('Order_ID')['Qty'].sum()

Gives:

Order_ID
123    3
345    3
678    4
Name: Qty, dtype: int64

Or, if you want to simultaneously see the Qty total and the distinct SKUs:

>>> df.groupby('Order_ID').agg({'Qty':sum, 'SKU':'unique'})
          Qty     SKU
Order_ID             
123         3  [A, B]
345         3  [A, B]
678         4  [A, C]

Finally, here is one that gives you a dict of {SKU: Qty} for each Order_ID:

>>> df.groupby('Order_ID').apply(lambda g: dict(g[['SKU', 'Qty']].values))
Order_ID
123    {'A': 1, 'B': 2}
345    {'A': 1, 'B': 2}
678    {'A': 1, 'C': 3}
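
These per-order dicts can also be used to answer the original question: equal dicts mean identical orders, so grouping on a hashable version of them collapses identical orders together. A sketch (not part of the original answer):

per_order = df.groupby('Order_ID').apply(lambda g: frozenset(zip(g['SKU'], g['Qty'])))
# Orders mapping to the same frozenset of (SKU, Qty) pairs are identical
identical = per_order.reset_index(name='items').groupby('items')['Order_ID'].unique()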

We can use groupby + unique to get the unique orders per SKU and Qty:

df.groupby(['SKU', 'Qty'])['Order_ID'].unique()

If you also want to count the number of unique orders, we can additionally use nunique:

df.groupby(['SKU', 'Qty'])['Order_ID'].agg(['unique', 'nunique'])

                  unique  nunique
SKU Qty                          
A   1    [123, 345, 678]        3
B   2         [123, 345]        2
C   3              [678]        1

Or, to get the orders as plain lists instead:

df.groupby(['SKU', 'Qty'])['Order_ID'].apply(list)

Another version:

x = df.groupby("Order_ID")[["SKU", "Qty"]].apply(
    lambda x: frozenset(zip(x.SKU, x.Qty))
)

df_out = pd.DataFrame(
    [
        {
            "Order_ID": v.to_list(),
            "SKU": [sku for sku, _ in k],
            "Qty": [qty for _, qty in k],
            "Unique_Orders": len(v),
        }
        for k, v in x.index.groupby(x).items()
    ]
)
print(df_out)

Prints:

     Order_ID     SKU     Qty  Unique_Orders
0  [123, 345]  [A, B]  [1, 2]              2
1       [678]  [C, A]  [3, 1]              1
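
Note that frozenset is unordered, which is why order 678 comes out as [C, A] / [3, 1] here. If a consistent ordering matters, a sorted tuple could be used as the key instead (a small variation, not part of the original answer):

x = df.groupby("Order_ID")[["SKU", "Qty"]].apply(
    lambda g: tuple(sorted(zip(g.SKU, g.Qty)))  # sort the (SKU, Qty) pairs for a stable order
)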

You don't need to use groupby in this case. Just use the duplicated() function in pandas.

df.duplicated()

This will return a boolean Series in which the first occurrence of each duplicated row is marked False and the subsequent identical rows are marked True (the default keep='first' behaviour).

So if you want to retrieve the duplicated IDs, just apply the usual pandas filtering:

df['Order_ID'].loc[df.duplicated()].unique()

This assumes Order_ID is a column in the DataFrame and the default index still exists.
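
As written, duplicated() compares individual rows; one way to apply it to whole orders instead is to run it on the per-order aggregation (a sketch building on the aggregation shown earlier, not part of the original answer):

# Collapse each order into one row of (SKU, Qty) tuples, then flag all
# orders whose SKU/Qty combination appears more than once
skuqty = df.groupby('Order_ID')[['SKU', 'Qty']].agg(tuple).reset_index()
identical = skuqty[skuqty.duplicated(subset=['SKU', 'Qty'], keep=False)]
print(identical['Order_ID'].tolist())  # [123, 345] for the sample data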
