Remove duplicates in a pandas DataFrame based on the values of two columns

Question

I have a dataframe of customers and their items, which looks like this:

Customer ID     Item
     1         Banana
     1         Apple
     2         Orange
     3         Grape
     4         Banana
     4         Apple
     5         Orange
     5         Grape
     6         Orange

What I want to do is remove every customer whose items duplicate those of an earlier customer, so the result should look like this:

Customer ID     Item
     1         Banana
     1         Apple
     2         Orange
     3         Grape
     5         Orange
     5         Grape

Customer 4 is removed because it has the same items as customer 1, and customer 6 because it has the same items as customer 2.

Thanks in advance for your help!

Answer 1

Not sure if this is what you mean. But if you mean duplicates based on the items, you can collect the items for each customer as a frozenset (if the items are unique within a customer) or as a tuple (if they are not), and then apply drop_duplicates. The index of the result holds the customer IDs to keep, so you can then filter the original data frame on the customer ID.

df[df["Customer ID"].isin(df.groupby("Customer ID").Item.apply(frozenset).drop_duplicates().index)]

Or, if items are not unique within a customer and order doesn't matter:

df[df["Customer ID"].isin(df.groupby("Customer ID")
                            .Item.apply(lambda x: tuple(sorted(x)))
                            .drop_duplicates().index)]
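For anyone who wants to try both variants end to end, here is a minimal sketch that rebuilds the data frame from the question. The helper name drop_duplicate_customers and the extra customer 7 (who holds Orange twice) are hypothetical additions for illustration only; they are not part of the original question or answer.

```python
import pandas as pd

# Data frame from the question.
df = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 4, 4, 5, 5, 6],
    "Item": ["Banana", "Apple", "Orange", "Grape", "Banana",
             "Apple", "Orange", "Grape", "Orange"],
})


def drop_duplicate_customers(frame, key):
    """Keep only the first customer for each distinct item collection.

    `key` turns one customer's items into a hashable value
    (hypothetical helper, not from the original answer).
    """
    keys = frame.groupby("Customer ID")["Item"].apply(key)
    keep_ids = keys.drop_duplicates().index  # IDs of the customers to keep
    return frame[frame["Customer ID"].isin(keep_ids)]


# Variant 1: frozenset ignores both order and repeated items.
result = drop_duplicate_customers(df, frozenset)
print(result)

# Variant 2: a sorted tuple ignores order but keeps repeated items.
# Hypothetical customer 7 holds "Orange" twice to show the difference.
extra = pd.DataFrame({"Customer ID": [7, 7], "Item": ["Orange", "Orange"]})
df2 = pd.concat([df, extra], ignore_index=True)

by_set = drop_duplicate_customers(df2, frozenset)
by_tuple = drop_duplicate_customers(df2, lambda x: tuple(sorted(x)))
print(by_set["Customer ID"].unique())
print(by_tuple["Customer ID"].unique())
```

With the frozenset key, customer 7 counts as a duplicate of customer 2, because a set collapses the repeated Orange. With the sorted tuple key, customer 7 is kept, since ("Orange", "Orange") differs from ("Orange",).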

Answer 1, accepted (3), 2017-03-29 14:30:10
