
Remove duplicates in a pandas DataFrame based on values of two columns

I have a DataFrame of customers and their items, which looks like this:

Customer ID     Item
     1         Banana
     1         Apple
     2         Orange
     3         Grape
     4         Banana
     4         Apple
     5         Orange
     5         Grape
     6         Orange

I want to remove every customer whose items duplicate those of an earlier customer, so the result should look like this:

Customer ID     Item
     1         Banana
     1         Apple
     2         Orange
     3         Grape
     5         Orange
     5         Grape

Customer 4 is dropped because it has the same items as customer 1, and customer 6 because it has the same items as customer 2.

Thanks in advance for your help!

I'm not sure this is what you mean, but if you want duplicates based on the items, you can collect the items for each customer as a frozenset (if the items are unique within a customer) or as a tuple (if they are not), apply drop_duplicates, and then filter the original data frame on the customer IDs that remain.

df[df["Customer ID"].isin(df.groupby("Customer ID").Item.apply(frozenset).drop_duplicates().index)]
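For anyone who wants to try this, here is a minimal sketch that rebuilds the data frame from the question and applies the same frozenset filter. The variable names `item_sets`, `kept_ids` and `result` are only illustrative, and the intermediate steps are split out for readability.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 4, 4, 5, 5, 6],
    "Item": ["Banana", "Apple", "Orange", "Grape", "Banana",
             "Apple", "Orange", "Grape", "Orange"],
})

# One frozenset of items per customer, indexed by Customer ID.
item_sets = df.groupby("Customer ID")["Item"].apply(frozenset)

# Keep the first customer for each distinct set of items.
kept_ids = item_sets.drop_duplicates().index

# Filter the original rows down to the customers that were kept.
result = df[df["Customer ID"].isin(kept_ids)]
print(result)
```

Because drop_duplicates keeps the first occurrence by default, the customer with the lowest ID in each group of identical item sets is the one that survives.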


Or, if items are not unique and order doesn't matter:

df[df["Customer ID"].isin(df.groupby("Customer ID")
                            .Item.apply(lambda x: tuple(sorted(x)))
                            .drop_duplicates().index)]
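To see where the two variants differ, here is a small sketch on made-up data (not from the question) in which one customer has a repeated item. A frozenset discards the repeat, so it treats all three customers as identical, while a sorted tuple keeps the counts.

```python
import pandas as pd

# Hypothetical data: customers 1 and 3 hold Apple twice, customer 2 only once.
df2 = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Item": ["Apple", "Banana", "Apple", "Banana", "Apple",
             "Apple", "Apple", "Banana"],
})

grouped = df2.groupby("Customer ID")["Item"]

# frozenset ignores repeats: every customer collapses to {Apple, Banana}.
ids_by_set = grouped.apply(frozenset).drop_duplicates().index.tolist()

# A sorted tuple keeps repeats but ignores order: customer 2 stays distinct.
ids_by_tuple = (grouped.apply(lambda x: tuple(sorted(x)))
                       .drop_duplicates().index.tolist())

print(ids_by_set)
print(ids_by_tuple)

result2 = df2[df2["Customer ID"].isin(ids_by_tuple)]
```

Which one is right depends on whether "same items" should take quantities into account.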
