Remove duplicates based on the combination of two columns in Pandas

Question

I need to delete duplicate rows based on the combination of two string columns (person1 and person2). For example, person1: ryan with person2: delta and person1: delta with person2: ryan are the same pair and have the same value in the messages column. I need to drop one of these two rows and also keep the rows that are not duplicated.

Code to recreate df:

import pandas as pd

df = pd.DataFrame({"": [0, 1, 2, 3, 4, 5, 6],
                   "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
                   "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
                   "messages": [1, 1, 2, 3, 3, 9, 9]})

 df
        person1 person2 messages
0   0   ryan    delta   1
1   1   delta   ryan    1
2   2   delta   alpha   2
3   3   delta   bravo   3
4   4   bravo   delta   3
5   5   alpha   ryan    9
6   6   ryan    alpha   9

The expected df is:

 finaldf
        person1 person2 messages
0   0   ryan    delta   1
1   2   delta   alpha   2
2   3   delta   bravo   3
3   5   alpha   ryan    9

Answer 1

Try the following:

res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
       .reset_index(drop=True))

print(res)

     person1 person2  messages
0  0    ryan   delta         1
1  2   delta   alpha         2
2  3   delta   bravo         3
3  5   alpha    ryan         9

Explanation

First, we use df.filter to select just the columns whose names contain person.
For these columns only, we use df.apply to turn each row (axis=1) into a frozenset, which ignores the order of its elements. At this stage, we are looking at a pd.Series like this:

0     (ryan, delta)
1     (ryan, delta)
2    (alpha, delta)
3    (bravo, delta)
4    (bravo, delta)
5     (alpha, ryan)
6     (alpha, ryan)
dtype: object

Next, Series.duplicated flags every row whose frozenset has already appeared in an earlier row. Prefixing the resulting boolean series with ~ inverts it, so indexing df with it keeps only the first occurrence of each pair.
Finally, we reset the index with df.reset_index.
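
To make the steps above easier to inspect, here is a minimal sketch that splits the one-liner into named intermediate results. The variable names people, keys and is_repeat are my own and do not appear in the answer; the data is the df from the question.

```python
import pandas as pd

df = pd.DataFrame({"": [0, 1, 2, 3, 4, 5, 6],
                   "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
                   "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
                   "messages": [1, 1, 2, 3, 3, 9, 9]})

# Step 1: keep only the columns whose names contain "person"
people = df.filter(like="person")

# Step 2: one order-insensitive key per row
keys = people.apply(frozenset, axis=1)

# Step 3: True for every row whose key was already seen in an earlier row
is_repeat = keys.duplicated()

# Step 4: keep the first occurrence of each pair and renumber the index
res = df[~is_repeat].reset_index(drop=True)
print(res)
```

Printing is_repeat on its own is a quick way to check which rows are about to be dropped before filtering.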

Answer 2

Here's a less general approach than the one given by @ouroboros1; it only works for your two-column case.

# Build one string key per row: the alphabetically smaller name, an underscore, then the larger name
sorted_p1p2 = df[['person1', 'person2']].min(axis=1) + '_' + df[['person1', 'person2']].max(axis=1)

# Keep only the first row for each key
dedup_df = df[~sorted_p1p2.duplicated()]
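
One caveat with a concatenated string key is that names containing the separator can collide: "a_b" paired with "c" and "a" paired with "b_c" both yield "a_b_c". As a possible alternative (this is a swapped-in technique, not the answerer's method), the sketch below sorts the values within each row using NumPy and checks for duplicate rows in the sorted copy. The names cols and sorted_keys are my own; the approach also extends to more than two columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"": [0, 1, 2, 3, 4, 5, 6],
                   "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
                   "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
                   "messages": [1, 1, 2, 3, 3, 9, 9]})

cols = ["person1", "person2"]

# Sort the names inside each row so that (ryan, delta) and (delta, ryan) become identical
sorted_keys = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)

# Filter the original frame with the mask; the original column values are left untouched
dedup_df = df[~sorted_keys.duplicated()]
print(dedup_df)
```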

Answer 3

You can put the two person columns in a consistent order within each row, then drop duplicates.

import pandas as pd

df = pd.DataFrame({"": [0,1,2,3,4,5,6],
                     "person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
                     "person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
                     "messages": [1, 1, 2, 3,3,9,9]})

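print(df)

# Swap the two names wherever person1 sorts before person2,
# so that afterwards person1 never sorts before person2
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values

# Rows describing the same pair are now identical in both columns
df = df.drop_duplicates(subset=['person1', 'person2'])

print(df)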

After the swap:

     person1 person2  messages
0  0    ryan   delta         1
1  1    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
4  4   delta   bravo         3
5  5    ryan   alpha         9
6  6    ryan   alpha         9

After dropping duplicates:

     person1 person2  messages
0  0    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
5  5    ryan   alpha         9
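
Note that the swap rewrites person1 and person2 in place, so the last surviving row reads ryan/alpha rather than the alpha/ryan shown in the question's expected output. If the original values matter, one option is to do the swap on a copy of the two columns and use it only to build the mask. This is a minimal sketch under that assumption; keys and result are names I introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({"": [0, 1, 2, 3, 4, 5, 6],
                   "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
                   "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
                   "messages": [1, 1, 2, 3, 3, 9, 9]})

# Work on a copy of the key columns so df itself is never modified
keys = df[["person1", "person2"]].copy()

# Same swap as above, applied to the copy only
swap = keys["person1"] < keys["person2"]
keys.loc[swap, ["person1", "person2"]] = keys.loc[swap, ["person2", "person1"]].to_numpy()

# Use the normalised copy to decide which rows to keep, then filter the original
result = df[~keys.duplicated()].reset_index(drop=True)
print(result)
```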
