
Remove duplicates based on combination of two columns in Pandas

I need to delete duplicated rows based on the combination of two string columns, person1 and person2. For example, person1: ryan with person2: delta and person1: delta with person2: ryan are the same pair and carry the same value in the messages column, so one of these two rows should be dropped. Rows that have no duplicate should be returned as well.

Code to recreate df:

import pandas as pd

df = pd.DataFrame({"": [0,1,2,3,4,5,6],
                     "person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
                     "person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
                     "messages": [1, 1, 2, 3,3,9,9]})
df
     person1 person2  messages
0  0    ryan   delta         1
1  1   delta    ryan         1
2  2   delta   alpha         2
3  3   delta   bravo         3
4  4   bravo   delta         3
5  5   alpha    ryan         9
6  6    ryan   alpha         9

The expected result is:

 finaldf
     person1 person2  messages
0  0    ryan   delta         1
1  2   delta   alpha         2
2  3   delta   bravo         3
3  5   alpha    ryan         9

Try as follows:

res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
       .reset_index(drop=True))

print(res)

     person1 person2  messages
0  0    ryan   delta         1
1  2   delta   alpha         2
2  3   delta   bravo         3
3  5   alpha    ryan         9

Explanation

  • First, we use df.filter to select just the columns whose names contain "person".
  • For these columns only, we use df.apply to turn each row (axis=1) into a frozenset, so the order of the two names no longer matters. At this stage we are looking at a pd.Series like this:
0     (ryan, delta)
1     (ryan, delta)
2    (alpha, delta)
3    (bravo, delta)
4    (bravo, delta)
5     (alpha, ryan)
6     (alpha, ryan)
dtype: object
  • Next, we call Series.duplicated, which marks every row whose frozenset has already appeared in an earlier row. Prefixing the resulting boolean series with ~ inverts it, so indexing the original df with it keeps only the first occurrence of each pair.
  • Finally, we reset the index with df.reset_index.
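
The steps above can also be written out one at a time. The following is only a sketch of the same one-liner split into intermediate variables; the names people, pair_keys and is_repeat are made up for illustration and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "": [0, 1, 2, 3, 4, 5, 6],
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

# Step 1: keep only the columns whose name contains "person".
people = df.filter(like="person")

# Step 2: build one frozenset per row, so the order of the two names is ignored.
pair_keys = people.apply(frozenset, axis=1)

# Step 3: True for every row whose pair already appeared in an earlier row.
is_repeat = pair_keys.duplicated()

# Step 4: keep the first occurrence of each pair and renumber the index.
res = df[~is_repeat].reset_index(drop=True)
print(res)
```

One thing to keep in mind: a frozenset also ignores repeated values, so a row where person1 equals person2 collapses to a one-element set. With only two columns that is harmless, but with more columns rows such as (a, a, b) and (a, b, b) would be treated as the same combination.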

Here's a less general approach than the one given by @ouroboros1. It only works for your two-column case.

# Build one string key per row: the smaller of person1/person2, then '_', then the larger
sorted_p1p2 = df[['person1','person2']].min(axis=1)+'_'+df[['person1','person2']].max(axis=1)

# Keep only the rows whose key has not appeared earlier in the Series
dedup_df = df[~sorted_p1p2.duplicated()]
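
A joined string key can, in principle, mix up different pairs if a name itself contains the separator ("a_b" with "c" and "a" with "b_c" both give "a_b_c"). As a possible variation, not taken from the answer above, the values can be sorted within each row with NumPy and the duplicate check run on the sorted pairs directly. The helper name drop_unordered_duplicates below is hypothetical.

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(frame, cols):
    """Hypothetical helper: drop rows whose values in `cols` repeat an
    earlier row's values, regardless of the order of those values."""
    # Sort the values inside each row so (a, b) and (b, a) give the same key.
    sorted_vals = np.sort(frame[cols].to_numpy(dtype=object), axis=1)
    # Reuse the original index so the boolean mask lines up with `frame`.
    keys = pd.DataFrame(sorted_vals, index=frame.index)
    return frame[~keys.duplicated()]


df = pd.DataFrame({
    "": [0, 1, 2, 3, 4, 5, 6],
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

dedup_df = drop_unordered_duplicates(df, ["person1", "person2"])
print(dedup_df)
```

Because the key is a row of separate values rather than one concatenated string, no separator is needed, and the same helper accepts more than two columns.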

You can put the two person columns in a consistent order within each row, then drop duplicates.

import pandas as pd

df = pd.DataFrame({"": [0,1,2,3,4,5,6],
                     "person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
                     "person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
                     "messages": [1, 1, 2, 3,3,9,9]})

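print(df)

# Where person1 sorts before person2, swap the two names so that every row
# ends up with the alphabetically larger name in person1.
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values
print(df)  # after the swap

# With a consistent order in every row, an ordinary duplicate drop is enough.
df = df.drop_duplicates(subset=['person1', 'person2'])

print(df)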

After the swap:

     person1 person2  messages
0  0    ryan   delta         1
1  1    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
4  4   delta   bravo         3
5  5    ryan   alpha         9
6  6    ryan   alpha         9

After dropping duplicates:

     person1 person2  messages
0  0    ryan   delta         1
2  2   delta   alpha         2
3  3   delta   bravo         3
5  5    ryan   alpha         9
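
Note that the swap overwrites the names in df, so row 5 comes back as ryan/alpha, whereas the expected result in the question lists it as alpha/ryan. If the original order matters, one option is to do the swap on a copy and use it only to decide which rows to keep. This is a sketch of that variation, not part of the answer above, and the name keys is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "": [0, 1, 2, 3, 4, 5, 6],
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

# Work on a copy of the two name columns so df itself is left untouched.
keys = df[["person1", "person2"]].copy()

# Same swap as above, applied to the copy only.
swap = keys["person1"] < keys["person2"]
keys.loc[swap, ["person1", "person2"]] = keys.loc[swap, ["person2", "person1"]].values

# Use the normalized copy only to find repeats, then filter the original rows.
res = df[~keys.duplicated()]
print(res)
```

The kept rows are the same as before (index 0, 2, 3 and 5), but person1 and person2 keep the values they had in the input.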
