Remove duplicates based on the combination of two columns in pandas
I need to delete duplicate rows based on the combination of two string columns, person1 and person2. For example, person1: ryan with person2: delta and person1: delta with person2: ryan are the same pair and have the same value in the messages column, so one of these two rows should be dropped. The non-duplicated rows should be returned as well.
Code to recreate df
import pandas as pd

df = pd.DataFrame({"": [0, 1, 2, 3, 4, 5, 6],
                   "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
                   "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
                   "messages": [1, 1, 2, 3, 3, 9, 9]})
df
person1 person2 messages
0 0 ryan delta 1
1 1 delta ryan 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 bravo delta 3
5 5 alpha ryan 9
6 6 ryan alpha 9
The expected result is:
finaldf
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
Try as follows:
res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
.reset_index(drop=True))
print(res)
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
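To see what each part of that expression does, it can be split into named steps. This is only a sketch of the same technique; the intermediate names people, pairs and is_repeat are made up for illustration and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "": [0, 1, 2, 3, 4, 5, 6],
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

# Keep only the columns whose names contain "person".
people = df.filter(like="person")

# Turn each row into a frozenset, so (ryan, delta) and (delta, ryan) compare equal.
pairs = people.apply(frozenset, axis=1)

# True for every row whose pair has already appeared further up.
is_repeat = pairs.duplicated()

# Keep the first occurrence of each pair and renumber the index.
res = df[~is_repeat].reset_index(drop=True)
print(res)
```

Because a frozenset ignores order, this also works unchanged if more person columns are added, as long as "the same set of people" is the intended notion of a duplicate.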
Explanation
First, we use df.filter to select just the columns whose names contain person.
For these columns only, we use df.apply to turn each row (axis=1) into a frozenset. So, at this stage, we are looking at a pd.Series like this:
0 (ryan, delta)
1 (ryan, delta)
2 (alpha, delta)
3 (bravo, delta)
4 (bravo, delta)
5 (alpha, ryan)
6 (alpha, ryan)
dtype: object
Now we flag the duplicate rows with Series.duplicated and add ~ as a prefix to the resulting boolean series to select the inverse from the original df.
Finally, we reset the index with df.reset_index.

Here's a less general approach than the one given by @ouroboros1; it only works for your two-column case:
#make a Series of strings of min of p1/p2 concat to max of p1/p2
sorted_p1p2 = df[['person1','person2']].min(axis=1)+'_'+df[['person1','person2']].max(axis=1)
#subset to non-dup from the Series
dedup_df = df[~sorted_p1p2.duplicated()]
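One caveat with joining the names on '_': if a name can itself contain an underscore, two different pairs may produce the same key (for example "a_b" with "c" and "a" with "b_c" both give "a_b_c"). A possible variant, sketched below, sorts the two values within each row using NumPy and checks for duplicates on the sorted pair instead of on a concatenated string. It assumes both columns hold plain strings with no missing values; the names cols, sorted_pairs and is_repeat are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "": [0, 1, 2, 3, 4, 5, 6],
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

cols = ["person1", "person2"]

# Sort the values inside each row, so both orderings of a pair look identical.
sorted_pairs = np.sort(df[cols].to_numpy(dtype=object), axis=1)

# Flag rows whose sorted pair has already been seen; reuse df's index so the mask aligns.
is_repeat = pd.DataFrame(sorted_pairs, index=df.index).duplicated()

# Subset the original frame; its values and index are left untouched.
dedup_df = df[~is_repeat]
print(dedup_df)
```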
You can put the two person columns in a consistent order within each row, then drop duplicates.
import pandas as pd
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
print(df)
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values
df = df.drop_duplicates(subset=['person1', 'person2'])
print(df)
After the swap:
person1 person2 messages
0 0 ryan delta 1
1 1 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 delta bravo 3
5 5 ryan alpha 9
6 6 ryan alpha 9
After dropping duplicates:
person1 person2 messages
0 0 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
5 5 ryan alpha 9
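Note that the swap overwrites person1 and person2 in df, so row 5 comes back as ryan/alpha rather than the original alpha/ryan shown in the question's expected output. If the original values should be preserved, one option is to do the swap on a copy and use it only to build the duplicate mask. This is a sketch of that idea, not part of the answer above; the name ordered is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "": [0, 1, 2, 3, 4, 5, 6],
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

# Work on a copy of the two person columns so df itself is never modified.
ordered = df[["person1", "person2"]].copy()

# Same swap rule as above: put the alphabetically larger name first.
swap = ordered["person1"] < ordered["person2"]
ordered.loc[swap, ["person1", "person2"]] = ordered.loc[swap, ["person2", "person1"]].values

# Use the ordered copy only to decide which rows repeat an earlier pair.
result = df[~ordered.duplicated()]
print(result)
```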