Remove reverse duplicates from dataframe
I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would consider (0,50) and (50,0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a dataframe?
import pandas as pd
# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
A B
0 0 50
1 10 22
2 11 35
3 21 5
4 22 10
5 35 11
6 5 21
7 50 0
# Desired output with "duplicates" removed.
data2 = pd.DataFrame({'A': [0, 5, 10, 11],
                      'B': [50, 21, 22, 35]})
data2
A B
0 0 50
1 5 21
2 10 22
3 11 35
Ideally, the output would be sorted by the values of column A.
You can sort each row of the data frame before dropping the duplicates:
data.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
# A B
#0 0 50
#1 10 22
#2 11 35
#3 5 21
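Note that on recent pandas versions, `apply` with a function that returns a plain list may produce a Series of (unhashable) lists, making `drop_duplicates` raise a TypeError. A version-robust variant, as a sketch, is to return a `pd.Series` per row so `apply` rebuilds a DataFrame:

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Returning a pd.Series (rather than a list) per row makes apply rebuild
# a DataFrame, so drop_duplicates keeps working across pandas versions.
result = (data.apply(lambda r: pd.Series(sorted(r), index=data.columns), axis=1)
              .drop_duplicates())
```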
If you prefer the result to be sorted by column A:
data.apply(lambda r: sorted(r), axis = 1).drop_duplicates().sort_values('A')
# A B
#0 0 50
#3 5 21
#1 10 22
#2 11 35
Here is a slightly uglier, but much faster solution (it requires `import numpy as np`):
In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
A B
0 0 50
1 10 22
2 11 35
3 5 21
Timing, for an 8K-row DataFrame:
In [50]: big = pd.concat([data] * 10**3, ignore_index=True)
In [51]: big.shape
Out[51]: (8000, 2)
In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop
In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop
In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
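Putting the fast `np.sort` approach together with the sort by column A requested in the question, a self-contained sketch (sample data taken from the question):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Sort each row's values with np.sort so (50, 0) becomes (0, 50),
# then duplicate rows collapse under drop_duplicates.
dedup = pd.DataFrame(np.sort(data.values, axis=1),
                     columns=data.columns).drop_duplicates()

# Sort by column A for the desired output order.
result = dedup.sort_values('A').reset_index(drop=True)
```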
df.T.apply(sorted).T.drop_duplicates()
This solution also works:
data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()
More columns can be added as needed, e.g.
data.set_index(['A','B', 'C']).stack().drop_duplicates().unstack().reset_index()
Here is a somewhat lengthy solution, but it might be helpful for beginners.
Creating new columns that sort the values of columns A and B within each row:
data['C'] = np.where(data['A']<data['B'] , data['A'], data['B'])
data['D'] = np.where(data['A']>data['B'] , data['A'], data['B'])
Removing duplicates, sorting by column 'C' as requested in the question, and renaming the columns:
data2 = data[['C', 'D']].drop_duplicates().sort_values('C')
data2.columns = ['A', 'B']
data2
PS - the "np.where" function works like the IF formula in Excel: (logical condition, value if TRUE, value if FALSE).
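For what it's worth, the same min/max idea can be written with `np.minimum`/`np.maximum` instead of two `np.where` calls; a sketch using the question's sample data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Element-wise min goes to A and element-wise max to B, mirroring the
# two np.where comparisons above; then dedupe and sort as before.
data2 = (pd.DataFrame({'A': np.minimum(data['A'], data['B']),
                       'B': np.maximum(data['A'], data['B'])})
           .drop_duplicates()
           .sort_values('A')
           .reset_index(drop=True))
```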
Another classical option is to aggregate the values as a frozenset and to use boolean indexing:
out = data[~data[['A', 'B']].agg(frozenset, axis=1).duplicated()]
Output:
A B
0 0 50
1 10 22
2 11 35
3 21 5
It's also fairly efficient, although not as much as the very optimized np.sort approach:
%timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
27.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
733 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit big.apply(np.sort, axis = 1).drop_duplicates()
12 s ± 403 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit big[~big[['A', 'B']].agg(frozenset, axis=1).duplicated()]
25 ms ± 657 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
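As a sanity check, the frozenset mask and the np.sort approach should keep the same set of unordered pairs, even though the frozenset version preserves the original (unsorted) row values while np.sort rewrites them. A quick self-contained comparison:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

by_frozenset = data[~data[['A', 'B']].agg(frozenset, axis=1).duplicated()]
by_sort = pd.DataFrame(np.sort(data.values, axis=1),
                       columns=data.columns).drop_duplicates()

# Compare as sets of unordered pairs: both keep one row per pair.
pairs_fs = {frozenset(t) for t in by_frozenset.itertuples(index=False)}
pairs_np = {frozenset(t) for t in by_sort.itertuples(index=False)}
```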