Removing reversed duplicates from a dataframe

Question

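I have a dataframe with two columns, A and B. The order of A and B does not matter here; for example, I consider (0, 50) and (50, 0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a dataframe?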

import pandas as pd

# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50], 
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
    A   B
0   0  50
1  10  22
2  11  35
3  21   5
4  22  10
5  35  11
6   5  21
7  50   0

# Desired output with "duplicates" removed. 
data2 = pd.DataFrame({'A': [0, 5, 10, 11], 
                      'B': [50, 21, 22, 35]})
data2
    A   B
0   0  50
1   5  21
2  10  22
3  11  35

Ideally, the output would be sorted by the values in column A.

Answer 1

You can sort each row of the dataframe before dropping the duplicates:

data.apply(lambda r: sorted(r), axis = 1).drop_duplicates()

#   A    B
#0  0   50
#1  10  22
#2  11  35
#3  5   21

If you want the result sorted by column A:

data.apply(lambda r: sorted(r), axis = 1).drop_duplicates().sort_values('A')

#   A    B
#0  0   50
#3  5   21
#1  10  22
#2  11  35
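A note for readers on newer pandas: as far as I can tell, when the function passed to apply returns a plain list, recent versions give back a Series of lists rather than a dataframe, and drop_duplicates then cannot hash the lists. A minimal sketch of one way around that, assuming a pandas version that supports the result_type argument of DataFrame.apply (the names sorted_rows and deduped are only illustrative):

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# result_type='broadcast' keeps the original shape, index and column labels,
# so each row's sorted values are written back into columns A and B.
sorted_rows = data.apply(sorted, axis=1, result_type='broadcast')

# Rows that were mirror images of each other are now identical.
deduped = sorted_rows.drop_duplicates().sort_values('A')
print(deduped)
```

This is still a row-by-row Python loop under the hood, so it is not expected to be fast on large frames.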

Answer 2

Here is a slightly uglier but much faster solution (np is NumPy, i.e. import numpy as np):

In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
    A   B
0   0  50
1  10  22
2  11  35
3   5  21

Timing, for an 8K-row DF:

In [50]: big = pd.concat([data] * 10**3, ignore_index=True)

In [51]: big.shape
Out[51]: (8000, 2)

In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop

In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop

In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
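For anyone who wants to run the np.sort approach outside an IPython session, here is a self-contained sketch. It adds sort_values and reset_index at the end so the result matches the desired output from the question; those two steps and the names sorted_values and result are my additions, not part of the answer above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Sort the two values within each row (axis=1) on the underlying NumPy array,
# so (50, 0) becomes (0, 50) and matches its mirrored row.
sorted_values = np.sort(data.to_numpy(), axis=1)

result = (pd.DataFrame(sorted_values, columns=data.columns)
            .drop_duplicates()          # keep the first occurrence of each pair
            .sort_values('A')           # order by column A, as the question asks
            .reset_index(drop=True))    # renumber rows 0..n-1
print(result)
```

The speed-up in the timings comes from doing the sort once on the whole array instead of calling a Python function per row.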

Answer 3

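Transpose, sort each column, transpose back, then drop the duplicates (df here stands for the question's data):

df.T.apply(sorted).T.drop_duplicates()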

Answer 4

This solution works now:

data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()

More columns can be added as needed. For example:

data.set_index(['A','B', 'C']).stack().drop_duplicates().unstack().reset_index()

Answer 5

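This is a somewhat verbose solution, but it may be helpful for beginners.

Create new columns that hold the values of columns A and B sorted within each row: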

data['C'] = np.where(data['A']<data['B'] , data['A'], data['B'])
data['D'] = np.where(data['A']>data['B'] , data['A'], data['B'])

Drop the duplicates as the question requires, sort by column 'C', and rename the columns:

data2 = data[['C', 'D']].drop_duplicates().sort_values('C')
data2.columns = ['A', 'B']   
data2

PS: the np.where function works like the IF formula in Excel (logical condition, value if TRUE, value if FALSE).
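To make the IF analogy concrete, here is a small sketch on a made-up three-row frame (pairs, smaller and larger are hypothetical names for illustration). It also checks that np.minimum and np.maximum give the same element-wise result, which may be a shorter way to write the two np.where lines.

```python
import numpy as np
import pandas as pd

# Toy data: a mirrored pair plus a row where both values are equal.
pairs = pd.DataFrame({'A': [21, 5, 7], 'B': [5, 21, 7]})

# np.where(condition, value_if_true, value_if_false), applied element by element.
smaller = np.where(pairs['A'] < pairs['B'], pairs['A'], pairs['B'])
larger = np.where(pairs['A'] > pairs['B'], pairs['A'], pairs['B'])

print(smaller.tolist())  # [5, 5, 7]
print(larger.tolist())   # [21, 21, 7]

# Same result via the element-wise minimum and maximum.
a, b = pairs['A'].to_numpy(), pairs['B'].to_numpy()
print(np.array_equal(smaller, np.minimum(a, b)))  # True
print(np.array_equal(larger, np.maximum(a, b)))   # True
```

When A equals B, both conditions are False and both columns take the value from B, which is the same number, so ties are handled correctly.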

Answer 6

Another classic option is to aggregate the values into a frozenset and use boolean indexing:

out = data[~data[['A', 'B']].agg(frozenset, axis=1).duplicated()]

Output (the kept rows retain their original values and order):

    A   B
0   0  50
1  10  22
2  11  35
3  21   5

It is also fairly efficient, though not as fast as the highly optimized np.sort approach:

%timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
27.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
733 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit big.apply(np.sort, axis = 1).drop_duplicates()
12 s ± 403 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit big[~big[['A', 'B']].agg(frozenset, axis=1).duplicated()]
25 ms ± 657 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
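As a sanity check alongside the timings, the sketch below compares the frozenset mask with a mask built from the np.sort approach on the question's data. The variable names (dup_frozenset, sorted_df, dup_sorted, kept) are mine, and this only illustrates that the two approaches flag the same rows for two columns; it is not a benchmark.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# frozenset approach: True where the unordered pair was already seen.
dup_frozenset = data[['A', 'B']].agg(frozenset, axis=1).duplicated()

# np.sort approach: sort within each row, then look for repeated rows.
sorted_df = pd.DataFrame(np.sort(data.to_numpy(), axis=1),
                         columns=data.columns, index=data.index)
dup_sorted = sorted_df.duplicated()

# Both masks should flag the same rows.
print((dup_frozenset == dup_sorted).all())

# Boolean indexing on the original frame keeps the rows as first written,
# e.g. (21, 5) stays (21, 5) rather than becoming (5, 21).
kept = data[~dup_sorted]
print(kept)
```

Using a mask on the original frame is handy when there are other columns you want to carry along untouched.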

6 answers

Answer 1
12 Accepted 2016-11-07 21:22:43

Answer 2
10 2016-11-07 21:30:42

Answer 3
1 2021-07-01 20:17:44

Answer 4
0 2020-06-09 08:12:29

Answer 5
0 2021-07-25 20:38:09

Answer 6
0 2022-12-08 10:14:46
