Remove reversed duplicates

Question

My dataframe looks like this:

import pandas as pd

df_in = pd.DataFrame(data={'mol1': ['cpd1', 'cpd2', 'cpd3'], 'mol2': ['cpd2', 'cpd1', 'cpd4'], 'sim': [0.8, 0.8, 0.9]})

print(df_in)

   mol1  mol2  sim
0  cpd1  cpd2  0.8
1  cpd2  cpd1  0.8
2  cpd3  cpd4  0.9

The pair (cpd1, cpd2) occurs twice, although its two elements sit in opposite columns in the second row.

I would like to get rid of these duplicates and end up with this:

df_out = pd.DataFrame(data={'mol1':['cpd1', 'cpd3'], 'mol2': ['cpd2', 'cpd4'], 'sim': [0.8,0.9]})

print(df_out)

   mol1  mol2  sim
0  cpd1  cpd2  0.8
1  cpd3  cpd4  0.9

If I ignore the third column, there is a solution described in "Pythonic way of removing reversed duplicates in list", but I have to preserve this column.

Answer 1

You can use sorted with apply on the columns listed in cols, and then drop_duplicates:

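df = df_in.copy()  # start from the frame in the question
cols = ['mol1','mol2']
# Sort the two names within each row so reversed pairs become identical.
# In recent pandas, result_type='expand' is needed to get two columns back.
df[cols] = df[cols].apply(sorted, axis=1, result_type='expand').to_numpy()
df = df.drop_duplicates()
print(df)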
   mol1  mol2  sim
0  cpd1  cpd2  0.8
2  cpd3  cpd4  0.9
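
If the original order of names within each row should be left untouched, one possible variant is to build an order-independent key per row and filter on it. This is only a sketch: the names pair_key and result are made up for illustration, and it compares the two name columns only (sim is ignored when deciding what counts as a duplicate).

```python
import pandas as pd

df = pd.DataFrame({'mol1': ['cpd1', 'cpd2', 'cpd3'],
                   'mol2': ['cpd2', 'cpd1', 'cpd4'],
                   'sim': [0.8, 0.8, 0.9]})
cols = ['mol1', 'mol2']

# One hashable, order-independent key per row:
# (cpd1, cpd2) and (cpd2, cpd1) both become frozenset({'cpd1', 'cpd2'}).
pair_key = df[cols].apply(frozenset, axis=1)

# Keep the first row seen for each key; the columns themselves are not modified.
result = df[~pair_key.duplicated()]
print(result)
```

Because nothing is written back to df, the surviving rows keep whatever column order they had in the input.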

Similar solution with numpy.sort:

import numpy as np

cols = ['mol1','mol2']
# Sort the values within each row, then drop rows that are now identical.
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates()
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.8
2  cpd3  cpd4  0.9
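
The numpy.sort approach could be wrapped in a small helper that works on a copy, so the caller's frame is not modified. The function name drop_reversed_duplicates and its pairs_only flag are hypothetical, introduced here only for illustration.

```python
import numpy as np
import pandas as pd

def drop_reversed_duplicates(frame, cols, pairs_only=False):
    """Return a copy of frame without rows whose cols values repeat in any order."""
    out = frame.copy()
    # Sort the values inside each row so reversed pairs become identical.
    out[cols] = np.sort(out[cols].to_numpy(), axis=1)
    # Compare only the pair columns, or every column, depending on the flag.
    return out.drop_duplicates(subset=cols if pairs_only else None)

df_in = pd.DataFrame({'mol1': ['cpd1', 'cpd2', 'cpd3'],
                      'mol2': ['cpd2', 'cpd1', 'cpd4'],
                      'sim': [0.8, 0.8, 0.9]})
df_out = drop_reversed_duplicates(df_in, ['mol1', 'mol2'])
print(df_out)
```

With pairs_only=False a reversed pair is dropped only when the remaining columns (here sim) also match, which mirrors the plain drop_duplicates() call above.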

If you need to check for duplicates only in cols, add the subset parameter:

df = pd.DataFrame(
    {'mol1': ['cpd1', 'cpd2', 'cpd3'],
     'mol2': ['cpd2', 'cpd1', 'cpd4'],
     'sim': [0.7, 0.8, 0.9]})
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.7
1  cpd2  cpd1  0.8
2  cpd3  cpd4  0.9

cols = ['mol1','mol2']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates(subset=cols)
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.7
2  cpd3  cpd4  0.9
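
With subset, drop_duplicates keeps the first row of each pair by default, so the 0.8 value above is discarded. If a particular row should survive instead, for example the one with the highest sim, one option is to sort before dropping. This is a sketch under that assumption; the name best is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mol1': ['cpd1', 'cpd2', 'cpd3'],
                   'mol2': ['cpd2', 'cpd1', 'cpd4'],
                   'sim': [0.7, 0.8, 0.9]})
cols = ['mol1', 'mol2']
df[cols] = np.sort(df[cols].to_numpy(), axis=1)

# Order rows by sim, highest first, so keep='first' retains the best row per pair,
# then restore the original row order.
best = (df.sort_values('sim', ascending=False)
          .drop_duplicates(subset=cols, keep='first')
          .sort_index())
print(best)
```

Passing keep='last' to drop_duplicates without sorting would instead keep the last occurrence of each pair in the original order.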

Answer 1 (3, accepted, 2017-06-09 11:06:51)
