Remove reversed duplicates from a data frame
Can anyone suggest a good solution to remove reversed duplicates from a data frame?
My data looks like this, where the first and second columns are reversed duplicates of each other:
TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.049999999999999e-104 377
TRINITY_DN16813_c0_g1_i4 TRINITY_DN16813_c0_g1_i3 96.104 231 9 0 190 420 429 199 2.979999999999999e-104 377
I need to keep only the one row where the third column has the higher value:
TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.049999999999999e-104 377
These are the results when I use series.isin():
TRINITY_DN28139_c0_g1_i2 TRINITY_DN28139_c0_g1_i5 99.971 3465 1 0 1 3465 1 3465 0.0 6394
TRINITY_DN28139_c0_g1_i5 TRINITY_DN28139_c0_g1_i2 99.971 3465 1 0 1 3465 1 3465 0.0 6394
TRINITY_DN25313_c0_g1_i6 TRINITY_DN25313_c0_g1_i5 99.97 3315 1 0 1 3315 1 3315 0.0 6117
TRINITY_DN25313_c0_g1_i5 TRINITY_DN25313_c0_g1_i6 99.97 3315 1 0 1 3315 1 3315 0.0 6117
TRINITY_DN25502_c0_g1_i3 TRINITY_DN25502_c0_g1_i4 99.96799999999999 3078 1 0 1 3078 1 3078 0.0 5679
TRINITY_DN25502_c0_g1_i4 TRINITY_DN25502_c0_g1_i3 99.96799999999999 3078 1 0 1 3078 1 3078 0.0 5679
TRINITY_DN28726_c0_g1_i2 TRINITY_DN28726_c0_g1_i1 99.96600000000001 5805 2 0 1 5805 1 5805 0.0 10709
TRINITY_DN28726_c0_g1_i1 TRINITY_DN28726_c0_g1_i2 99.96600000000001 5805 2 0 1 5805 1 5805 0.0 10709
TRINITY_DN27942_c0_g1_i7 TRINITY_DN27942_c0_g1_i6 99.964 2760 1 0 1 2760 1 2760 0.0 5092
TRINITY_DN25118_c0_g1_i1 TRINITY_DN25118_c0_g1_i2 99.964 2770 1 0 81 2850 204 2973 0.0 5110
TRINITY_DN27942_c0_g1_i6 TRINITY_DN27942_c0_g1_i7 99.964 2760 1 0 1 2760 1 2760 0.0 5092
TRINITY_DN25118_c0_g1_i2 TRINITY_DN25118_c0_g1_i1 99.964 2770 1 0 204 2973 81 2850 0.0 5110
TRINITY_DN28502_c1_g1_i9 TRINITY_DN28502_c1_g1_i7 99.963 2678 1 0 1928 4605 2021 4698 0.0 4940
TRINITY_DN28502_c1_g1_i7 TRINITY_DN28502_c1_g1_i9 99.963 2678 1 0 2021 4698 1928 4605 0.0 4940
TRINITY_DN25619_c0_g1_i1 TRINITY_DN25619_c0_g1_i8 99.963 2715 1 0 1 2715 1 2715 0.0 5009
TRINITY_DN25619_c0_g1_i8 TRINITY_DN25619_c0_g1_i1 99.963 2715 1 0 1 2715 1 2715 0.0 5009
TRINITY_DN23022_c0_g1_i5 TRINITY_DN23022_c0_g1_i1 99.962 2622 1 0 1 2622 1 2622 0.0 4837
Use series.isin() to find the same entries in both columns, then drop the duplicates:
df = df.sort_values('col3', ascending=False)
df.loc[df['col1'].isin(df['col2']).drop_duplicates().index]
Where col1 is the first column and col2 is the second.
Output:
0 TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.49 228 8 0 202 429 417 190 0.00 377
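As a minimal, self-contained sketch of those two lines (toy frame; col1/col2/col3 are the hypothetical column names used here):

```python
import pandas as pd

# Two rows that are reversed duplicates; col3 holds the percent-identity score
df = pd.DataFrame({
    'col1': ['i3', 'i4'],
    'col2': ['i4', 'i3'],
    'col3': [96.491, 96.104],
})

# Sort so the higher col3 value comes first, then keep the first row of each
# True/False run produced by the isin() mask
df = df.sort_values('col3', ascending=False)
result = df.loc[df['col1'].isin(df['col2']).drop_duplicates().index]
```

With a single reversed pair this keeps only the higher-scoring row.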
Try this one. It's completely in pandas (should be faster). This also corrects bugs in my previous answer, but the concept of taking the labels as a pair remains the same.
In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)
Get only the max value per duplicated pair:
In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]
If you need the names to be in separate columns:
In [398]: dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
In [399]: dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])
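Put together on toy data (hypothetical labels; integer column names 0, 1, 2 as in the answer):

```python
import pandas as pd

# Toy frame: two reversed-duplicate pairs, score in column 2
df = pd.DataFrame([
    ['i3', 'i4', 96.491],
    ['i4', 'i3', 96.104],
    ['i5', 'i6', 99.970],
    ['i6', 'i5', 99.900],
])

# Order-independent key: '{a}-{b}' with a and b sorted, so (a, b) and (b, a) match
df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

# Keep only the row with the max score (column 2) within each pair
dfd = df.loc[df.groupby('pair')[2].idxmax()].copy()

# Split the key back into separate name columns (same effect as the transform above)
dfd[0] = dfd['pair'].str.split('-').str[0]
dfd[1] = dfd['pair'].str.split('-').str[1]
```

One higher-scoring row survives per pair; the `.copy()` just avoids a chained-assignment warning when the name columns are rewritten.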
The problem is that the labels in column 0 and column 1 must be taken as a pair, so an isin alone would not work.
First, a list of label pairs is needed to compare against (forward in the code). Given that (a,b) is the same as (b,a), all instances will just be replaced by (a,b).
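As a tiny illustration of that canonicalisation (not the answer's exact forward code):

```python
# (b, a) and (a, b) both reduce to the same sorted pair, so reversed
# duplicates get identical keys
pair1 = tuple(sorted(('b', 'a')))
pair2 = tuple(sorted(('a', 'b')))
```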
Then all labels that are duplicated are renamed in the order a,b even if the higher-scoring row is b,a. This is necessary for the grouping step later.
In [293]: df['pair'] = df[[0, 1]].apply(l, axis=1)  # l: the helper that returns the sorted (a, b) pair
Then, to account for the value of column 2 (third column from the left), the original data is grouped and the min of each group is kept. These will be the rows to be removed.
In [297]: dfi = df.set_index(['pair',2])
In [298]: to_drop = df.groupby([0,1])[2].min().reset_index().set_index([0,1,2]).index
In [299]: dfi['drop'] = dfi.index.isin(to_drop)
In [300]: dfr = dfi.reset_index()
Rows are dropped by index number wherever the 'drop' column is True. The temporary 'drop' column is also removed.
In [301]: df_dropped = dfr.drop(np.where(dfr['drop'])[0], axis=0).drop('drop', axis=1)
In [302]: df_dropped
Out[302]:
0 1 2 3 4 5 6 7 8 9 10 11
0 TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.050000e-104 377
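The same keep-the-max idea can be sketched more compactly (toy labels; a hypothetical simplification using groupby().idxmin() instead of the index manipulation above):

```python
import pandas as pd

# Two rows that are reversed duplicates of each other
df = pd.DataFrame([
    ['i3', 'i4', 96.491],
    ['i4', 'i3', 96.104],
])

# Canonical pair key so (a, b) and (b, a) fall into the same group
df['pair'] = df[[0, 1]].apply(lambda x: tuple(sorted((x[0], x[1]))), axis=1)

# Within each pair, drop the row holding the minimum of column 2
to_drop = df.groupby('pair')[2].idxmin()
df_dropped = df.drop(to_drop)
```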