
Remove reversed duplicates from a data frame

Can anyone suggest a good solution to remove reversed duplicates from a data frame?

My data looks like this, where the first and second columns are reversed duplicates of each other.

TRINITY_DN16813_c0_g1_i3    TRINITY_DN16813_c0_g1_i4    96.491  228 8   0   202 429 417 190 3.049999999999999e-104  377
TRINITY_DN16813_c0_g1_i4    TRINITY_DN16813_c0_g1_i3    96.104  231 9   0   190 420 429 199 2.979999999999999e-104  377

I need to keep only the one row whose third column has the higher value:

TRINITY_DN16813_c0_g1_i3    TRINITY_DN16813_c0_g1_i4    96.491  228 8   0   202 429 417 190 3.049999999999999e-104  377

These are the results when I use series.isin():

TRINITY_DN28139_c0_g1_i2    TRINITY_DN28139_c0_g1_i5    99.971  3465    1   0   1   3465    1   3465    0.0 6394
TRINITY_DN28139_c0_g1_i5    TRINITY_DN28139_c0_g1_i2    99.971  3465    1   0   1   3465    1   3465    0.0 6394
TRINITY_DN25313_c0_g1_i6    TRINITY_DN25313_c0_g1_i5    99.97   3315    1   0   1   3315    1   3315    0.0 6117
TRINITY_DN25313_c0_g1_i5    TRINITY_DN25313_c0_g1_i6    99.97   3315    1   0   1   3315    1   3315    0.0 6117
TRINITY_DN25502_c0_g1_i3    TRINITY_DN25502_c0_g1_i4    99.96799999999999   3078    1   0   1   3078    1   3078    0.0 5679
TRINITY_DN25502_c0_g1_i4    TRINITY_DN25502_c0_g1_i3    99.96799999999999   3078    1   0   1   3078    1   3078    0.0 5679
TRINITY_DN28726_c0_g1_i2    TRINITY_DN28726_c0_g1_i1    99.96600000000001   5805    2   0   1   5805    1   5805    0.0 10709
TRINITY_DN28726_c0_g1_i1    TRINITY_DN28726_c0_g1_i2    99.96600000000001   5805    2   0   1   5805    1   5805    0.0 10709
TRINITY_DN27942_c0_g1_i7    TRINITY_DN27942_c0_g1_i6    99.964  2760    1   0   1   2760    1   2760    0.0 5092
TRINITY_DN25118_c0_g1_i1    TRINITY_DN25118_c0_g1_i2    99.964  2770    1   0   81  2850    204 2973    0.0 5110
TRINITY_DN27942_c0_g1_i6    TRINITY_DN27942_c0_g1_i7    99.964  2760    1   0   1   2760    1   2760    0.0 5092
TRINITY_DN25118_c0_g1_i2    TRINITY_DN25118_c0_g1_i1    99.964  2770    1   0   204 2973    81  2850    0.0 5110
TRINITY_DN28502_c1_g1_i9    TRINITY_DN28502_c1_g1_i7    99.963  2678    1   0   1928    4605    2021    4698    0.0 4940
TRINITY_DN28502_c1_g1_i7    TRINITY_DN28502_c1_g1_i9    99.963  2678    1   0   2021    4698    1928    4605    0.0 4940
TRINITY_DN25619_c0_g1_i1    TRINITY_DN25619_c0_g1_i8    99.963  2715    1   0   1   2715    1   2715    0.0 5009
TRINITY_DN25619_c0_g1_i8    TRINITY_DN25619_c0_g1_i1    99.963  2715    1   0   1   2715    1   2715    0.0 5009
TRINITY_DN23022_c0_g1_i5    TRINITY_DN23022_c0_g1_i1    99.962  2622    1   0   1   2622    1   2622    0.0 4837

Use series.isin() to find the same entries in both columns and drop the duplicates:

# sort so the row with the larger col3 comes first
df = df.sort_values('col3', ascending=False)
# find the same entries in both columns and drop the duplicates
df.loc[df['col1'].isin(df['col2']).drop_duplicates().index]

Here col1 is the first column and col2 the second.

Output:

0   TRINITY_DN16813_c0_g1_i3    TRINITY_DN16813_c0_g1_i4    96.49   228 8   0   202 429 417 190 0.00    377
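As a self-contained sketch of the same approach: the file name blast_hits.tsv and the generated col1 … col12 labels below are assumptions, chosen only so that the col1/col2/col3 names used above exist (the data looks like header-less, tab-separated BLAST output).

import pandas as pd

# assumed file name and column labels; the input is taken to be tab-separated with no header
df = pd.read_csv('blast_hits.tsv', sep='\t', header=None,
                 names=['col' + str(i) for i in range(1, 13)])

# sort so the row with the larger col3 comes first, then apply the isin filter from above
df = df.sort_values('col3', ascending=False)
result = df.loc[df['col1'].isin(df['col2']).drop_duplicates().index]
print(result)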

Try this one. It's completely in pandas (so it should be faster). It also corrects bugs in my previous answer, but the concept of taking the labels as a pair remains the same.

In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

Get only the max values per duplicated result:

In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]

If you need the names to be in separate columns:

In [398]: dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
In [399]: dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])
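Put together as a minimal, runnable sketch (using the two sample rows from the question as stand-in data, trimmed to the columns that matter; the integer column labels 0, 1 and 2 match the answer above):

import pandas as pd

# the two reversed-duplicate rows from the question, first three columns only
df = pd.DataFrame([
    ['TRINITY_DN16813_c0_g1_i3', 'TRINITY_DN16813_c0_g1_i4', 96.491],
    ['TRINITY_DN16813_c0_g1_i4', 'TRINITY_DN16813_c0_g1_i3', 96.104],
])

# order-independent key: (a, b) and (b, a) map to the same 'a-b' string
df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

# within each pair, keep the row with the highest value in column 2
dfd = df.loc[df.groupby('pair')[2].idxmax()].copy()

# split the pair key back into two name columns if needed
dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])
print(dfd)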

The problem is that the labels in column 0 and column 1 must be taken as a pair, so an isin alone would not work.

First, a list of label pairs is needed to compare against (forward in the code). Given that (a,b) is the same as (b,a), all instances will just be replaced by (a,b).

Then all labels that are duplicated are renamed in the order a,b, even if the higher row is b,a. This is necessary for the grouping step later.
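The helper l used in In [293] below is not included in the excerpt; a minimal reconstruction consistent with this description (return the two labels as a sorted pair) could be:

# assumed reconstruction of the missing pair-building helper `l`:
# it returns the two labels in sorted order so that (a, b) and (b, a) collapse onto the same key
l = lambda x: tuple(sorted((x[0], x[1])))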

In [293]: df['pair'] = df[[0, 1]].apply(l, axis=1)

Then, to account for the value of column 2 (the third column from the left), the original data is grouped and the minimum of each group is kept. These will be the rows to be removed.

In [297]: dfi = df.set_index(['pair',2])

In [298]: to_drop = df.groupby([0,1])[2].min().reset_index().set_index([0,1,2]).index

In [299]: dfi['drop'] = dfi.index.isin(to_drop)

In [300]: dfr = dfi.reset_index()

Rows are dropped by index number where the 'drop' column is True (np here is numpy, imported as import numpy as np). The temporary 'drop' column is also removed.

In [301]: df_dropped = dfr.drop(np.where(dfr['drop'])[0], axis=0).drop('drop', axis=1)

In [302]: df_dropped
Out[302]:
                         0                         1       2    3   4   5    6    7    8    9              10   11
0  TRINITY_DN16813_c0_g1_i3  TRINITY_DN16813_c0_g1_i4  96.491  228   8   0  202  429  417  190  3.050000e-104  377
