
Python pandas: how to drop duplicates selectively

I need to look at every row in column 'B' and, if the value is non-empty, go to the corresponding value in column 'C' and drop every other row that duplicates that value in 'C', while keeping the row itself. I came across drop_duplicates, but I could not find a way to look only for duplicates of one particular row rather than all duplicates in a column. I can't use drop_duplicates on the whole column because I want to keep duplicates in 'C' when all of them correspond to empty values in 'B'.

So the possible scenarios are: if you find a non-empty value in 'B', go to the same index in 'C', find all duplicates of that ONE value, and drop them. These duplicates may correspond to empty OR non-empty values in 'B'. If you find an empty value in 'B', skip to the next index. This way, rows with an empty 'B' can still be removed indirectly, because they duplicate a 'C' value belonging to a row with a non-empty 'B'.

Edited With Sample Data:

Preprocessed:

import pandas as pd

df1 = pd.DataFrame([['','CCCH'], ['CHC','CCCH'], ['CCHCC','CNHCC'], ['','CCCH'], ['CNHCC','CNOCH'], ['','NCH'], ['','NCH']], columns=['B', 'C'])

df1

    B     C  
0         CCCH
1   CHC   CCCH
2   CCHCC CNHCC
3         CCCH
4   CNHCC CNOCH
5         NCH
6         NCH

Post-processed, with the correct duplicates dropped:

df2 = pd.DataFrame([['CHC','CCCH'], ['CCHCC','CNHCC'], ['CNHCC','CNOCH'], ['','NCH'], ['','NCH']], columns=['B', 'C'], index=[1, 2, 4, 5, 6])

df2

    B     C
1   CHC   CCCH
2   CCHCC CNHCC
4   CNHCC CNOCH
5         NCH
6         NCH

Above, the only rows removed are rows 0 and 3, because they duplicate row 1 in column 'C' and row 1 has a non-empty 'B' value. Rows 5 and 6 are kept even though they duplicate each other in column 'C', because neither has a non-empty 'B' value. Rows 2 and 4 are kept because they have no duplicates in column 'C'.

So the logic is to go through each row of column 'B'. If the value is empty, move down a row and continue. If it is not empty, go to the corresponding value in column 'C' and drop all other rows that duplicate that 'C' value, keeping the current row, then continue to the next row until this logic has been applied to every value in column 'B'.

Column B value empty --> Look at next value in Column B

| or if not empty |

Column B not empty --> Column C --> Drop all duplicates of that index of Column C while keeping the current index --> Look at next value in Column B
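
The rule above can be written out literally as a loop. This is only a minimal sketch, assuming that the index is unique and that "empty" means the empty string ''. The helper name drop_selective_duplicates is hypothetical, and iterrows will be slow on large frames.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['', 'CCCH'], ['CHC', 'CCCH'], ['CCHCC', 'CNHCC'], ['', 'CCCH'],
     ['CNHCC', 'CNOCH'], ['', 'NCH'], ['', 'NCH']],
    columns=['B', 'C'],
)

def drop_selective_duplicates(df):
    """Hypothetical helper: apply the row-by-row rule described above."""
    dropped = set()
    for idx, row in df.iterrows():
        # Skip rows that are already marked for removal or have an empty 'B'.
        if idx in dropped or row['B'] == '':
            continue
        # Labels of all rows sharing this row's 'C' value.
        same_c = df.index[(df['C'] == row['C']).to_numpy()]
        # Mark every such row except the current one for removal.
        dropped.update(i for i in same_c if i != idx)
    return df.drop(index=list(dropped))

result = drop_selective_duplicates(df1)
print(result)
```

On the sample data this removes rows 0 and 3 and leaves the original index labels of the remaining rows unchanged.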

Say you group your DataFrame by the 'C' column and check each group for a non-empty entry in the 'B' column:

  • If there is no such entry, return the entire group

  • Otherwise, return only the group's rows with a non-empty 'B', with duplicate 'B' values dropped

In code:

def remove_duplicates(g):
    # Keep an all-empty 'B' group whole; otherwise keep its unique non-empty 'B' rows.
    no_b = (g.B == '').all()
    return g if no_b else g[g.B != ''].drop_duplicates(subset='B')

>>> df1.groupby(df1.C).apply(remove_duplicates)['B'].reset_index()[['B', 'C']]
       B      C
0    CHC   CCCH
1  CCHCC  CNHCC
2  CNHCC  CNOCH
3           NCH
4           NCH
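
If the original index labels (1, 2, 4, 5, 6) need to survive, as in the expected output in the question, a boolean mask is one possible alternative to groupby/apply. The sketch below assumes that "empty" means the empty string '' and that, within a group of equal 'C' values, only the first row with a non-empty 'B' should be kept. That follows the row-by-row rule in the question, but it differs from remove_duplicates above when a group holds several distinct non-empty 'B' values. The variable names are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['', 'CCCH'], ['CHC', 'CCCH'], ['CCHCC', 'CNHCC'], ['', 'CCCH'],
     ['CNHCC', 'CNOCH'], ['', 'NCH'], ['', 'NCH']],
    columns=['B', 'C'],
)

# True where 'B' holds a value.
non_empty = df1['B'] != ''

# True for every row whose 'C' group contains at least one non-empty 'B'.
group_has_value = non_empty.groupby(df1['C']).transform('any')

# True for the first non-empty-'B' row of each 'C' value.
# Rows with an empty 'B' are masked to NaN first so they cannot count as "first".
first_non_empty = non_empty & ~df1['C'].where(non_empty).duplicated()

# Keep that first row, plus every row of groups that have no non-empty 'B' at all.
result = df1[first_non_empty | ~group_has_value]
print(result)
```

Because this only filters rows, the result keeps the original row order and index labels.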
