
How to drop Pandas DataFrame rows with condition to keep specific column value

I know similar questions have been asked before, but they didn't quite seem to help with my issue so I decided to ask a new question.

What I have are three separate DataFrames - let's call them a , b , and c - that are merged into one large dataframe. In each of these three DataFrames, there may be duplicate pairs of column values that I want to drop, but the condition is that if the pair belongs to DataFrame c , then I want to keep that pair. For example:

>>> a.head()
    unit    value    target
 0   3       23       'a'
 1   2       24       'd'
 2   8       56       'e'
 3   9       89       'p'
 4   0       32       'q'

>>> b.head()
    unit    value    target
 0   3       34       'a'
 1   2       36       'd'
 2   8       23       'a'
 3   9       89       'p'
 4   0       48       'm'

>>> c.head()
    unit    value    target
 0   3       34       'a'
 1   5       23       'a'
 2   2       48       'm'
 3   9       56       'e'
 4   0       98       'z'

The columns I'm checking for duplicates are ( value , target ). As you can tell, there are four different duplicate scenarios: ( a , b ), ( b , c ), ( a , c ), ( a , b , c ). In the above example, the ( value , target ) pairs that occur for each scenario are ( 89 , 'p' ), ( 34 , 'a' ), ( 56 , 'e' ), and ( 23 , 'a' ), respectively.
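To make the scenarios concrete, here is a minimal sketch that tags every row with the frame it came from and lists, for each duplicated ( value , target ) pair, which frames contain it. The `source` column and the variable names are hypothetical helpers, and the frames are rebuilt from the `head()` output above on the assumption that `target` holds plain strings such as `a` rather than strings with embedded quotes.

```python
import pandas as pd

# Sample frames rebuilt from the head() output shown above (assumed to be the full data).
a = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [23, 24, 56, 89, 32],
                  "target": ["a", "d", "e", "p", "q"]})
b = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [34, 36, 23, 89, 48],
                  "target": ["a", "d", "a", "p", "m"]})
c = pd.DataFrame({"unit": [3, 5, 2, 9, 0],
                  "value": [34, 23, 48, 56, 98],
                  "target": ["a", "a", "m", "e", "z"]})

# Tag each row with its frame of origin ("source" is a hypothetical helper column).
tagged = pd.concat(
    [a.assign(source="a"), b.assign(source="b"), c.assign(source="c")],
    ignore_index=True,
)

# For every (value, target) pair, collect the frames it appears in, e.g. "abc".
sources = tagged.groupby(["value", "target"])["source"].agg(
    lambda s: "".join(sorted(set(s)))
)

# Keep only the pairs that occur in more than one frame.
duplicated_pairs = sources[sources.str.len() > 1]
print(duplicated_pairs)
```

On these sample rows the listing also reports ( 48 , 'm' ) as a second ( b , c ) duplicate, alongside the four pairs named above.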

If the duplicate occurs in ( a , b ) it's not a huge problem because I can just simply choose from one of them, but if it occurs in any of the other three scenarios, I want to choose the pair from c and discard the duplicates from a and/or b .

The original idea that I had was to use the following code:

>>> df = pd.concat([a, b, c], axis=0)
>>> df.drop_duplicates(subset=['value', 'target'], keep='last', inplace=True)

Since c is appended at the end of the concatenated DataFrame df , keep='last' guarantees that its row is the one retained whenever it is part of a duplicate. The drawback is that an ( a , b ) duplicate always resolves to the row from b . Does anyone know of a way to pick one of the two at random when the duplicate is only between a and b , while still always choosing c when c is involved?

Thanks in advance.

We can shuffle the rows of a and b with sample before combining them with c :

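import pandas as pd
# Shuffle the rows of a and b so that keep='last' chooses between them at random
a_b = pd.concat([a, b]).sample(frac=1)
# c is appended last, so its rows always win over duplicates from a or b
new = pd.concat([a_b, c]).drop_duplicates(['value', 'target'], keep='last')
new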
Out[11]: 
   unit  value target
1     2     24    'd'
4     0     32    'q'
3     9     89    'p'
1     2     36    'd'
0     3     34    'a'
1     5     23    'a'
2     2     48    'm'
3     9     56    'e'
4     0     98    'z'
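For anyone who wants to run this end to end, below is a self-contained sketch of the same shuffle-then-keep-last idea. The function name `combine_prefer_last` and its parameters are hypothetical, not part of pandas; `random_state` is passed to `sample` only to make the shuffle repeatable, and the index is reset because the concatenated result otherwise carries repeated labels, as in the output above. The sample frames assume `target` holds plain strings.

```python
import pandas as pd


def combine_prefer_last(low_priority, preferred, subset, random_state=None):
    """Hypothetical helper: drop duplicates on `subset`, always keeping rows
    from `preferred` and picking at random among the `low_priority` frames."""
    # Shuffle the low-priority rows so the "last" occurrence among them is random.
    shuffled = pd.concat(low_priority).sample(frac=1, random_state=random_state)
    # Append the preferred frame last so keep="last" always retains its rows.
    combined = pd.concat([shuffled, preferred])
    return combined.drop_duplicates(subset=subset, keep="last").reset_index(drop=True)


# Sample frames rebuilt from the head() output in the question.
a = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [23, 24, 56, 89, 32],
                  "target": ["a", "d", "e", "p", "q"]})
b = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [34, 36, 23, 89, 48],
                  "target": ["a", "d", "a", "p", "m"]})
c = pd.DataFrame({"unit": [3, 5, 2, 9, 0],
                  "value": [34, 23, 48, 56, 98],
                  "target": ["a", "a", "m", "e", "z"]})

result = combine_prefer_last([a, b], c, subset=["value", "target"], random_state=0)
print(result)
```

The row order of the result depends on the shuffle, but the set of ( value , target ) pairs does not: every row of c survives, and only the ( a , b ) duplicate is decided by chance.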
