I know similar questions have been asked before, but they didn't quite seem to help with my issue so I decided to ask a new question.
What I have are three separate DataFrames - let's call them a, b, and c - that are merged into one large DataFrame. Across these three DataFrames, there may be duplicate pairs of column values that I want to drop, with one condition: if a pair appears in DataFrame c, then I want to keep c's copy of it. For example:
>>> a.head()
   unit  value target
0     3     23    'a'
1     2     24    'd'
2     8     56    'e'
3     9     89    'p'
4     0     32    'q'
>>> b.head()
   unit  value target
0     3     34    'a'
1     2     36    'd'
2     8     23    'a'
3     9     89    'p'
4     0     48    'm'
>>> c.head()
   unit  value target
0     3     34    'a'
1     5     23    'a'
2     2     48    'm'
3     9     56    'e'
4     0     98    'z'
The particular columns that I'm looking to find duplicates in are (value, target). As you can tell, there are a total of four different duplicate scenarios: (a, b), (b, c), (a, c), and (a, b, c). In the above example, the (value, target) pairs that occur for each scenario are (89, 'p'), (34, 'a'), (56, 'e'), and (23, 'a'), respectively.
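The four scenarios can be checked mechanically. The following is only a sketch: it rebuilds the sample frames shown above (assuming target holds plain strings) and adds a hypothetical source column, which is not part of the original data, to record which frame each row came from.

```python
import pandas as pd

# Sample data copied from the question; target is assumed to hold plain strings.
a = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [23, 24, 56, 89, 32],
                  "target": ["a", "d", "e", "p", "q"]})
b = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [34, 36, 23, 89, 48],
                  "target": ["a", "d", "a", "p", "m"]})
c = pd.DataFrame({"unit": [3, 5, 2, 9, 0],
                  "value": [34, 23, 48, 56, 98],
                  "target": ["a", "a", "m", "e", "z"]})

# "source" is a hypothetical helper column naming the frame a row came from.
tagged = pd.concat(
    [a.assign(source="a"), b.assign(source="b"), c.assign(source="c")],
    ignore_index=True,
)

# For every (value, target) pair, list the frames it occurs in, e.g. "a,b".
sources = tagged.groupby(["value", "target"])["source"].agg(
    lambda s: ",".join(sorted(set(s)))
)

# Keep only the pairs that occur in more than one frame.
shared = sources[sources.str.contains(",")]
print(shared)
```

On the sample data this also reports (48, 'm') as a second (b, c) duplicate, alongside the four pairs listed above.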
If the duplicate occurs in (a, b) it's not a huge problem because I can simply choose one of them, but if it occurs in any of the other three scenarios, I want to choose the pair from c and discard the duplicates from a and/or b.
The original idea that I had was to use the following code:
>>> import pandas as pd
>>> df = pd.concat([a, b, c], axis=0)
>>> df.drop_duplicates(subset=['value', 'target'], keep='last', inplace=True)
Since we're adding c to the end of the concatenated DataFrame df, we're guaranteed to retain its row should it occur as a duplicate. However, I was wondering if anyone knew of a way so that if an (a, b) duplicate occurs, we select one of the two at random, and if c is included, we always choose c.
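To make the behaviour of keep='last' concrete, here is a small sketch on the sample data. The source column is a hypothetical addition used only to show which frame each surviving row came from, and target is assumed to hold plain strings.

```python
import pandas as pd

# Sample data copied from the question; target is assumed to hold plain strings.
a = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [23, 24, 56, 89, 32],
                  "target": ["a", "d", "e", "p", "q"]})
b = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [34, 36, 23, 89, 48],
                  "target": ["a", "d", "a", "p", "m"]})
c = pd.DataFrame({"unit": [3, 5, 2, 9, 0],
                  "value": [34, 23, 48, 56, 98],
                  "target": ["a", "a", "m", "e", "z"]})

# Concatenate in the order a, b, c, tagging each row with its origin
# ("source" is a hypothetical helper column).
df = pd.concat(
    [a.assign(source="a"), b.assign(source="b"), c.assign(source="c")],
    ignore_index=True,
)

# keep='last' retains the final occurrence of each (value, target) pair.
deduped = df.drop_duplicates(subset=["value", "target"], keep="last")

# Map each surviving pair to the frame it was taken from.
kept_from = deduped.set_index(["value", "target"])["source"]
print(kept_from)
```

With this fixed ordering, every pair that occurs in c is taken from c, but an (a, b) duplicate such as (89, 'p') is always taken from b rather than chosen at random.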
Thanks in advance.
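We can use sample to shuffle the combined rows of a and b before concatenating with c. Because the rows of a and b are then in random order, keep='last' picks one row of an (a, b) duplicate at random, while rows from c, which are appended last, always win: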
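import pandas as pd
# Shuffle all rows of a and b together (frac=1 samples every row without replacement)
a_b = pd.concat([a, b]).sample(frac=1)
# c goes last, so keep='last' always prefers its rows over those of a and b
new = pd.concat([a_b, c]).drop_duplicates(subset=['value', 'target'], keep='last')
new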
Out[11]:
   unit  value target
1     2     24    'd'
4     0     32    'q'
3     9     89    'p'
1     2     36    'd'
0     3     34    'a'
1     5     23    'a'
2     2     48    'm'
3     9     56    'e'
4     0     98    'z'
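The shuffle-then-deduplicate approach can be wrapped in a helper and checked over several seeds. This is a sketch under assumptions: merge_prefer_c, its keys and random_state parameters, and the source column are illustrative names that do not appear in the original posts, and target is assumed to hold plain strings.

```python
import pandas as pd

def merge_prefer_c(a, b, c, keys=("value", "target"), random_state=None):
    """Hypothetical helper: merge a, b, c, preferring c on duplicate keys."""
    # Shuffle a and b together so either one can win an (a, b) duplicate.
    a_b = pd.concat([a.assign(source="a"), b.assign(source="b")]).sample(
        frac=1, random_state=random_state
    )
    # Append c last so that keep='last' always retains its rows.
    combined = pd.concat([a_b, c.assign(source="c")], ignore_index=True)
    return combined.drop_duplicates(subset=list(keys), keep="last")

# Sample data copied from the question.
a = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [23, 24, 56, 89, 32],
                  "target": ["a", "d", "e", "p", "q"]})
b = pd.DataFrame({"unit": [3, 2, 8, 9, 0],
                  "value": [34, 36, 23, 89, 48],
                  "target": ["a", "d", "a", "p", "m"]})
c = pd.DataFrame({"unit": [3, 5, 2, 9, 0],
                  "value": [34, 23, 48, 56, 98],
                  "target": ["a", "a", "m", "e", "z"]})

# Run with several seeds to see which frame wins the (89, 'p') duplicate.
results = [merge_prefer_c(a, b, c, random_state=seed) for seed in range(20)]
picks = {
    r.set_index(["value", "target"])["source"].loc[(89, "p")] for r in results
}
print(picks)

# Pairs present in c must always be taken from c, whatever the seed.
c_pairs = set(zip(c["value"], c["target"]))
```

Passing random_state makes a given run reproducible. Leaving it as None gives a fresh shuffle each time, which is what the answer's code does.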