
Drop Duplicates and Add Values Pandas

I have the dataframe below. I would like to drop the rows that duplicate a value in column A, but append the E value of each dropped row to the E value of the row that is kept.

import pandas as pd
import numpy as np
# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
dfp = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,6,7],
                    'B' : [1,1,3,5,0,0,np.nan,9,0,0],
                    'C' : ['AA1233445','AA1233445', 'rmacy','Idaho Rx','Ab123455','TV192837','RX','Ohio Drugs','RX12345','USA Pharma'],
                    'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.nan],
                    'E' : ['Assign','Allign','Hello','Ugly','Appreciate','Undo','Testing','Unicycle','Pharma','Unicorn',]})
print(dfp)

I'm grabbing all the duplicates:

df2 = dfp.loc[(dfp['A'].duplicated(keep=False))].copy()

     A    B          C           D           E
0  NaN  1.0  AA1233445    123456.0      Assign
1  NaN  1.0  AA1233445    123456.0      Allign
2  3.0  3.0      rmacy   1234567.0       Hello
4  5.0  0.0   Ab123455     12345.0  Appreciate
5  5.0  0.0   TV192837     12345.0        Undo
6  3.0  NaN         RX  12345678.0     Testing

and would like my outcome to be:

     A    B          C           D           E
0  NaN  1.0  AA1233445    123456.0      Assign Allign
2  3.0  3.0      rmacy   1234567.0      Hello Testing
4  5.0  0.0   Ab123455     12345.0      Appreciate Undo

I know I need to use dfp.loc[(dfp['A'].duplicated(keep='last'))].copy() to grab the first occurrence, but I'm failing to set the value of the E column to include the other duplicated values.

I'm thinking I need to try something like:

df3 = dfp.loc[(dfp['A'].duplicated(keep='last'))].copy()
df3['E'] = df3['E'] + dfp.loc[(dfp['A'].duplicated(keep=False).copy()),'E']

but my output is:

     A    B          C          D                     E
0  NaN  1.0  AA1233445   123456.0          AssignAssign
2  3.0  3.0      rmacy  1234567.0            HelloHello
4  5.0  0.0   Ab123455    12345.0  AppreciateAppreciate

I'm stumped. Am I overcomplicating it? How can I get the output I'm looking for, so that I can later drop all the duplicates except the first but 'save' the values of the dropped rows in the E column?
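
A note on why the attempt above doubles each string: pandas aligns two Series on their index labels before adding them, not on position. Each kept row (labels 0, 2, 4) is therefore paired with the row of the same label on the right-hand side, which is itself. The small sketch below reproduces this with two hand-built Series; the names first and all_dups are only illustrative.

```python
import pandas as pd

# Stand-ins for the two sides of the addition in the attempt:
# 'first' holds the kept rows, 'all_dups' holds every duplicated row.
first = pd.Series(['Assign', 'Hello'], index=[0, 2])
all_dups = pd.Series(['Assign', 'Allign', 'Hello', 'Testing'], index=[0, 1, 2, 6])

# Addition aligns on index labels: label 0 meets label 0, label 2 meets label 2.
# Labels present on only one side (1 and 6) produce missing values.
summed = first + all_dups
print(summed)
```

Assigning such a result back to a column of the filtered frame aligns on labels again, so only labels 0, 2 and 4 survive, each holding its own value twice.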

Define the functions to use in agg and apply them within groupby. To get groupby to work with NaN, I converted column A to strings and then back to floats.

f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}

dfp.groupby(
    dfp.A.astype(str), sort=False
).agg(f).reset_index().eval(
    'A = @pd.to_numeric(A, "coerce").values',
    inplace=False
)

     A    B           C            D                E
0  NaN  1.0   AA1233445     123456.0    Assign Allign
1  3.0  3.0       rmacy    1234567.0    Hello Testing
2  4.0  5.0    Idaho Rx   12345678.0             Ugly
3  5.0  0.0    Ab123455      12345.0  Appreciate Undo
4  1.0  9.0  Ohio Drugs  123456789.0         Unicycle
5  6.0  0.0     RX12345    1234567.0           Pharma
6  7.0  0.0  USA Pharma          NaN          Unicorn

Limiting it to just the duplicated rows:

f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
d1 = dfp[dfp.duplicated('A', keep=False)]
d2 = d1.groupby(d1.A.astype(str), sort=False).agg(f).reset_index()
d2.A = d2.A.astype(float)

d2

     A    B          C          D                E
0  NaN  1.0  AA1233445   123456.0    Assign Allign
1  3.0  3.0      rmacy  1234567.0    Hello Testing
2  5.0  0.0   Ab123455    12345.0  Appreciate Undo
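
As a possible alternative to the string round trip used above: pandas 1.1 and later accept dropna=False in groupby, which keeps NaN as a group key of its own. The following is a minimal sketch under that version assumption, not a replacement for the code above; the names dups and merged are illustrative. Note that 'first' returns the first non-null value in each group, which for this data matches the first row.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345,
          12345678, 123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

# Keep only rows whose A value occurs more than once (NaN matches NaN here).
dups = dfp[dfp.duplicated('A', keep=False)]

# dropna=False (assumes pandas >= 1.1) keeps the NaN key as a group,
# so A never has to be converted to str and back.
merged = (
    dups.groupby('A', sort=False, dropna=False)
        .agg({'B': 'first', 'C': 'first', 'D': 'first', 'E': ' '.join})
        .reset_index()
)
print(merged)
```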

Here is my ugly solution:

In [263]: (dfp.reset_index()
     ...:     .assign(A=dfp.A.fillna(-1))
     ...:     .groupby('A')
     ...:     .filter(lambda x: len(x) > 1)
     ...:     .groupby('A', as_index=False)
     ...:     .apply(lambda x: x.head(1).assign(E=x.E.str.cat(sep=' ')))
     ...:     .replace({'A':{-1:np.nan}})
     ...:     .set_index('index'))
     ...:
Out[263]:
         A    B          C          D                E
index
0      NaN  1.0  AA1233445   123456.0    Assign Allign
2      3.0  3.0      rmacy  1234567.0    Hello Testing
4      5.0  0.0   Ab123455    12345.0  Appreciate Undo
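
The same idea can be sketched a little more compactly with transform followed by drop_duplicates, which also keeps the original index labels. This is only one possible variation; it borrows the -1 sentinel from the code above and so assumes -1 never occurs as a real value in A. The names key and out are illustrative.

```python
import numpy as np
import pandas as pd

dfp = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 6, 7],
    'B': [1, 1, 3, 5, 0, 0, np.nan, 9, 0, 0],
    'C': ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
          'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
    'D': [123456, 123456, 1234567, 12345678, 12345, 12345,
          12345678, 123456789, 1234567, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate', 'Undo',
          'Testing', 'Unicycle', 'Pharma', 'Unicorn'],
})

out = dfp.copy()

# Group on A with NaN replaced by a sentinel (assumption: -1 is not a real A value).
key = out['A'].fillna(-1)

# Give every row the space-joined E values of its whole group.
out['E'] = out.groupby(key)['E'].transform(' '.join)

# Keep only rows whose A is duplicated, then keep the first row of each group.
out = out[out.duplicated('A', keep=False)].drop_duplicates('A', keep='first')
print(out)
```

Because column A itself is never overwritten, the NaN values do not need to be restored afterwards.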
