
Pandas duplicated rows with missing values

Hello, I have a DataFrame that contains duplicates.

import pandas as pd

df = pd.DataFrame({'id':[1,1,1], 
                   'name':['Hamburg','Hamburg','Hamburg'], 
                   'country':['Germany','Germany',None],
                   'state':[None,None,'Hamburg']})

Removing the duplicates with df.drop_duplicates() returns:

(image: output of df.drop_duplicates(); the screenshot is not preserved)

How can I configure drop_duplicates so that only one row is left, containing all the information?

For your specific case, here is my proposal:

import pandas
df = pandas.DataFrame({'id':[1,1,1,2,2], 
                   'name':['Hamburg','Hamburg','Hamburg','Paris','Paris'], 
                   'country':['Germany','Germany',None, None, 'France'],
                   'state':[None,None,'Hamburg', 'Paris', None]})

# Process each id separately so that fills never cross id boundaries.
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat.
subsets=[]
for id_value in df['id'].unique().tolist():
    df_subset=df[df['id']==id_value].copy(deep=True)
    df_subset.sort_values(by=['id','name','country','state'],inplace=True)
    df_subset.bfill(inplace=True)   # fill gaps from the rows below
    df_subset.ffill(inplace=True)   # fill remaining gaps from the rows above
    df_subset.drop_duplicates(inplace=True)
    subsets.append(df_subset)

df=pandas.concat(subsets)

Out[18]: 
   id     name  country    state
0   1  Hamburg  Germany  Hamburg
4   2    Paris   France    Paris

Subsetting the records by id prevents ffill and bfill from filling values across adjacent records that belong to different ids.
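
The same per-id filling can also be sketched without an explicit loop by using the groupby versions of ffill and bfill. This is only a sketch on the example data above; the names value_cols, filled and result are made up for illustration, and it assumes the rows of one id never contain conflicting non-missing values.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg', 'Paris', 'Paris'],
                   'country': ['Germany', 'Germany', None, None, 'France'],
                   'state': [None, None, 'Hamburg', 'Paris', None]})

# Columns to fill (hypothetical name, chosen for this sketch)
value_cols = ['name', 'country', 'state']

filled = df.copy()
# Forward-fill, then backward-fill, but only within each id group
filled[value_cols] = filled.groupby('id')[value_cols].ffill()
filled[value_cols] = filled.groupby('id')[value_cols].bfill()

# After filling, the rows of each id are identical, so one row per id remains
result = filled.drop_duplicates().reset_index(drop=True)
print(result)
```

If two rows of the same id disagree on a non-missing value, they are not collapsed and more than one row per id survives.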

Regards

If no single row holds all the information, you can use groupby with first, which takes the first non-null value of each column per group. Replace None with np.nan beforehand so that the missing values are recognised:

import numpy as np
print(df.fillna(value=np.nan).groupby('id').first())
       name  country    state
id                           
1   Hamburg  Germany  Hamburg
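
As a sketch, here is the same groupby/first approach applied to the two-id example from the other answer. Passing as_index=False is an addition made here so that id stays a regular column; the rest follows the line above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg', 'Paris', 'Paris'],
                   'country': ['Germany', 'Germany', None, None, 'France'],
                   'state': [None, None, 'Hamburg', 'Paris', None]})

# first() returns the first non-null value of every column within each group
result = df.fillna(value=np.nan).groupby('id', as_index=False).first()
print(result)
```

Note that first keeps only the first non-null value per column, so if rows of the same id hold different values, the later ones are silently dropped.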


 