Pandas duplicated rows with missing values

Question

Hello, I have a DataFrame that contains duplicates:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg'],
                   'country': ['Germany', 'Germany', None],
                   'state': [None, None, 'Hamburg']})

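Removing the duplicates with df.drop_duplicates() only drops the second row. The first and third rows both remain, because their missing values sit in different columns.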

How can I configure drop_duplicates so that only one row is left, containing all the information?

Answer 1

For this specific case, here is my proposal: fill the gaps within each id, then drop the duplicates.

import pandas

df = pandas.DataFrame({'id': [1, 1, 1, 2, 2],
                       'name': ['Hamburg', 'Hamburg', 'Hamburg', 'Paris', 'Paris'],
                       'country': ['Germany', 'Germany', None, None, 'France'],
                       'state': [None, None, 'Hamburg', 'Paris', None]})

# Process each id separately so that fills never cross id boundaries
subsets = []
for current_id in df['id'].unique().tolist():
    df_subset = df[df['id'] == current_id].copy(deep=True)
    df_subset = df_subset.sort_values(by=['id', 'name', 'country', 'state'])
    # Fill missing values in both directions within this id
    df_subset = df_subset.bfill().ffill()
    df_subset = df_subset.drop_duplicates()
    subsets.append(df_subset)

# DataFrame.append was removed in pandas 2.0; pandas.concat replaces it
df = pandas.concat(subsets)

Out[18]: 
   id     name  country    state
0   1  Hamburg  Germany  Hamburg
4   2    Paris   France    Paris

Subsetting the records by id prevents ffill and bfill from filling values across adjacent rows that belong to different ids.
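The same per-id fill can probably be written without an explicit loop. The following is only a sketch, assuming the five-row frame from this answer; the names value_cols, filled and result are introduced here for illustration and are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg', 'Paris', 'Paris'],
                   'country': ['Germany', 'Germany', None, None, 'France'],
                   'state': [None, None, 'Hamburg', 'Paris', None]})

# Hypothetical helper name: the columns that may contain gaps
value_cols = ['name', 'country', 'state']

filled = df.copy()
# transform applies the fill to each id group on its own,
# so values never leak from one id into the next
filled[value_cols] = (
    df.groupby('id')[value_cols]
      .transform(lambda s: s.bfill().ffill())
)

# After filling, the rows of one id are identical and collapse to one
result = filled.drop_duplicates().reset_index(drop=True)
print(result)
```

Like the loop, this only collapses a group to a single row when its rows agree on every non-missing value; conflicting values would still leave several rows per id.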

Regards

Answer 2

If no single row holds all the information, you can use groupby with first, which takes the first non-missing value of each column within a group. Replace None with np.nan beforehand so that the missing values are handled correctly:

import numpy as np
print(df.fillna(value=np.nan).groupby('id').first())
       name  country    state
id                           
1   Hamburg  Germany  Hamburg
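As a rough illustration of how this behaves with more than one id, here is a sketch that assumes the five-row frame from Answer 1 and adds reset_index() to turn id back into a regular column. The variable name merged is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg', 'Paris', 'Paris'],
                   'country': ['Germany', 'Germany', None, None, 'France'],
                   'state': [None, None, 'Hamburg', 'Paris', None]})

# first() picks the first non-missing value per column in each id group;
# reset_index() moves id from the index back into the columns
merged = df.fillna(value=np.nan).groupby('id').first().reset_index()
print(merged)
```

Note that first() resolves conflicts silently: if two rows of the same id hold different non-missing values in a column, only the first one survives.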


solution1
1 2020-04-27 15:45:08

solution2
1 ACCEPTED 2020-04-27 15:45:45
