Merge duplicated pandas rows on specific rules

Given the following data frame

import pandas as pd

df = pd.DataFrame({
    'identifier': ['1', '2', None],
    'name': ['Tom', 'Peter', 'Peter'],
    'registered': [True, False, True]
})

the goal is to merge the rows that share the same name according to certain rules, e.g.

  • if one of the duplicated identifiers is a string and the other is None, use the string identifier
  • apply a logical OR across all registered entries

So the result should look like

df_result = pd.DataFrame({
    'identifier': ['1', '2'], 
    'name': ['Tom', 'Peter'], 
    'registered': [True, True]
})

I tried it with groupby, but maybe this is the wrong approach altogether?

drop_duplicates does not let me apply specific rules.

I think you need a custom function with dropna, drop_duplicates and any:

df = pd.DataFrame({
    'identifier': ['1', '2', None, '2'], 
    'name': ['Peter', 'Peter', 'Peter', 'Peter'], 
    'registered': [True, False, True, True]
})
print(df)
  identifier   name  registered
0          1  Peter        True
1          2  Peter       False
2       None  Peter        True
3          2  Peter        True

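def f(x):
    # Keep each distinct non-missing identifier; the scalar from any()
    # is broadcast to every remaining row of the group.
    return pd.DataFrame({'identifier': x['identifier'].dropna().drop_duplicates(),
                         'registered': x['registered'].any()})

# Drop the inner (original row) index level, then turn 'name' back into a column.
df = df.groupby('name').apply(f).reset_index(level=1, drop=True).reset_index()
print(df)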
    name identifier  registered
0  Peter          1        True
1  Peter          2        True
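
The same result can probably be reached without apply. The following is only a sketch under that assumption, not the answerer's method: it broadcasts the per-name OR back onto every row with transform('any'), then drops missing and repeated identifiers. The names registered_any and merged are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'identifier': ['1', '2', None, '2'],
    'name': ['Peter', 'Peter', 'Peter', 'Peter'],
    'registered': [True, False, True, True]
})

# Logical OR of 'registered' within each name, aligned to the original rows.
registered_any = df.groupby('name')['registered'].transform('any')

merged = (
    df.assign(registered=registered_any)
      .dropna(subset=['identifier'])            # discard rows without an identifier
      .drop_duplicates(['name', 'identifier'])  # one row per (name, identifier) pair
      .reset_index(drop=True)
)
print(merged)
```

As with the apply version, a name whose identifiers are all missing would disappear from the result, so check whether that matters for your data.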

Let's modify your data slightly.

df = pd.DataFrame({
    'identifier': ['1', None, '2'], 
    'name': ['Tom', 'Peter', 'Peter'], 
    'registered': [True, False, True]
})

df

  identifier   name  registered
0          1    Tom        True
1       None  Peter       False
2          2  Peter        True

Here None is the first identifier for "Peter". You can remedy this with a sort_values call, which places missing values last by default, and then call groupby + agg.

(df.sort_values('identifier')
   .groupby('name', as_index=False)
   .agg({'identifier': 'first', 'registered': 'any'}))

    name  registered identifier
0  Peter        True          2
1    Tom        True          1
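
To tie this back to the question, here is a minimal sketch that applies the same groupby + agg idea to the original frame and restores the column and row order of df_result. It assumes a pandas version with named aggregation (0.25 or later). The helper name merge_by_name is hypothetical. As far as I know, the 'first' aggregation in groupby skips missing values by default, so the explicit sort may not be strictly necessary, but treat that as something to verify on your own pandas version.

```python
import pandas as pd

def merge_by_name(frame):
    """Collapse rows sharing a name (hypothetical helper, not from the answers)."""
    return (
        frame.groupby('name', as_index=False, sort=False)  # sort=False keeps first-seen name order
             .agg(identifier=('identifier', 'first'),      # first non-missing identifier
                  registered=('registered', 'any'))        # logical OR over the group
             [['identifier', 'name', 'registered']]        # column order used in the question
    )

df = pd.DataFrame({
    'identifier': ['1', '2', None],
    'name': ['Tom', 'Peter', 'Peter'],
    'registered': [True, False, True]
})

result = merge_by_name(df)
print(result)
```

Passing sort=False is a design choice here: without it, groupby orders the groups alphabetically, which puts Peter before Tom as in the output above.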
