简体   繁体   中英

Pandas groupby and replace duplicates with empty string

I have a dataframe like the following:

import pandas as pd

d = {'one':[1,1,1,1,2, 2, 2, 2],
     'two':['a','a','a','b', 'a','a','b','b'],
     'letter':[' a','b','c','a', 'a', 'b', 'a', 'b']}

df = pd.DataFrame(d)
>    one two letter
0    1   a      a
1    1   a      b
2    1   a      c
3    1   b      a
4    2   a      a
5    2   a      b
6    2   b      a
7    2   b      b

And I am trying to convert it to a dataframe like the following, where empty cells are filled with empty string '':

one  two  letter
1    a    a        
          b        
          c         
     b    a         
2    a    a         
          b         
     b    a         
          b          

When I perform groupby with all columns I get a series object that is basically exactly what I am looking for, but not a dataframe:

df.groupby(df.columns.tolist()).size()   
1    a    a         1
          b         1
          c         1
     b    a         1
2    a    a         1
          b         1
     b    a         1
          b         1

How can I get the desired dataframe?

You can mask your columns where the value is not the same as the value below, then use where to change it to a blank string:

df[['one','two']] = df[['one','two']].where(df[['one', 'two']].apply(lambda x: x != x.shift()), '')

>>> df
  one two letter
0   1   a      a
1              b
2              c
3       b      a
4   2   a      a
5              b
6       b      a
7              b

some explanation :

Your mask looks like this:

>>> df[['one', 'two']].apply(lambda x: x != x.shift())
     one    two
0   True   True
1  False  False
2  False  False
3  False   True
4   True   True
5  False  False
6  False   True
7  False  False

All that where is doing is finding the values where that is true, and replacing the rest with ''

The solution to the original problem is to find the dublicated cells in each of the first two columns and set them to empty:

df.loc[df.duplicated(subset=['one', 'two']), 'two'] = ''
df.loc[df.duplicated(subset=['one']),        'one'] = ''

However, the purpose of this transformation is unclear. Perhaps you are trying to solve a wrong problem.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM