I need to group by a subset of columns and count the number of distinct combinations of their values. However, there are other columns that may or may not have distinct values, and I want to somehow retain this information in my output. Here is an example:
gb1 gb2 text1 text2
bebop skeletor blue fisher
bebop skeletor blue wright
rocksteady beast_man orange haldane
rocksteady beast_man orange haldane
tokka kobra_khan green lande
tokka kobra_khan red arnold
I only want to group by gb1 and gb2.
Here is what I need:
gb1 gb2 count text1 text2
bebop skeletor 2 blue fisher, wright
rocksteady beast_man 2 orange haldane
tokka kobra_khan 2 green, red lande, arnold
I've got everything working except for handling the text1 and text2 columns.
Thanks in advance.
You can do this with groupby and agg:
s = (df.assign(count=1)
       .groupby(['gb1', 'gb2'])
       .agg({'count': 'sum',
             'text1': lambda x: ','.join(set(x)),
             'text2': lambda x: ','.join(set(x))})
       .reset_index())
s
gb1 gb2 count text1 text2
0 bebop skeletor 2 blue wright,fisher
1 rocksteady beast_man 2 orange haldane
2 tokka kobra_khan 2 green,red lande,arnold
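Because a set has no guaranteed order, the joined values can come out in a different order than they appear in the data (for example wright,fisher rather than fisher, wright). The following is a minimal sketch of an order-preserving variant, assuming pandas 0.25 or later for named aggregation; the helper name join_unique and the rebuilt sample frame are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "gb1": ["bebop", "bebop", "rocksteady", "rocksteady", "tokka", "tokka"],
    "gb2": ["skeletor", "skeletor", "beast_man", "beast_man", "kobra_khan", "kobra_khan"],
    "text1": ["blue", "blue", "orange", "orange", "green", "red"],
    "text2": ["fisher", "wright", "haldane", "haldane", "lande", "arnold"],
})

def join_unique(values):
    # Hypothetical helper: dict.fromkeys drops duplicates
    # while keeping the order of first appearance.
    return ", ".join(dict.fromkeys(values))

result = (
    df.groupby(["gb1", "gb2"], sort=False)
      .agg(
          count=("text1", "size"),        # rows per group
          text1=("text1", join_unique),
          text2=("text2", join_unique),
      )
      .reset_index()
)
print(result)
```

Using size for the count avoids the temporary count column, and sort=False keeps the groups in the order they first appear in df.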
You can use a combination of apply and transform. If df is your original dataframe:
def combine(xx):
    dd = xx.transform(lambda x: ','.join(set(x)))
    dd['count'] = len(xx)
    return dd

ddf = df.groupby(['gb1', 'gb2']).apply(combine)
With your sample dataframe, ddf is:
text1 text2 count
gb1 gb2
bebop skeletor blue fisher,wright 2
rocksteady beast_man orange haldane 2
tokka kobra_khan red,green lande,arnold 2
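Depending on the pandas version, transform may reject a function that collapses each column to a single string, since it expects output of the same length as its input. As a hedged sketch of the same per-group idea, the helper below (the names summarize and join_unique are illustrative, not from the answer) reduces the text columns with agg inside apply and selects the text columns explicitly so the grouping keys are not passed to the helper.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "gb1": ["bebop", "bebop", "rocksteady", "rocksteady", "tokka", "tokka"],
    "gb2": ["skeletor", "skeletor", "beast_man", "beast_man", "kobra_khan", "kobra_khan"],
    "text1": ["blue", "blue", "orange", "orange", "green", "red"],
    "text2": ["fisher", "wright", "haldane", "haldane", "lande", "arnold"],
})

def join_unique(values):
    # Drop duplicates but keep first-seen order, then join
    return ",".join(dict.fromkeys(values))

def summarize(group):
    # Reduce every column of the group to one joined string
    out = group.agg(join_unique)
    # Add the number of rows in the group
    out["count"] = len(group)
    return out

# Select the text columns so the helper only sees text1 and text2
ddf = df.groupby(["gb1", "gb2"])[["text1", "text2"]].apply(summarize)
print(ddf)
```

The result is indexed by (gb1, gb2); calling ddf.reset_index() turns the keys back into ordinary columns.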