I have a large dataframe with many rows and columns, and I need to group by one of the columns, 'group'. Here is a small example:
  group      rank             word
0     a  0.739631           entity
1     a  0.882556  physical_entity
2     b  0.588045      abstraction
3     b  0.640933            thing
4     c  0.726738           object
5     c  0.669280            whole
6     d  0.006574         congener
7     d  0.308684     living_thing
8     d  0.638631         organism
9     d  0.464244          benthos
After the group by, I will apply a series of functions to create new columns and transform existing ones. For instance, one of the functions I want to implement is top_word, which selects the highest-ranked word for each group. Its output would be a unicode column:
group  top_word
a      physical_entity [0.88]
b      thing [0.64]
c      object [0.73]
d      organism [0.64]
Currently, I'm using this horrendous method:
import pandas as pd

def top_word(tab):
    # tab is already sorted by rank (descending), so the first row is the top word
    first = tab.iloc[0]
    res = '{} [{:.2f}]'.format(first['word'], first['rank'])
    return [res]

def aggr(x, fns):
    # apply every function in fns to the group and collect the results as columns
    d = {key: fn(x) for key, fn in fns.items()}
    return pd.DataFrame(d)

fs = {'top_word': top_word}
T = T.sort_values('rank', ascending=False)  # sort by rank so the aggfunc only has to pick the first row
T = T.groupby('group', sort=False).apply(lambda x: aggr(x, fs))
T.index = T.index.droplevel(level=1)
which gives (the values differ from the example above because the data was randomly generated):
time taken: 0.0042 +- 0.0003 seconds
                 top_word
group
a           entity [0.07]
b      abstraction [0.84]
c           object [0.92]
d         congener [0.06]
I've designed this method so I can apply any function I wish to the table at any point. It needs to stay this flexible, but it just seems horrible! Is there a more efficient way to do something like this? Is iterating over groups + appending better?
Thanks
I think the idea is to groupby first, then sort each group and keep the first observation using .agg():
In [192]:
print(df)
  group      rank             word
0     a  0.739631           entity
1     a  0.882556  physical_entity
2     b  0.588045      abstraction
3     b  0.640933            thing
4     c  0.726738           object
5     c  0.669280            whole
6     d  0.006574         congener
7     d  0.308684     living_thing
8     d  0.638631         organism
9     d  0.464244          benthos
In [193]:
print(df.groupby('group').agg(lambda x: sorted(x, reverse=True)[0]))
           rank             word
group
a      0.882556  physical_entity
b      0.640933            thing
c      0.726738            whole
d      0.638631         organism
In [194]:
df_res = df.groupby('group').agg(lambda x: sorted(x, reverse=True)[0])
df_res.word+df_res['rank'].apply(lambda x: ' [%.2f]'%x)
Out[194]:
group
a    physical_entity [0.88]
b              thing [0.64]
c              whole [0.73]
d           organism [0.64]
dtype: object
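One caveat about the session above: .agg() calls the lambda on each column separately, so rank and word are sorted independently of each other. That is why group c shows 0.726738 (the rank of object) next to whole (the alphabetically last word in that group). Below is a minimal sketch that keeps each row intact. It assumes the example data is rebuilt as df, and it swaps in idxmax rather than the sorted() lambda, so treat it as one possible alternative and not the method used above.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    'group': list('aabbccdddd'),
    'rank': [0.739631, 0.882556, 0.588045, 0.640933, 0.726738,
             0.669280, 0.006574, 0.308684, 0.638631, 0.464244],
    'word': ['entity', 'physical_entity', 'abstraction', 'thing', 'object',
             'whole', 'congener', 'living_thing', 'organism', 'benthos'],
})

# Find the row label of the highest rank within each group, then select
# those whole rows so 'rank' and 'word' come from the same observation.
top = df.loc[df.groupby('group')['rank'].idxmax()].set_index('group')

# Format as "word [rank]" with two decimals.
top_word = top['word'] + top['rank'].map(' [{:.2f}]'.format)
print(top_word)
```

With this version group c resolves to object, which matches the output the question asks for.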