
Python pandas groupby optimisation

I have a large dataframe with many rows and columns, and I need to group by one of the columns, 'group'. Here is a small example:

  group      rank             word
0     a  0.739631           entity
1     a  0.882556  physical_entity
2     b  0.588045      abstraction
3     b  0.640933            thing
4     c  0.726738           object
5     c  0.669280            whole
6     d  0.006574         congener
7     d  0.308684     living_thing
8     d  0.638631         organism
9     d  0.464244          benthos

Basically, I will be applying a series of functions to create new columns and transform existing ones after the groupby. For instance:

One of the functions I want to implement is top_word, which selects the highest-ranked word for each group. Its output would be a unicode column:

group    top_word
a    physical_entity [0.88]
b    thing [0.64]
c    object [0.73]
d    organism [0.63]

Currently, I'm using this horrendous method:

import pandas as pd

def top_word(tab):
    # the group is already sorted by rank, so the top word is simply the first row
    first = tab.iloc[0]
    res = '{} [{:.2f}]'.format(first['word'], first['rank'])
    return [res]

def aggr(x, fns):
    # apply every aggregation function to the group and collect the results
    # into a one-row DataFrame
    d = {key: fn(x) for key, fn in fns.items()}
    return pd.DataFrame(d)

fs = {'top_word': top_word}
T = T.sort_values('rank', ascending=False)  # sort by rank so the aggfunc only has to pick the first row
T = T.groupby('group', sort=False).apply(lambda x: aggr(x, fs))
T.index = T.index.droplevel(level=1)

which gives (the result differs from the example above because the numbers were generated with a random number generator):

time taken: 0.0042  +- 0.0003 seconds
                 top_word
group                    
a           entity [0.07]
b      abstraction [0.84]
c           object [0.92]
d         congener [0.06]

I've designed this method so I can apply any function I wish to the table at any point. It needs to stay this flexible, but it just seems horrible! Is there a more efficient way to do something like this? Is iterating over groups + appending better?
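
By "iterating over groups + appending" I mean something along these lines (just a rough sketch, assuming T is still the rank-sorted frame from before the groupby/apply step above):

# rough sketch of the iterate-over-groups-and-append alternative
parts = []
for name, grp in T.groupby('group', sort=False):
    row = {key: fn(grp)[0] for key, fn in fs.items()}
    row['group'] = name
    parts.append(row)
res = pd.DataFrame(parts).set_index('group')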

Thanks

I think the idea is to groupby first, then sort each group and keep the first observation using .agg():

In [192]:

print df
  group      rank             word
0     a  0.739631           entity
1     a  0.882556  physical_entity
2     b  0.588045      abstraction
3     b  0.640933            thing
4     c  0.726738           object
5     c  0.669280            whole
6     d  0.006574         congener
7     d  0.308684     living_thing
8     d  0.638631         organism
9     d  0.464244          benthos
In [193]:

print df.groupby('group').agg(lambda x: sorted(x, reverse=True)[0])
           rank             word
group                           
a      0.882556  physical_entity
b      0.640933            thing
c      0.726738            whole
d      0.638631         organism
In [194]:

df_res = df.groupby('group').agg(lambda x: sorted(x, reverse=True)[0])
df_res.word+df_res['rank'].apply(lambda x: ' [%.2f]'%x)
Out[194]:
group
a        physical_entity [0.88]
b                  thing [0.64]
c                  whole [0.73]
d               organism [0.64]
dtype: object
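
One caveat: .agg applies the lambda to each column separately, so the sorted rank and the sorted word are not guaranteed to come from the same row (in group c above, whole gets paired with object's rank, 0.73). If that pairing matters, a row-preserving sketch (using the same df, via idxmax) would be:

# keep the entire row with the maximum rank in each group
idx = df.groupby('group')['rank'].idxmax()
top = df.loc[idx].set_index('group')
print top['word'] + top['rank'].apply(lambda x: ' [%.2f]'%x)
# group c then comes out as 'object [0.73]' rather than 'whole [0.73]'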
