
Comparing values within a column in a pandas groupby

Let's say I have a DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'b': [2016, 2017, 2018, 2019, 2000, 2000, 2000, 2000, 2007, 2008, 2014]})

I want to group this data by column a and, for each row, count how many years within its group are less than that row's year.

For example, within group 1, 2016 is the 0th year and 2017 is the 1st year (there is 1 year before it).

I tried doing:

df['c'] = df.groupby('a')['b'].apply(lambda x: [sum(y > x) for y in x]).reset_index(drop=False)

but this takes a really long time. Is there a better way to do it? I'm working with 6.5M rows.

expected output:

    a     b    c
 0  1  2016    0  
 1  1  2017    1 
 2  1  2018    2
 3  1  2019    3
 4  2  2000    0
 5  2  2000    0
 6  2  2000    0
 7  2  2000    0
 8  3  2007    0
 9  3  2008    1
10  3  2014    2

I would use rank and then sub(1), which I think is very readable, approximately twice as fast as the other answer, and ~3.5x as fast as the original approach:

df.groupby(['a'])['b'].rank('min').sub(1)

#0     0.0
#1     1.0
#2     2.0
#3     3.0
#4     0.0
#5     0.0
#6     0.0
#7     0.0
#8     0.0
#9     1.0
#10    2.0
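
To write this back into the frame as an integer column c matching the expected output, a small sketch; the astype(int) cast is my addition, since rank returns floats:

df['c'] = df.groupby('a')['b'].rank('min').sub(1).astype(int)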

%timeit df.groupby(['a'])['b'].rank('min').sub(1)
#1.58 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')['b'].transform(lambda x: pd.factorize(x)[0])
#3.76 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')['b'].apply(lambda x: [sum(y > x) for y in x]).reset_index(drop=False)
#5.32 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

NumPy solution:

import numpy as np
from scipy.stats import rankdata

np.concatenate([rankdata(x, method='min') for x in np.split(df.b.values, np.flatnonzero(df.a.diff().fillna(0)))]) - 1
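
For readability, here is the same one-liner broken into steps as I understand it; it assumes the frame is sorted by a so that group boundaries coincide with changes in a (the intermediate names boundaries and chunks are mine):

boundaries = np.flatnonzero(df.a.diff().fillna(0))      # row positions where 'a' changes
chunks = np.split(df.b.values, boundaries)              # one array of years per group
df['c'] = np.concatenate([rankdata(x, method='min') for x in chunks]) - 1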

%timeit df.groupby(['a'])['b'].rank('min').sub(1)
1000 loops, best of 3: 845 µs per loop
%timeit df.groupby('a')['b'].transform(lambda x: pd.factorize(x)[0])
100 loops, best of 3: 1.77 ms per loop
%timeit df.groupby('a')['b'].apply(lambda x: [sum(y > x) for y in x]).reset_index(drop=False)
100 loops, best of 3: 2.71 ms per loop
%timeit np.concatenate([rankdata(x,method='min')for x in (np.split(df.b.values,np.flatnonzero(df.a.diff().fillna(0))))])-1
1000 loops, best of 3: 342 µs per loop

Clean and efficient solution:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'b': [2016, 2017, 2018, 2019, 2000, 2000, 2000, 2000, 2007, 2008, 2014]})

# index by group and sort so each group's rows are contiguous
df = df.set_index('a')
df = df.sort_index()

# for each group, map every unique year to its 0-based position (in order of appearance)
quickmap = {}

for index in df.index.unique():
    temphash = {}
    val = 0
    for i in df.loc[index]['b'].unique():
        temphash[i] = val
        val += 1
    quickmap[index] = temphash

df = df.reset_index()

# look up each row's year in its group's mapping
def toret(row):
    key = row['a']
    subkey = row['b']
    return quickmap[key][subkey]

df['c'] = df.apply(toret, axis=1)
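
For what it's worth, this hand-rolled mapping numbers each group's unique values in order of appearance, which is essentially what pd.factorize does; a vectorized sketch that should produce the same c column on data like this, where years within each group already appear in ascending order:

df['c'] = df.groupby('a')['b'].transform(lambda x: pd.factorize(x)[0])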
