
How to speed up a rolling rank on a pandas Series?

I want to compute a rolling rank over a series: for each window, the rank of the window's last element among the values in that window.

Assume I have a pandas series:

In [18]: s = pd.Series(np.random.rand(10))

In [19]: s
Out[19]: 
0    0.340396
1    0.664459
2    0.647212
3    0.529363
4    0.535349
5    0.781628
6    0.313549
7    0.933539
8    0.618337
9    0.013442
dtype: float64

I can use pandas' built-in rank function inside a rolling apply, like this:

In [20]: s.rolling(4).apply(lambda x: pd.Series(x).rank().iloc[-1])
<ipython-input-20-41df4deb36f8>:1: FutureWarning: Currently, 'apply' passes the values as ndarrays to the applied function. In the future, this will change to passing it as Series objects. You need to specify 'raw=True' to keep the current behaviour, and you can pass 'raw=False' to silence this warning
  s.rolling(4).apply(lambda x: pd.Series(x).rank().iloc[-1])
Out[20]: 
0    NaN
1    NaN
2    NaN
3    2.0
4    2.0
5    4.0
6    1.0
7    4.0
8    2.0
9    1.0
dtype: float64
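
A side note on the FutureWarning above: it goes away once raw is passed explicitly. As a minimal sketch (the helper name last_rank is hypothetical), with raw=True each window arrives as a plain ndarray, so the average-method rank of its last element can be found by counting rather than by building a Series per window. This assumes the data contains no NaN values.

```python
import numpy as np
import pandas as pd

def last_rank(window):
    """Average-method rank of the last element of a 1-D ndarray window."""
    last = window[-1]
    smaller = (window < last).sum()   # values strictly below the last one
    ties = (window == last).sum()     # values equal to it, itself included
    return smaller + (ties + 1) / 2

# Same values as the example series above
s = pd.Series([0.340396, 0.664459, 0.647212, 0.529363, 0.535349,
               0.781628, 0.313549, 0.933539, 0.618337, 0.013442])

# raw=True hands each window to last_rank as an ndarray
rolling_rank = s.rolling(4).apply(last_rank, raw=True)
print(rolling_rank.tolist())
```

The tie term matters only when values repeat: tied values share the mean of the rank positions they occupy, which is what rank() does by default.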

This works, but it is quite slow. Here is a timing test:

In [24]: %timeit pd.Series(np.random.rand(100000)).rolling(100).apply(lambda x: pd.Series(x).rank().iloc[-1])
<magic-timeit>:1: FutureWarning: Currently, 'apply' passes the values as ndarrays to the applied function. In the future, this will change to passing it as Series objects. You need to specify 'raw=True' to keep the current behaviour, and you can pass 'raw=False' to silence this warning
22.5 s ± 292 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there a good way to speed this up? I suspect the rolling loop itself could be improved. Thanks.

It's faster with scipy/numpy (requires NumPy 1.20 or later, which added sliding_window_view):

import pandas as pd
import numpy as np
from time import time
from scipy.stats import rankdata
from numpy.lib.stride_tricks import sliding_window_view

np.random.seed()
array = np.random.rand(100000)

t0 = time()
ranks = pd.Series(array).rolling(100).apply(lambda x: x.rank().iloc[-1])
t1 = time()
print(f'With pandas: {t1-t0} sec.')

# One rank per full window: the list has len(array) - 99 entries and no leading NaNs
t0 = time()
ranks = [rankdata(x)[-1] for x in sliding_window_view(array, window_shape=100)]
t1 = time()
print(f'With numpy: {t1-t0} sec.')

Output:

With pandas: 11.682222127914429 sec.
With numpy: 3.9317219257354736 sec.
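
The Python-level loop can probably be removed as well. Since only the rank of each window's last element is needed, it can be computed for all windows at once by counting how many values in the window are smaller than (or tied with) that element. The following is a sketch under that assumption, not a benchmarked result: rolling_last_rank is a hypothetical helper name, the input is assumed to be NaN-free, and I have not timed it here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_last_rank(values, window):
    """Average-method rank of each window's last element (NaN-free input)."""
    values = np.asarray(values, dtype=float)
    # Read-only view of shape (n - window + 1, window); no data is copied
    windows = sliding_window_view(values, window_shape=window)
    last = windows[:, -1:]                   # trailing axis kept for broadcasting
    smaller = (windows < last).sum(axis=1)   # values strictly below the last one
    ties = (windows == last).sum(axis=1)     # values equal to it, itself included
    ranks = smaller + (ties + 1) / 2
    # Pad the front with NaN so the result lines up with rolling(window) output
    return np.concatenate([np.full(window - 1, np.nan), ranks])

rng = np.random.default_rng(0)
array = rng.random(100_000)
ranks = rolling_last_rank(array, 100)
```

The comparisons create temporary boolean arrays with one byte per window element (about 10 MB for this size), so very long series or wide windows may need to be processed in chunks. Newer pandas releases (1.4 and later) also provide Rolling.rank, so s.rolling(100).rank() may be the simplest option where it is available.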

