
Pandas: Performance For Rolling Rank On Large Dataframes

I am looking to migrate a statistical analysis project to pandas. I would like to rank 3 columns over a rolling window of N days. I have found methods to do this, as answered in the question [rank-data-over-a-rolling-window][1], but the performance isn't adequate for my data set (45K rows). The fastest approaches I have found use the bottleneck library or numpy argsort, as below. These dramatically improve performance, but are still some way off compared to the rolling_mean function, which should have similar performance characteristics.

EDIT: I have updated the code below to provide a reproducible example with timings. The Series rank function is the most flexible, allowing me to choose how ties are ranked, but it is very slow. The best two I can find are the bottleneck method and argsort. Both are comparable in performance but are restrictive in their handling of ties. However, both are still considerably slower than the rolling mean.

import numpy as np
import pandas as pd
import scipy.stats as sc
import bottleneck as bd

rollWindow = 240
df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'),
                  index=pd.date_range('1/1/2000', periods=100000, freq='1H'))
df.iloc[-3:-1, df.columns.get_loc('A')] = 7.5  # single-step indexing avoids chained assignment
df.iloc[-1, df.columns.get_loc('A')] = 5.5

df["SER_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankOnSeries)
 # 28.9secs (allows competition/min ranking for ties)

df["SCIPY_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankSciPy)
 # 70.89secs (allows competition/min ranking for ties)

df["BNECK_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankBottleneck)
 # 3.64secs (only provides average ranking for ties)

df["ASRT_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankArgSort)
 # 3.56secs (only provides competition/min ranking for ties)

df["MEAN"] = pd.rolling_mean(df['A'], window=rollWindow)
 # 0.008secs

def rollingRankOnSeries(array):
    s = pd.Series(array)
    return s.rank(method='min', ascending=False)[len(s) - 1]

def rollingRankSciPy(array):
    return array.size + 1 - sc.rankdata(array)[-1]

def rollingRankBottleneck(array):
    return array.size + 1 - bd.rankdata(array)[-1]

def rollingRankArgSort(array):
    return array.size - array.argsort().argsort()[-1]
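To make the argsort variant concrete, here is a small illustrative check (the values are my own, not from the timings above) of the double-argsort trick: applying argsort twice yields each element's 0-based ascending rank, so `size - ranks[-1]` is the descending competition rank of the window's last element, for distinct values.

```python
import numpy as np

a = np.array([0.3, 1.2, 0.7, 0.9])
order = a.argsort()        # indices that would sort a ascending
ranks = order.argsort()    # 0-based ascending rank of each element
# ranks[-1] == 2, so a.size - ranks[-1] == 2: 0.9 is the 2nd largest value
```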


                        A   SER_RK  SCIPY_RK  BNECK_RK  ASRT_RK     MEAN  
2011-05-29 11:00:00  1.37       23      23.0      23.0     23   0.013526  
2011-05-29 12:00:00  0.45       85      85.0      85.0     85   0.016833   
2011-05-29 13:00:00  7.50        1       1.0       1.0      1   0.049606   
2011-05-29 14:00:00  7.50        1       1.5       1.5      1   0.083655   
2011-05-29 15:00:00  5.50        3       3.0       3.0      3   0.112001 

I have previously implemented moving-window statistics by maintaining the difference between successive windows (online), which makes it easy to compute the change in rank, whereas here it appears I have to completely re-rank every window, which is unnecessary. I have seen that a similar question has been asked previously: [Pandas performance on rolling stats][2].

  1. Do you know if there is a way in pandas to perform this calculation more efficiently?
  2. Is there an easy way to implement a function over a moving window in pandas where I can find the element(s) added and removed at each step and return a value accordingly, possibly maintaining my own running rank calculation?
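As a rough sketch of the "online" idea in question 2 (my own illustration, not a pandas API; the name `online_rolling_rank` is hypothetical): keep the window's values in a sorted list, insert the arriving element and remove the departing one, and read the rank off directly instead of re-ranking the whole window.

```python
import bisect
import numpy as np

def online_rolling_rank(values, window):
    """Descending competition ('min') rank of the newest element in each window."""
    out = np.full(len(values), np.nan)
    sorted_win = []
    for i, v in enumerate(values):
        bisect.insort(sorted_win, v)  # O(window) insert into sorted window
        if i >= window:
            # drop the element that just left the window
            sorted_win.pop(bisect.bisect_left(sorted_win, values[i - window]))
        if i >= window - 1:
            # rank = 1 + count of elements strictly greater than v
            out[i] = len(sorted_win) - bisect.bisect_right(sorted_win, v) + 1
    return out
```

Each step costs O(window) for the list insert/delete rather than the O(window log window) of a full re-rank, so this is only a constant-factor sketch; a balanced-tree or skip-list container would be needed for a true asymptotic win.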

Thanks

[1]: http://stackoverflow.com/questions/14440187/rank-data-over-a-rolling-window-in-pandas-dataframe
[2]: http://stackoverflow.com/questions/24613850/pandas-performance-on-multiple-rolling-statistics-on-different-time-intervals

I believe the documentation here does what you are describing.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000',
    periods=1000))
plot1 = pd.rolling_max(ts, 240)
plot2 = pd.rolling_min(ts, 240)
plot3 = pd.rolling_mean(ts, 240)

plt.plot(plot1.values.tolist())
plt.plot(plot2.values.tolist())
plt.plot(plot3.values.tolist())
plt.show()

This is how pandas is optimized to perform the task. If this is not fast enough, I'm not sure that a workaround will be faster than the built-in functions. If this is redundant, feel free to downvote :)

EDIT: is this more like what you were talking about?

ts = pd.Series(np.random.randn(1000000), index=pd.date_range('1/1/2000', periods=1000000))

listofmax = []
for number in range(0, len(ts), 240):
    listofmax.append(ts[number:number+240].max())

With 1 million rows it took 0.4 seconds according to timeit. Granted, this is just a datetime stamp and a value. Are you looking for something quicker than this, and am I understanding better what you have tried?
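Note that the loop above steps in blocks of 240, so it computes per-chunk maxima rather than a rolling maximum. For comparison (my own sketch, not from the question), a true rolling max can be done in O(n) with a monotonic deque:

```python
from collections import deque

def rolling_max(values, window):
    out, dq = [], deque()  # dq holds indices; their values are kept decreasing
    for i, v in enumerate(values):
        while dq and values[dq[-1]] <= v:
            dq.pop()           # discard elements dominated by the new value
        dq.append(i)
        if dq[0] <= i - window:
            dq.popleft()       # front index fell out of the window
        if i >= window - 1:
            out.append(values[dq[0]])  # window max is always at the front
    return out
```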
