
What is the fastest way to loop over a list of filter criteria for a pandas DataFrame and do some calculations?

I often find myself with a list of filters that I need to apply to a pandas DataFrame. I apply each filter and do some calculations, but this often results in slow code, and I would like to optimize the performance. Below is an example of my slow solution. It filters a DataFrame on a list of date ranges, calculates the sum of a column over the rows that fall inside each date range, and then assigns that value to the row whose date matches the start of the range:

import numpy as np
import pandas as pd
import datetime


def generateTestDataFrame(N=50, windowSizeInDays=5):
    dd = {"AsOfDate" : [],
            "WindowEndDate" : [],
            "X" : []}

    d = datetime.date.today()

    for i in range(N):

        dd["AsOfDate"].append(d)
        dd["WindowEndDate"].append(d + datetime.timedelta(days=windowSizeInDays))
        dd["X"].append(float(i))

        d = d + datetime.timedelta(days=1)

    newDf = pd.DataFrame(dd)
    return newDf

def run():
    numRows = 50
    windowSizeInDays = 5

    print("NumRows: %s" % (numRows))
    print("WindowSizeInDays: %s" % (windowSizeInDays))

    df = generateTestDataFrame(numRows, windowSizeInDays)

    newAggColumnName = "SumOverNdays"
    df[newAggColumnName] = np.nan  # Initialize the column to nan

    for i in range(df.shape[0]):
        row_i = df.iloc[i]
        startDate = row_i["AsOfDate"]
        endDate = row_i["WindowEndDate"]
        sumAggOverNdays = df.loc[ (df["AsOfDate"] >= startDate) & (df["AsOfDate"] < endDate) ]["X"].sum()
        df.loc[df["AsOfDate"] == startDate, newAggColumnName] = sumAggOverNdays  

    print(df.head(10))

if __name__ == "__main__":
    run()

This produces the following output:

NumRows: 50
WindowSizeInDays: 5
     AsOfDate WindowEndDate    X  SumOverNdays
0  2019-01-15    2019-01-20  0.0          10.0
1  2019-01-16    2019-01-21  1.0          15.0
2  2019-01-17    2019-01-22  2.0          20.0
3  2019-01-18    2019-01-23  3.0          25.0
4  2019-01-19    2019-01-24  4.0          30.0
5  2019-01-20    2019-01-25  5.0          35.0
6  2019-01-21    2019-01-26  6.0          40.0
7  2019-01-22    2019-01-27  7.0          45.0
8  2019-01-23    2019-01-28  8.0          50.0
9  2019-01-24    2019-01-29  9.0          55.0

Try using pandas.DataFrame.apply() for the calculations.

doc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

Using your code:

%%timeit
run()
205 ms ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Adapted to use apply():

%%timeit
windowSizeInDays = 5
rows = 50
df_ = pd.DataFrame(index=range(rows),columns=['AsOfDate','WindowEndDate','X','SumOverNdays'])
asofdate = [datetime.date.today() + datetime.timedelta(days=i) for i in range(rows)]
windowenddate = [i + datetime.timedelta(days=windowSizeInDays) for i in asofdate]

df_['AsOfDate'] = asofdate
df_['WindowEndDate'] = windowenddate
df_['X'] = np.arange(float(df_.shape[0]))
df_['SumOverNdays'] = df_.apply(lambda x: df_.loc[ (df_["AsOfDate"] >= x['AsOfDate']) & (df_["AsOfDate"] < x['WindowEndDate']) ]["X"].sum(), axis=1)
df_
112 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Not a big difference, but in this particular example apply() doesn't get us much further than that.
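If the frame is sorted by AsOfDate, as the test data is, one alternative that avoids re-filtering the whole frame for every row is a prefix sum combined with np.searchsorted. The following is a minimal sketch under that assumption. The helper names make_frame and window_sums and the fixed start date are mine, not from the question, and I have not timed it against the versions above.

```python
import datetime

import numpy as np
import pandas as pd


def make_frame(n=50, window_days=5, start=datetime.date(2019, 1, 15)):
    # Hypothetical helper: same shape as the question's test data,
    # but with a fixed start date so the result is reproducible.
    as_of = [start + datetime.timedelta(days=i) for i in range(n)]
    end = [d + datetime.timedelta(days=window_days) for d in as_of]
    return pd.DataFrame({
        "AsOfDate": as_of,
        "WindowEndDate": end,
        "X": np.arange(n, dtype=float),
    })


def window_sums(df):
    # Assumption: df is already sorted ascending by AsOfDate.
    dates = pd.to_datetime(df["AsOfDate"]).to_numpy()
    ends = pd.to_datetime(df["WindowEndDate"]).to_numpy()
    x = df["X"].to_numpy()

    # prefix[k] holds the sum of x[:k], so any contiguous block
    # x[lo:hi] sums to prefix[hi] - prefix[lo].
    prefix = np.concatenate(([0.0], np.cumsum(x)))

    # First row with AsOfDate >= window start.
    lo = np.searchsorted(dates, dates, side="left")
    # First row with AsOfDate >= window end (exclusive upper bound).
    hi = np.searchsorted(dates, ends, side="left")

    return prefix[hi] - prefix[lo]


df = make_frame()
df["SumOverNdays"] = window_sums(df)
print(df.head(10))
```

window_sums returns a NumPy array aligned with the rows of df, so it can be assigned straight to the new column. For data that is not sorted by AsOfDate, sort it first (for example with df.sort_values("AsOfDate")) before calling it.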
