How can I speed up an iterative function on my large pandas dataframe?

I am quite new to pandas and I have a pandas dataframe of about 500,000 rows filled with numbers. I am using Python 2.x and am currently defining and calling the method shown below on it. If two adjacent values in series 'A' are the same, it sets a predicted value equal to the corresponding value in series 'B'. However, it runs extremely slowly, outputting about 5 rows per second, and I want to find a way to accomplish the same result more quickly.

import numpy as np

def myModel(df):

    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size

    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan

    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan
    for x in range(1, seriesLength):

        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # Set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = prev_B

Is there a way to vectorize this or to just make it run faster? At the current rate, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should give my program that much trouble.

Something like this should work as you describe:

# Two ways to write it for a numeric A (the first assumes df['predicted_series'] already exists)
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B

This gets rid of the for loop and sets predicted_series to the value of B wherever A is equal to the previous A.
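To make the masking idea concrete, here is a minimal sketch on a tiny frame. The sample values are made up for illustration; only the column names A, B and predicted_series come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "A": [1, 1, 2, 3, 3, 3],
    "B": [10, 20, 30, 40, 50, 60],
})

# Start with an empty (all-NaN) prediction column.
df["predicted_series"] = np.nan

# Boolean mask: True where A equals the value in the previous row.
# The first row compares against NaN, so it is always False.
same_as_previous = df["A"] == df["A"].shift()

# Assign B only on the masked rows; every other row stays NaN.
df.loc[same_as_previous, "predicted_series"] = df["B"]

print(df)
```

The whole comparison runs as a single column operation instead of one Python-level lookup per row, which is where the speedup should come from.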

Edit:

Per your comment, change your initialization of predicted_series to be all NaN and then forward-fill the values:

df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df['predicted_series'] = df['predicted_series'].ffill()
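As a rough illustration of what the forward fill does, the sketch below runs the three steps on a small made-up frame and keeps a copy of the column before filling. The data and the before_fill name are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "A": [1, 1, 2, 3, 3, 4],
    "B": [10, 20, 30, 40, 50, 60],
})

# Step 1: initialize the prediction column to all NaN.
df["predicted_series"] = np.nan

# Step 2: where A repeats (difference from the previous row is 0), take B.
df.loc[df["A"].diff() == 0, "predicted_series"] = df["B"]
before_fill = df["predicted_series"].copy()

# Step 3: carry the last known prediction forward over the NaN gaps.
df["predicted_series"] = df["predicted_series"].ffill()

print(pd.DataFrame({"before": before_fill, "after": df["predicted_series"]}))
```

Rows before the first match stay NaN, because there is nothing earlier to fill from.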

For the fastest speed, a slight modification of ayhan's answer will perform best:

df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())

That will run faster than my original recommendation. Note that shift() only carries a value forward by a single row, so gaps longer than one row still need a forward fill afterwards.
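It may be worth checking on a small frame how far the np.where variant carries values. The sketch below uses made-up data, starts from an all-NaN column (an assumption, matching the edit above), and applies an explicit ffill() afterwards; after_where is a hypothetical name used only to inspect the intermediate result.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "A": [1, 1, 2, 3, 3, 4],
    "B": [10, 20, 30, 40, 50, 60],
})
df["predicted_series"] = np.nan

# Take B where A repeats, otherwise the previous row's prediction.
df["predicted_series"] = np.where(
    df["A"].shift() == df["A"],
    df["B"],
    df["predicted_series"].shift(),
)
after_where = df["predicted_series"].copy()

# Fill any remaining gaps from the last known prediction.
df["predicted_series"] = df["predicted_series"].ffill()

print(pd.DataFrame({"after_where": after_where, "after_ffill": df["predicted_series"]}))
```

On this data the np.where step alone leaves the non-matching rows as NaN, and the ffill() call is what fills them.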

df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
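One way to gain confidence in a vectorized rewrite is to compare it with a plain loop on random data. The sketch below is an assumption-laden check, not the asker's exact code: predict_loop is a simplified reading of the original loop (take the previous B when A repeats), and both function names and the random data are hypothetical.

```python
import numpy as np
import pandas as pd

def predict_loop(a, b):
    """Reference loop: predicted[x] = b[x-1] whenever a[x] == a[x-1]."""
    predicted = [np.nan] * len(a)
    for x in range(1, len(a)):
        if a[x] == a[x - 1]:
            predicted[x] = b[x - 1]
    return predicted

def predict_vectorized(df):
    """Same rule with whole-column operations."""
    same = df["A"] == df["A"].shift()
    # Keep the previous row's B where A repeats, NaN everywhere else.
    return df["B"].shift().where(same)

# Hypothetical random data with many repeated A values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.integers(0, 3, size=1000),
    "B": rng.integers(1, 100, size=1000),
})

loop_result = predict_loop(df["A"].tolist(), df["B"].tolist())
vector_result = predict_vectorized(df)

# NaN positions must line up too, hence equal_nan=True.
matches = np.allclose(loop_result, vector_result.to_numpy(), equal_nan=True)
print(matches)
```

If the two disagree on your real data, the difference is usually in which row of B is taken (current versus previous), so that is the first thing to check.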
