
Calculate a rolling window weighted average on a Pandas column

I'm relatively new to Python and have been trying to calculate some simple rolling weighted averages across rows in a pandas DataFrame. I have a DataFrame of observations, df, and a DataFrame of weights, w. I create a new DataFrame, dot, to hold the inner product of these two sets of values.

As w has fewer rows than df, I use a for loop to calculate the weighted average row by row, over the preceding rows, with the window length equal to the length of w.

More clearly, my set-up is as follows:

import pandas as pd

df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))

w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))

dot = pd.DataFrame(0, columns = ['dot'], index = df.index)

for i in range(0,len(df)):
    dot.loc[i] = sum(df.iloc[max(1,(i-3)):i].values * w.iloc[-min(3,(i-1)):4].values)

I would expect the result to be as follows (i.e., when i = 4):

dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print(dot.loc[4])  # 2.1

However, when running the for loop above, I receive the error:

ValueError: operands could not be broadcast together with shapes (0,1) (2,1)

This is where I get confused. I think it must have to do with how I pass i into iloc, as I don't get shape errors when I calculate it manually, as in the example with i = 4 above. However, looking at other examples and the documentation, I don't see why that's the case. Any help is appreciated.

Your first problem is that you are trying to multiply arrays of two different sizes. For example, when i=0, the two slices in your for loop have these shapes:

df.iloc[max(1,(0-3)):0].values.shape
# (0,1)

w.iloc[-min(3,(0-1)):4].values.shape    
# (2,1)
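
To see which iterations are affected, the two slices from the question's loop can be checked for every i. This is only a diagnostic sketch; the names obs, wts, shapes and mismatched are introduced here for illustration and are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame([0, 1, 2, 3, 4, 5, 6, 7, 8], index=range(9))
w = pd.DataFrame([0.1, 0.25, 0.5], index=range(3))

shapes = []
for i in range(len(df)):
    # The same two slices the question's loop multiplies together.
    obs = df.iloc[max(1, i - 3):i].values
    wts = w.iloc[-min(3, i - 1):4].values
    shapes.append((i, obs.shape, wts.shape))
    status = "ok" if obs.shape == wts.shape else "mismatch"
    print(i, obs.shape, wts.shape, status)

# Iterations whose slices have different shapes.
mismatched = [i for i, a, b in shapes if a != b]
print(mismatched)
```

With this data only the first two iterations have mismatched shapes, which is why the manual calculation for i = 4 runs without an error.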

These are exactly the shapes reported in the error you are getting. The easiest way I can think of to make the arrays compatible for multiplication is to pad your DataFrame with leading zeros, using concatenation.

df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
    0
0   0
1   0
2   0
3   1
4   2
5   3
6   4
7   5
8   6
9   7
10  8

You can now use your for loop (with some minor tweaking):

for i in range(len(df)):
    dot.loc[i] = sum(df2.iloc[i:i+3].values * w.values)

A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of your for loop:

import numpy as np
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w))
    0
0   NaN
1   NaN
2   0.00
3   0.50
4   1.25
5   2.10
6   2.95
7   3.80
8   4.65
9   5.50
10  6.35
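
One caveat, offered as a hedged note: because w is a DataFrame, np.dot(x, w) returns a one-element array rather than a plain scalar, and how gracefully rolling().apply() accepts that may depend on the pandas and NumPy versions in use. A minimal sketch that sidesteps this, assuming a pandas version that supports the raw argument of apply and the to_numpy method, flattens the weights to 1-D first (the names weights and rolled are introduced here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([0, 1, 2, 3, 4, 5, 6, 7, 8])
w = pd.DataFrame([0.1, 0.25, 0.5])

# Pad with two leading zeros, as in the concatenation step above.
df2 = pd.concat([pd.DataFrame([0, 0]), df], ignore_index=True)

# Flatten the weights to a plain 1-D array so np.dot returns a scalar.
weights = w[0].to_numpy()

# raw=True hands each window to the lambda as a NumPy array, not a Series.
rolled = df2[0].rolling(3).apply(lambda x: np.dot(x, weights), raw=True)
print(rolled)
```

This should give the same values as the output above, returned as a Series instead of a one-column DataFrame.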

You can also drop the first two padding rows and reset the index:

df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w)).drop([0,1]).reset_index(drop=True)
    0
0   0.00
1   0.50
2   1.25
3   2.10
4   2.95
5   3.80
6   4.65
7   5.50
8   6.35
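
As an alternative technique (not the rolling/apply approach above), the same numbers can be obtained without explicit padding by using np.convolve, which treats values before the start of the series as zero. This is a sketch under the assumption that the weights are ordered oldest row first, as in w; the names values, full and preceding_only are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([0, 1, 2, 3, 4, 5, 6, 7, 8])
weights = np.array([0.1, 0.25, 0.5])  # same values as w, oldest row first

values = df[0].to_numpy()

# Convolution flips its second argument, so reverse the weights to keep
# 0.5 on the current row and 0.1 on the row two steps back.
full = np.convolve(values, weights[::-1], mode="full")

# Keep only the first len(values) entries, one per original row.
dot = pd.DataFrame({"dot": full[:len(values)]}, index=df.index)
print(dot)

# Shift by one row if row i should use only the rows before it,
# as in the question's i = 4 example.
preceding_only = dot["dot"].shift(1)
print(preceding_only[4])
```

Note that in the results above the window ends on the current row, so row 4 holds 2.95; the shifted series is what places the question's expected value of 2.1 at row 4.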
