
Pandas calculate max possible rolling_mean up to window size

I'm trying to recreate the smoothing functionality of the Google Ngram Viewer using Pandas' rolling_mean function. Everything works except that the last N rows (where N is the chosen window size) come out as NaN. I understand why the NaNs are there, but I'm wondering whether there is a way to force Pandas to calculate those last N rows with the largest window size possible.

Starting DataFrame:

       y    mc    vc      g                   freq
0   1980  2110   891  acorn  0.0000006816639806737
1   1981  2493   925  acorn  0.0000007869870441530
2   1982  1970   969  acorn  0.0000006058489961744
3   1983  1974   942  acorn  0.0000005869087043278
4   1984  2265   962  acorn  0.0000006284175013608
5   1985  2331  1002  acorn  0.0000006287865167972
6   1986  2288  1036  acorn  0.0000005938515224444
7   1987  2975  1081  acorn  0.0000007639327989758
8   1988  2562  1164  acorn  0.0000006201948589259
9   1989  2773  1271  acorn  0.0000006308818219374
10  1990  3230  1449  acorn  0.0000006736596925364
11  1991  3984  1279  acorn  0.0000008445218584394
12  1992  2908  1349  acorn  0.0000005616418361769
13  1993  3511  1522  acorn  0.0000006673125583208
14  1994  3623  1709  acorn  0.0000006391704741358
15  1995  3836  1760  acorn  0.0000006497943728333
16  1996  4304  1910  acorn  0.0000006909335126709
17  1997  4107  1954  acorn  0.0000006390261435505
18  1998  4469  1993  acorn  0.0000006660007460970
19  1999  4494  2141  acorn  0.0000006233081676193
20  2000  4827  2304  acorn  0.0000006135668877077

When I do this:

df['freq_average'] = pd.rolling_mean(df['freq'],5,min_periods=0,center=True)

I get this result:

       y    mc    vc      g                   freq           freq_average
0   1980  2110   891  acorn  0.0000006816639806737  0.0000006531021239145
1   1981  2493   925  acorn  0.0000007869870441530  0.0000006446377522759
2   1982  1970   969  acorn  0.0000006058489961744  0.0000006595496331134
3   1983  1974   942  acorn  0.0000005869087043278  0.0000006551768804259
4   1984  2265   962  acorn  0.0000006284175013608  0.0000006527473745770
5   1985  2331  1002  acorn  0.0000006287865167972  0.0000006546484943915
6   1986  2288  1036  acorn  0.0000005938515224444  0.0000006694537560066
7   1987  2975  1081  acorn  0.0000007639327989758  0.0000006489678280088
8   1988  2562  1164  acorn  0.0000006201948589259  0.0000006545554245675
9   1989  2773  1271  acorn  0.0000006308818219374  0.0000006593064945501
10  1990  3230  1449  acorn  0.0000006736596925364  0.0000006612498465021
11  1991  3984  1279  acorn  0.0000008445218584394  0.0000006668995733997
12  1992  2908  1349  acorn  0.0000005616418361769  0.0000006710063571366
13  1993  3511  1522  acorn  0.0000006673125583208  0.0000006621034432386
14  1994  3623  1709  acorn  0.0000006391704741358  0.0000006623864713016
15  1995  3836  1760  acorn  0.0000006497943728333  0.0000006608123863716
16  1996  4304  1910  acorn  0.0000006909335126709                    NaN
17  1997  4107  1954  acorn  0.0000006390261435505                    NaN
18  1998  4469  1993  acorn  0.0000006660007460970                    NaN
19  1999  4494  2141  acorn  0.0000006233081676193                    NaN
20  2000  4827  2304  acorn  0.0000006135668877077                    NaN

So what I'm looking for is a way to calculate the above results, but then have index 16 (in this case) calculated with window size of 4 (instead of the original 5), index 17 calculated with a window size of 3, and so on.

If you look at the results from the Google Ngram Viewer, indexes 16-20 should come out as follows:

       y    mc    vc      g                   freq  freq_average
16  1996  4304  1910  acorn  0.0000006909335126709  0.0000659528
17  1997  4107  1954  acorn  0.0000006390261435505  0.0000638973
18  1998  4469  1993  acorn  0.0000006660007460970  0.0000648639
19  1999  4494  2141  acorn  0.0000006233081676193  0.0000645971
20  2000  4827  2304  acorn  0.0000006135668877077  0.0000647105

I've been banging my head against this for a day or so and have had no luck. Any direction is much appreciated!

Just to mention, I agree with Andy Hayden that only the last two rows should be NaN, since you are using center=True (each value is then averaged over the previous 2 rows through the next 2 rows).

pandas will automatically do what you need for the first rows (average whatever is available), but it won't do so at the bottom. The logic is to take the 2 previous values and the 2 next values, where available.

So, to follow the logic of the top rows:

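for i in range(2):
    index = i + 19  # positions 19 and 20, the last two rows
    # Mean of freq from two rows back through the last row (slice end 21 is exclusive).
    df.loc[index, 'freq_average'] = df['freq'].iloc[index-2:21].sum() / (20 - index + 3)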

This takes the average from two rows before the current one (index - 2) through the end of the data (the slice stops at 21, which is exclusive). It is tailored to your specific problem; for a different window size you need to adapt it.
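To adapt the idea to other window sizes, the hard-coded positions can be replaced by a window clipped to the bounds of the data. The following is only a sketch under assumptions: fill_edge_means is a hypothetical helper name (not part of pandas), the window is assumed to be odd and centered, and the input is toy data rather than the acorn frequencies above.

```python
import pandas as pd

def fill_edge_means(values, averaged, window):
    """Replace NaNs in `averaged` with the mean of whatever part of the
    centered window around that row actually exists in `values`."""
    half = window // 2
    n = len(values)
    out = averaged.copy()
    for pos in range(n):
        if pd.isna(out.iloc[pos]):
            lo = max(pos - half, 0)        # clip the window at the first row
            hi = min(pos + half + 1, n)    # clip the window at the last row
            out.iloc[pos] = values.iloc[lo:hi].mean()
    return out

# Toy data (hypothetical), with NaNs left in the last two rows.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
averaged = pd.Series([2.0, 2.5, 3.0, 4.0, float("nan"), float("nan")])
filled = fill_edge_means(values, averaged, window=5)
print(filled.tolist())  # [2.0, 2.5, 3.0, 4.0, 4.5, 5.0]
```

Row 4 is averaged over rows 2 to 5 and row 5 over rows 3 to 5, so the window shrinks as it runs off the end. Rows that already hold a value are left untouched.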

According to help(pd.rolling_mean), setting min_periods=0 (as you did) should do what you are looking for. However, pandas 0.14.1 has a bug in the implementation of the rolling_* functions that puts NaNs at the end when center=True is used. The bug report is at https://github.com/pydata/pandas/issues/6795.
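For readers on a newer pandas release, where pd.rolling_mean has been replaced by the .rolling() method, the same request can be sketched as below. This assumes such a release is installed (the question itself targets 0.14.1), uses toy data rather than the acorn frequencies, and picks min_periods=1 as one reasonable setting rather than the only one.

```python
import pandas as pd

# Toy data (hypothetical), standing in for df['freq'].
freq = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Centered 5-row window; min_periods=1 lets rows near either edge
# average whatever part of the window exists instead of returning NaN.
smoothed = freq.rolling(window=5, min_periods=1, center=True).mean()
print(smoothed.tolist())
```

With this call, the last row is averaged over the final three values only and the second-to-last over the final four, which mirrors what already happens at the top of the frame.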
