
Filling nan based on reverse moving average

To start with, the simplified example here is a small dataframe with some nans:

    A   B   C
0   NaN NaN NaN
1   NaN NaN NaN
2   2.0 1.0 NaN
3   2.0 NaN NaN
4   0.0 4.0 2.0
5   NaN 2.0 5.0
6   NaN 3.0 1.0

My goal is to fill all the NaN values in column C (ignore A and B; they are only here to make it a dataframe) so that it looks like this:

    A   B   C
0   NaN NaN 2.839506
1   NaN NaN 2.629630
2   2.0 1.0 3.222222
3   2.0 NaN 2.666667
4   0.0 4.0 2.0
5   NaN 2.0 5.0
6   NaN 3.0 1.0

Working from the bottom up, each NaN is filled with the mean of the three values below it, e.g. 2.666667 = (2.0+5.0+1.0)/3 and 3.222222 = (2.666667+2.0+5.0)/3. Values that have already been filled feed into the later averages, so the whole column ends up filled with no NaN left.
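The arithmetic for the four filled rows can be checked by hand. A minimal sketch, where the names `c0` to `c6` are just illustrative labels for the rows of column C:

```python
# Known values at the bottom of column C (rows 4, 5 and 6).
c4, c5, c6 = 2.0, 5.0, 1.0

# Each missing row is the mean of the three rows below it,
# so values filled earlier are reused in later averages.
c3 = (c4 + c5 + c6) / 3
c2 = (c3 + c4 + c5) / 3
c1 = (c2 + c3 + c4) / 3
c0 = (c1 + c2 + c3) / 3

# Rounded to six decimals for comparison with the expected table.
filled = [round(v, 6) for v in (c0, c1, c2, c3)]
print(filled)
```

Rounded to six decimals, these agree with rows 0 to 3 of the expected output.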

I have tried some solutions from here using rolling(window=n, min_periods=1) with shift(), but they did not produce this result. This is a simplified example; the full datasets have more than 30,000 rows (with 20% missing values), so a for loop would be time-consuming. There should be a clear and elegant way that avoids df[::-1] (reversing the whole series, taking rolling means, then reversing it back), but even that trick does not work.

Pandas doesn't support rolling with side effects: every window is computed from the original values, so a value filled in one step cannot feed into the next window. A loop is the only approach I can think of for your problem. Looping over 30,000 rows is not a big problem in itself; what is slow is calling df.loc repeatedly.
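To see this in practice, here is a minimal sketch of one reverse/rolling/shift attempt. The exact combination is an assumption about what such an attempt might look like, not the code from the question; `rolled` and `attempt` are illustrative names.

```python
import numpy as np
import pandas as pd

# Column C from the example.
c = pd.Series([np.nan, np.nan, np.nan, np.nan, 2.0, 5.0, 1.0])

# Reverse, take a full 3-row rolling mean, shift it by one row so each
# position sees the three rows "before" it, then reverse back.
rolled = c[::-1].rolling(window=3, min_periods=3).mean().shift(1)[::-1]

# fillna aligns on the index, so only NaN rows with a rolled value are filled.
attempt = c.fillna(rolled)
print(attempt)
```

Only row 3 gets a value, because it is the only NaN whose three following rows are all present in the original data. Rows 0 to 2 stay NaN: their windows would need values that rolling never writes back.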

You can convert C to a numpy array for speed:

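import numpy as np

# Copy column C and reverse it so the loop runs from the last row upward.
reversed_c = df["C"].to_numpy(dtype=float, copy=True)[::-1]
for i, value in enumerate(reversed_c):
    # Skip the first three positions (no full window) and values already present.
    if i < 3 or not np.isnan(value):
        continue
    # Mean of the three values that follow this row in the original order.
    reversed_c[i] = reversed_c[i - 3:i].mean()
df["C"] = reversed_c[::-1]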
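For reuse, the same loop can be wrapped in a function and run on the example data. This is a sketch rather than a definitive implementation: the function name `fill_reverse_moving_average` and its `window` parameter are hypothetical, and it walks the array backward by index instead of reversing it.

```python
import numpy as np
import pandas as pd


def fill_reverse_moving_average(series, window=3):
    """Fill NaNs bottom-up with the mean of the `window` values below each one."""
    # Work on a float copy so the original column is left untouched.
    values = series.to_numpy(dtype=float, copy=True)
    n = len(values)
    # Start at the last row that still has `window` rows below it.
    for i in range(n - window - 1, -1, -1):
        if np.isnan(values[i]):
            # Uses values filled in earlier iterations; stays NaN if the
            # window itself still contains a NaN.
            values[i] = values[i + 1:i + 1 + window].mean()
    return pd.Series(values, index=series.index, name=series.name)


# The example dataframe from the question.
df = pd.DataFrame({
    "A": [np.nan, np.nan, 2.0, 2.0, 0.0, np.nan, np.nan],
    "B": [np.nan, np.nan, 1.0, np.nan, 4.0, 2.0, 3.0],
    "C": [np.nan, np.nan, np.nan, np.nan, 2.0, 5.0, 1.0],
})
df["C"] = fill_reverse_moving_average(df["C"])
print(df)
```

Like the loop above, this leaves NaNs in the last `window` rows untouched, since there are not enough rows below them to average.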

