To start with, here is a simplified example: a small DataFrame with some NaNs:
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  2.0  1.0  NaN
3  2.0  NaN  NaN
4  0.0  4.0  2.0
5  NaN  2.0  5.0
6  NaN  3.0  1.0
My goal is to fill all the NaN values in column C (ignore A and B; they are only there to make it a DataFrame) so that it looks like this:
     A    B         C
0  NaN  NaN  2.839506
1  NaN  NaN  2.629630
2  2.0  1.0  3.222222
3  2.0  NaN  2.666667
4  0.0  4.0  2.000000
5  NaN  2.0  5.000000
6  NaN  3.0  1.000000
Working from the bottom up, each NaN is filled with the average of the three values below it, for example 2.666667 = (2.0+5.0+1.0)/3 and 3.222222 = (2.666667+5.0+2.0)/3. Values filled earlier feed into the later averages, so the whole column ends up filled with no NaN left.
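To make the recurrence concrete, the four missing values can be worked out by hand in plain Python. This is only an arithmetic sketch; the names c0 to c6 are made up here and simply stand for rows 0 to 6 of column C.

```python
# Known values at the bottom of column C (rows 4, 5 and 6)
c4, c5, c6 = 2.0, 5.0, 1.0

# Each missing row is the mean of the three rows below it,
# so every new value depends on the ones filled just before it.
c3 = (c4 + c5 + c6) / 3
c2 = (c3 + c4 + c5) / 3
c1 = (c2 + c3 + c4) / 3
c0 = (c1 + c2 + c3) / 3

filled = [c0, c1, c2, c3]
print([round(v, 6) for v in filled])
```

Rounded to six decimals, these are the values shown for rows 0 to 3 in the target table.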
I have been trying some solutions here using .rolling(window=n, min_periods=1) with shift(), but they failed to do that. Also, since this is a simplified example and the full dataset has more than 30,000 rows (with 20% missing values), a for loop would be time-consuming. There should be a clear and elegant way without resorting to df[::-1] (reversing the whole series, taking rolling means, then reversing it back), but even that trick does not work.
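For illustration, here is a sketch of what a reverse-then-roll attempt might look like. The exact code that was tried is not shown above, so the combination of rolling, shift and fillna below is an assumption. It shows the underlying issue: the rolling window only sees the original values, never the ones that were just filled.

```python
import numpy as np
import pandas as pd

c = pd.Series([np.nan, np.nan, np.nan, np.nan, 2.0, 5.0, 1.0])

# Reverse, take a rolling mean, shift by one so a row never includes itself,
# then reverse back. The index labels travel with the values.
rolled = c[::-1].rolling(window=3, min_periods=1).mean().shift(1)[::-1]

# fillna aligns on the index, so each NaN takes the rolled value of its own row.
attempt = c.fillna(rolled)
print(attempt)
```

Only row 3 comes out as wanted (2.666667). Rows 2 and 1 are averaged over windows that still contain NaN, and row 0 stays NaN, because the filled values are never fed back into the window.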
Pandas doesn't support rolling with side effects, that is, a window that sees values filled in earlier steps. I can only think of a loop as the approach to your problem. Looping over 30,000 rows is not a big problem; repeatedly calling df.loc is, because that accessor is pretty slow. You can convert C to a NumPy array for speed:
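import numpy as np

# Work on a reversed copy so the writes below do not hit the DataFrame's own data
reversed_c = df["C"].to_numpy()[::-1].copy()
for i, value in enumerate(reversed_c):
    # Skip the first three entries (no full window) and any value that is already known
    if i < 3 or not np.isnan(value):
        continue
    # Mean of the three entries before it in reversed order, including ones filled earlier
    reversed_c[i] = np.mean(reversed_c[i-3:i])
df["C"] = reversed_c[::-1]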
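To check the loop approach against the sample data, here is a self-contained sketch. The helper name fill_reverse_rolling_mean and its window parameter are made up for this example, and it assumes the last `window` values of the column are not NaN. If they are, the NaN spreads into the averages that depend on them.

```python
import numpy as np
import pandas as pd


def fill_reverse_rolling_mean(values, window=3):
    """Fill NaNs from the bottom up with the mean of the `window` values below."""
    # Reversed copy: position 0 is the last row of the original column
    out = np.asarray(values, dtype=float)[::-1].copy()
    for i in range(window, len(out)):
        if np.isnan(out[i]):
            # The slice already contains values filled in earlier iterations
            out[i] = out[i - window:i].mean()
    return out[::-1]


df = pd.DataFrame({
    "A": [np.nan, np.nan, 2.0, 2.0, 0.0, np.nan, np.nan],
    "B": [np.nan, np.nan, 1.0, np.nan, 4.0, 2.0, 3.0],
    "C": [np.nan, np.nan, np.nan, np.nan, 2.0, 5.0, 1.0],
})
df["C"] = fill_reverse_rolling_mean(df["C"])
print(df)
```

On the sample frame this should reproduce the target column C from the question. The loop runs once per row over a plain NumPy array, so 30,000 rows should not be a problem.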