
Replace NaN values in a Pandas DataFrame column after using centered .rolling() with first computed sum

I'm fairly new to Pandas and this is also my first actual question on Stack Overflow, so please bear with me.

I'm transforming a DataFrame with a MultiIndex. I need to calculate a centered moving sum over five observations. I do this with groupby, so the rolling sum is calculated within each group; the data is organized by gender, type, and age. However, that means the first two and last two rows within each group are NaN. I want the first two NaN values to equal the third value and the last two to equal the third-to-last value.

This is the original DataFrame:

    Gender    Type   Age    Value
1   'f'       A      1       654
2   'f'       A      2       665
3   'f'       A      3       684
4   'f'       A      4       688
5   'f'       A      5       651
6   'f'       A      6       650
7   'f'       A      7       698
8   'f'       A      8       689
9   'f'       A      9       648
10  'f'       A      10      654
11  'f'       B      1       623
12  'f'       B      2       620
13  'f'       B      3       623
14  'f'       B      4       653
15  'f'       B      5       653
16  'f'       B      6       642
17  'f'       B      7       632
18  'f'       B      8       632
19  'f'       B      9       644
20  'f'       B      10      654
21  'm'       A      1       623
22  'm'       A      2       624
23  'm'       A      3       600
24  'm'       A      4       642
25  'm'       A      5       622
26  'm'       A      6       623
27  'm'       A      7       633
28  'm'       A      8       635
29  'm'       A      9       653
30  'm'       A      10      623
31  'm'       B      1       623
32  'm'       B      2       632
33  'm'       B      3       632
34  'm'       B      4       683
35  'm'       B      5       652
36  'm'       B      6       655
37  'm'       B      7       691
38  'm'       B      8       684
39  'm'       B      9       645
40  'm'       B      10      624

This is the code I use for computing the rolling sum.

df=df.reset_index().set_index(['Age'])
df=df.groupby(['Gender','Type'])['Value'].rolling(window=5,center=True).sum().reset_index()

That computes this:


    Gender    Type   Age    Value
1   'f'       A      1       NaN
2   'f'       A      2       NaN
3   'f'       A      3       3342
4   'f'       A      4       3338
5   'f'       A      5       3371
6   'f'       A      6       3376
7   'f'       A      7       3336
8   'f'       A      8       3339
9   'f'       A      9       NaN
10  'f'       A      10      NaN
11  'f'       B      1       NaN
12  'f'       B      2       NaN
13  'f'       B      3       3172
14  'f'       B      4       3191
15  'f'       B      5       3203
16  'f'       B      6       3212
17  'f'       B      7       3203
18  'f'       B      8       3204
19  'f'       B      9       NaN
20  'f'       B      10      NaN
21  'm'       A      1       NaN
22  'm'       A      2       NaN
23  'm'       A      3       x1
24  'm'       A      4       x2
25  'm'       A      5       x3
26  'm'       A      6       x4
27  'm'       A      7       x5
28  'm'       A      8       x7
29  'm'       A      9       NaN
30  'm'       A      10      NaN
31  'm'       B      1       NaN
32  'm'       B      2       NaN
33  'm'       B      3       x8
34  'm'       B      4       x9
35  'm'       B      5       x10
36  'm'       B      6       x11
37  'm'       B      7       x12
38  'm'       B      8       x13
39  'm'       B      9       NaN
40  'm'       B      10      NaN

The x's are just placeholders for the rolling sums.
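For anyone who wants to reproduce the NaN pattern above, here is a minimal sketch that rebuilds only the two 'f' groups from the table and applies the same grouped, centered rolling sum. The variable name `rolled` is just for illustration and is not part of the original code.

```python
import pandas as pd

# The ('f', 'A') and ('f', 'B') groups copied from the table above.
df = pd.DataFrame({
    "Gender": ["f"] * 20,
    "Type": ["A"] * 10 + ["B"] * 10,
    "Age": list(range(1, 11)) * 2,
    "Value": [654, 665, 684, 688, 651, 650, 698, 689, 648, 654,
              623, 620, 623, 653, 653, 642, 632, 632, 644, 654],
})

# Centered 5-row rolling sum computed separately inside each (Gender, Type) group.
rolled = (
    df.set_index("Age")
      .groupby(["Gender", "Type"])["Value"]
      .rolling(window=5, center=True)
      .sum()
      .reset_index()
)

print(rolled.head(10))
```

With `center=True` and `window=5`, a full window needs two rows on each side, which is why the first two and last two rows of every group come out as NaN.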

Now my problem. I want to replace the NaN values with specific cells within each group. Specifically, the rolling sum for ages 1 and 2 in each group must be equal to that of age 3. As the age-3 row might itself be NaN because it could not be computed, I can't use code that just fills forward and backward with bfill or ffill. If the age-3 row is NaN, I want the age-1 and age-2 rows in that group to be NaN as well.

So the following result is what I want:

    Gender    Type   Age    Value
1   'f'       A      1       3342
2   'f'       A      2       3342
3   'f'       A      3       3342
4   'f'       A      4       3338
5   'f'       A      5       3371
6   'f'       A      6       3376
7   'f'       A      7       3336
8   'f'       A      8       3339
9   'f'       A      9       3339
10  'f'       A      10      3339
11  'f'       B      1       3172
12  'f'       B      2       3172
13  'f'       B      3       3172
14  'f'       B      4       3191
15  'f'       B      5       3203
16  'f'       B      6       3212
17  'f'       B      7       3203
18  'f'       B      8       3204
19  'f'       B      9       3204
20  'f'       B      10      3204
21  'm'       A      1       x1
22  'm'       A      2       x1
23  'm'       A      3       x1
24  'm'       A      4       x2
25  'm'       A      5       x3
26  'm'       A      6       x4
27  'm'       A      7       x5
28  'm'       A      8       x7
29  'm'       A      9       x7
30  'm'       A      10      x7
31  'm'       B      1       x8
32  'm'       B      2       x8
33  'm'       B      3       x8
34  'm'       B      4       x9
35  'm'       B      5       x10
36  'm'       B      6       x11
37  'm'       B      7       x12
38  'm'       B      8       x13
39  'm'       B      9       x13
40  'm'       B      10      x13

I really hope that one of you can help me. Thanks in advance.

After your initial groupby with rolling.sum, try groupby.transform with a custom function:

Setup

Make year 3 NaN in the first group for testing:

import numpy as np
df.loc[2, 'Value'] = np.nan

print(df)

   Gender Type  Age   Value
0     'f'    A    1     NaN
1     'f'    A    2     NaN
2     'f'    A    3     NaN
3     'f'    A    4  3338.0
4     'f'    A    5  3371.0
5     'f'    A    6  3376.0
6     'f'    A    7  3336.0
7     'f'    A    8  3339.0
8     'f'    A    9     NaN
9     'f'    A   10     NaN
10    'f'    B    1     NaN
...

Solution

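def custom_rolling_fillna(arr):
    # Work on a copy so the group passed in by transform is not modified in place
    arr = arr.copy()
    # The first two rows take the third value (stays NaN if the third is NaN)
    arr.iloc[:2] = arr.iloc[2]
    # The last two rows take the third-to-last value
    arr.iloc[-2:] = arr.iloc[-3]
    return arr

df['Value'] = df.groupby(['Gender', 'Type'])['Value'].transform(custom_rolling_fillna)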

print(df)

   Gender Type  Age   Value
0     'f'    A    1     NaN
1     'f'    A    2     NaN
2     'f'    A    3     NaN
3     'f'    A    4  3338.0
4     'f'    A    5  3371.0
5     'f'    A    6  3376.0
6     'f'    A    7  3336.0
7     'f'    A    8  3339.0
8     'f'    A    9  3339.0
9     'f'    A   10  3339.0
10    'f'    B    1  3172.0
...
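Note that ages 1 to 3 of the first group stay NaN, which is what the question asks for. For comparison, here is a small sketch of how bfill/ffill would behave on that same group. The series `s` is typed in by hand from the test output above, and `filled` and `wanted` are illustrative names.

```python
import numpy as np
import pandas as pd

# First group from the test setup: the age-3 rolling sum is itself missing.
s = pd.Series(
    [np.nan, np.nan, np.nan, 3338.0, 3371.0, 3376.0, 3336.0, 3339.0, np.nan, np.nan],
    index=range(1, 11),
    name="Value",
)

# bfill/ffill copy the nearest valid value, so ages 1-3 pick up the age-4 sum.
filled = s.bfill().ffill()

# Positional assignment copies whatever sits in the third row, so NaN stays NaN.
wanted = s.copy()
wanted.iloc[:2] = wanted.iloc[2]
wanted.iloc[-2:] = wanted.iloc[-3]

print(pd.DataFrame({"bfill_ffill": filled, "positional": wanted}))
```

The two columns differ only in ages 1 to 3: bfill fills them with 3338.0, while the positional version leaves them as NaN.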

Alternatively, you could do this in one step, starting from the original (un-summed) values:

def custom_rolling_fillna(arr):
    rolling = arr.rolling(window=5, center=True).sum()
    # Take the edge values from the rolling sums, not from the raw input
    rolling.iloc[:2] = rolling.iloc[2]
    rolling.iloc[-2:] = rolling.iloc[-3]
    return rolling


df['Value'] = df.groupby(['Gender', 'Type'])['Value'].transform(custom_rolling_fillna)
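As a sanity check, here is a self-contained sketch of the one-step approach run on the two 'f' groups from the question. The function name `rolling_sum_with_edge_fill` and its `window` parameter are hypothetical generalizations, not part of the answer above. The edge values are read from the rolled series, so the edges hold rolling sums rather than raw values.

```python
import pandas as pd

def rolling_sum_with_edge_fill(values, window=5):
    """Centered rolling sum whose edge NaNs copy the nearest full-window sum."""
    half = window // 2
    rolled = values.rolling(window=window, center=True).sum()
    if len(rolled) >= window:
        # Read both edge values first, then overwrite the incomplete windows.
        first_full = rolled.iloc[half]
        last_full = rolled.iloc[-half - 1]
        rolled.iloc[:half] = first_full
        rolled.iloc[-half:] = last_full
    return rolled

# The ('f', 'A') and ('f', 'B') groups copied from the question.
df = pd.DataFrame({
    "Gender": ["f"] * 20,
    "Type": ["A"] * 10 + ["B"] * 10,
    "Age": list(range(1, 11)) * 2,
    "Value": [654, 665, 684, 688, 651, 650, 698, 689, 648, 654,
              623, 620, 623, 653, 653, 642, 632, 632, 644, 654],
})

df["Value"] = (
    df.groupby(["Gender", "Type"])["Value"].transform(rolling_sum_with_edge_fill)
)
print(df)
```

On this data, ages 1 to 3 of group ('f', 'A') all show 3342.0 and ages 8 to 10 all show 3339.0, matching the desired output in the question. Groups shorter than the window are left as all NaN in this sketch.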
