
Replacing NaNs with Mean Value using Pandas

Say I have a DataFrame called Data with shape (71067, 4):

       StartTime          EndDateTime        TradeDate  Values
0   2018-12-31 23:00:00 2018-12-31 23:30:00 2019-01-01  -44.676
1   2018-12-31 23:30:00 2019-01-01 00:00:00 2019-01-01  -36.113
2   2019-01-01 00:00:00 2019-01-01 00:30:00 2019-01-01  -19.229
3   2019-01-01 00:30:00 2019-01-01 01:00:00 2019-01-01  -23.606
4   2019-01-01 01:00:00 2019-01-01 01:30:00 2019-01-01  -25.899
... ... ... ... ...
    2023-01-30 20:30:00 2023-01-30 21:00:00 2023-01-30  -27.198
    2023-01-30 21:00:00 2023-01-30 21:30:00 2023-01-30  -13.221
    2023-01-30 21:30:00 2023-01-30 22:00:00 2023-01-30  -12.034
    2023-01-30 22:00:00 2023-01-30 22:30:00 2023-01-30  -16.464
    2023-01-30 22:30:00 2023-01-30 23:00:00 2023-01-30  -25.441
71067 rows × 4 columns

When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:

Data.isna().sum().sum()
> 1391

Shown here:

Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')

0   2019-01-01 03:30:00 2019-01-01 04:00:00 2019-01-01  NaN
1   2019-01-04 02:30:00 2019-01-04 03:00:00 2019-01-04  NaN
2   2019-01-04 03:00:00 2019-01-04 03:30:00 2019-01-04  NaN
3   2019-01-04 03:30:00 2019-01-04 04:00:00 2019-01-04  NaN
4   2019-01-04 04:00:00 2019-01-04 04:30:00 2019-01-04  NaN
... ... ... ... ...
1386    2022-12-06 13:00:00 2022-12-06 13:30:00 2022-12-06  NaN
1387    2022-12-06 13:30:00 2022-12-06 14:00:00 2022-12-06  NaN
1388    2022-12-22 11:00:00 2022-12-22 11:30:00 2022-12-22  NaN
1389    2023-01-25 11:00:00 2023-01-25 11:30:00 2023-01-25  NaN
1390    2023-01-25 11:30:00 2023-01-25 12:00:00 2023-01-25  NaN

Is there any way of replacing each NaN value in the dataset with the mean value of the corresponding half hour across the 70,000-plus rows? The half-hourly means are computed below:

# Time-of-day key for each row (the half-hour slot)
Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
# Mean of Values per half-hour slot; only showing the first 10 means
Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)

    HH          Values
0   00:00:00    5.236811
1   00:30:00    2.056571
2   01:00:00    4.157455
3   01:30:00    2.339253
4   02:00:00    2.658238
5   02:30:00    0.230557
6   03:00:00    0.217599
7   03:30:00    -0.630243
8   04:00:00    -0.989919
9   04:30:00    -0.494372

For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?

Any help greatly appreciated.

Let's group the DataFrame by HH, then transform Values with mean to broadcast each group's mean back to the shape of the original column, then use fillna to fill the null values:

avg = Data.groupby('HH')['Values'].transform('mean')
Data['Values'] = Data['Values'].fillna(avg)
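
To see what the two lines above do, here is a minimal, self-contained sketch on a small made-up frame. The toy data and the lowercase name `data` are assumptions for illustration only, not the asker's dataset. It builds the same HH key as in the question, broadcasts the per-slot mean with transform, and fills only the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three days, two half-hour slots per day
data = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2019-01-01 04:00", "2019-01-01 04:30",
        "2019-01-02 04:00", "2019-01-02 04:30",
        "2019-01-03 04:00", "2019-01-03 04:30",
    ]),
    "Values": [1.0, 10.0, np.nan, 20.0, 3.0, np.nan],
})

# Time-of-day key, built the same way as in the question
data["HH"] = data["StartTime"].dt.time

# Per-slot mean, repeated on every row of that slot.
# mean skips NaN, so the 04:00 mean is (1.0 + 3.0) / 2 = 2.0
# and the 04:30 mean is (10.0 + 20.0) / 2 = 15.0.
avg = data.groupby("HH")["Values"].transform("mean")

# fillna aligns on the index and only touches the missing entries
data["Values"] = data["Values"].fillna(avg)

print(data[["StartTime", "Values"]])
```

Because transform returns a Series with the same index as the original column, no merge or join is needed. Note that if every value in a given slot were NaN, that slot's mean would itself be NaN and those rows would stay unfilled.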


 