Python pandas: substitute missing values within a time series

I am currently working on a routine to process meteorological data from different weather stations. Unfortunately, data is missing from time to time. I wrote a routine that merges the data from all stations into one dataframe and added a "NaN_Flag" column that marks missing data at a given timestamp.

The plan now is to fill those gaps with data from a nearby station. For temperature and humidity, interpolation would be possible but, depending on the size of the gap, not ideal. For rain events, interpolation would not make any sense.

The first column is the index column, containing date, time and location. I am looking for a way to fill in the measured data of another location (at the same time) whenever "NaN_Flag" shows "1".

So in the following simplified example, I would like the Location1 record for 2020-01-01 00:20:00 to be replaced automatically with the Location2 data for the same datetime. Every location has a "backup" location, and whenever "NaN_Flag" shows "1" the data should be replaced automatically with the corresponding backup data. Does anyone have an idea how to accomplish that?

DATETIME_UTC_LOCATION           DATETIME_UTC              LOCATION    TEMP   PLUV   HUM   NaN_FLAG 
2020-01-01 00:00:00 Location1   2020-01-01 00:00:00       Location1   5.25   0.0    87.3  0
2020-01-01 00:10:00 Location1   2020-01-01 00:10:00       Location1   6.12   0.1    85.0  0
2020-01-01 00:20:00 Location1   2020-01-01 00:20:00       Location1                       1
2020-01-01 00:00:00 Location2   2020-01-01 00:00:00       Location2   5.12   0.0    88.9  0
2020-01-01 00:10:00 Location2   2020-01-01 00:10:00       Location2   6.25   0.1    84.3  0
2020-01-01 00:20:00 Location2   2020-01-01 00:20:00       Location2   6.75   0.2    82.5  0

If your dataframe has a format equivalent to this one:

import pandas as pd
import numpy as np


df = pd.DataFrame(data={'month': ["Jan","Feb","Mar","Jan","Feb","Mar"],
                        'station': ["station_1","station_1","station_1","station_2","station_2","station_2"],
                        'values': [3.2, np.nan, 4.1, 3.6, 5.8, 4.2]}).set_index('month')

Output:

              station   values
    month       
    Jan       station_1    3.2
    Feb       station_1    NaN
    Mar       station_1    4.1
    Jan       station_2    3.6
    Feb       station_2    5.8
    Mar       station_2    4.2

You can use:

df.loc[df['station'] == "station_1"] = df.loc[df['station'] == "station_1"].fillna(df.loc[df['station'] == "station_2"])

to replace the NaN values of station 1 with the equivalent values of station 2. By "equivalent" I mean matching in the "month" index.

Output:

              station   values
    month       
    Jan       station_1    3.2
    Feb       station_1    5.8
    Mar       station_1    4.1
    Jan       station_2    3.6
    Feb       station_2    5.8
    Mar       station_2    4.2
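
The same idea can be carried over to the layout in the question, where each station has its own backup station. What follows is only a sketch under assumptions, not a tested solution for the real data: the `backup_of` mapping, the `BACKUP` helper column and the `_backup` suffix are hypothetical names introduced for illustration, and each (DATETIME_UTC, LOCATION) pair is assumed to occur exactly once. It uses a left merge instead of `fillna`, so that only rows with NaN_FLAG equal to 1 are touched.

```python
import numpy as np
import pandas as pd

# Toy data copied from the example table in the question.
df = pd.DataFrame({
    "DATETIME_UTC": pd.to_datetime(
        ["2020-01-01 00:00:00", "2020-01-01 00:10:00", "2020-01-01 00:20:00"] * 2
    ),
    "LOCATION": ["Location1"] * 3 + ["Location2"] * 3,
    "TEMP": [5.25, 6.12, np.nan, 5.12, 6.25, 6.75],
    "PLUV": [0.0, 0.1, np.nan, 0.0, 0.1, 0.2],
    "HUM": [87.3, 85.0, np.nan, 88.9, 84.3, 82.5],
    "NaN_FLAG": [0, 0, 1, 0, 0, 0],
})

# Hypothetical mapping (assumption): station -> its backup station.
backup_of = {"Location1": "Location2"}
value_cols = ["TEMP", "PLUV", "HUM"]

# Attach each row's backup station; stations without a backup get NaN.
with_backup = df.assign(BACKUP=df["LOCATION"].map(backup_of))

# Measurements of every station, keyed so they can be joined as backup data.
donors = df[["DATETIME_UTC", "LOCATION"] + value_cols].rename(
    columns={"LOCATION": "BACKUP"}
)

# A left merge keeps every original row, in order, and adds *_backup columns.
merged = with_backup.merge(
    donors, on=["DATETIME_UTC", "BACKUP"], how="left", suffixes=("", "_backup")
)

# Replace values only where the flag marks the row as missing.
flagged = merged["NaN_FLAG"].eq(1)
for col in value_cols:
    merged[col] = merged[col].where(~flagged, merged[col + "_backup"])

# Drop the helper columns again.
filled = merged[df.columns.tolist()]
print(filled)
```

The flag column is left unchanged here, so it still records which rows were filled from a backup. If the backup station is missing the same timestamp, the value simply stays NaN.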
