Fill datetimeindex gap by NaN

I have two dataframes, both indexed by a DatetimeIndex. One ( df1 ) is missing a few timestamps, while the other ( df2 ) is complete (regular timestamps without any gaps) and is full of NaN's.

I'm trying to match the values from df1 to the index of df2 , filling with NaN wherever a timestamp doesn't exist in df1 .

Example:

In  [51]: df1
Out [51]:                       value
          2015-01-01 14:00:00   20
          2015-01-01 15:00:00   29
          2015-01-01 16:00:00   41
          2015-01-01 17:00:00   43
          2015-01-01 18:00:00   26
          2015-01-01 19:00:00   20
          2015-01-01 20:00:00   31
          2015-01-01 21:00:00   35
          2015-01-01 22:00:00   39
          2015-01-01 23:00:00   17
          2015-03-01 00:00:00   6
          2015-03-01 01:00:00   37
          2015-03-01 02:00:00   56
          2015-03-01 03:00:00   12
          2015-03-01 04:00:00   41
          2015-03-01 05:00:00   31
          ...   ...

          2018-12-25 23:00:00   41

          <34843 rows × 1 columns>

In  [52]: df2 = pd.DataFrame(data=None, index=pd.date_range(freq='60Min', start=df1.index.min(), end=df1.index.max()))
          df2['value'] = np.nan
          df2
Out [52]:                       value
          2015-01-01 14:00:00   NaN
          2015-01-01 15:00:00   NaN
          2015-01-01 16:00:00   NaN
          2015-01-01 17:00:00   NaN
          2015-01-01 18:00:00   NaN
          2015-01-01 19:00:00   NaN
          2015-01-01 20:00:00   NaN
          2015-01-01 21:00:00   NaN
          2015-01-01 22:00:00   NaN
          2015-01-01 23:00:00   NaN
          2015-01-02 00:00:00   NaN
          2015-01-02 01:00:00   NaN
          2015-01-02 02:00:00   NaN
          2015-01-02 03:00:00   NaN
          2015-01-02 04:00:00   NaN
          2015-01-02 05:00:00   NaN
          ...                   ...
          2018-12-25 23:00:00   NaN

          <34906 rows × 1 columns>

Using df2.combine_first(df1) returns the same data as df1.reindex(index=df2.index) , which fills the gaps where there shouldn't be any data with some value instead of NaN.

In  [53]: Result = df2.combine_first(df1)
          Result
Out [53]:                       value
          2015-01-01 14:00:00   20
          2015-01-01 15:00:00   29
          2015-01-01 16:00:00   41
          2015-01-01 17:00:00   43
          2015-01-01 18:00:00   26
          2015-01-01 19:00:00   20
          2015-01-01 20:00:00   31
          2015-01-01 21:00:00   35
          2015-01-01 22:00:00   39
          2015-01-01 23:00:00   17
          2015-01-02 00:00:00   35
          2015-01-02 01:00:00   53
          2015-01-02 02:00:00   28
          2015-01-02 03:00:00   48
          2015-01-02 04:00:00   42
          2015-01-02 05:00:00   51
          ...                   ...
          2018-12-25 23:00:00   41

          <34906 rows × 1 columns>

This is what I was hoping to get:

Out [53]:                       value
          2015-01-01 14:00:00   20
          2015-01-01 15:00:00   29
          2015-01-01 16:00:00   41
          2015-01-01 17:00:00   43
          2015-01-01 18:00:00   26
          2015-01-01 19:00:00   20
          2015-01-01 20:00:00   31
          2015-01-01 21:00:00   35
          2015-01-01 22:00:00   39
          2015-01-01 23:00:00   17
          2015-01-02 00:00:00   NaN
          2015-01-02 01:00:00   NaN
          2015-01-02 02:00:00   NaN
          2015-01-02 03:00:00   NaN
          2015-01-02 04:00:00   NaN
          2015-01-02 05:00:00   NaN
          ...                   ...
          2018-12-25 23:00:00   41

          <34906 rows × 1 columns>

Could someone shed some light on why this is happening, and on how to control the way these values are filled?

IIUC, you need to resample df1 , because it has an irregular frequency and you need a regular one:

print(df1.index.freq)
None

print(Result.index.freq)
<60 * Minutes>
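
The frequency check above can be reproduced on a small frame. This is only a sketch: the four timestamps and values below are made up for illustration and are not the asker's data, and the mean() aggregation is an assumption chosen because each hourly bin holds at most one row.

```python
import pandas as pd

# Hypothetical hourly data with a gap: 03:00 and 04:00 are missing.
idx = pd.to_datetime([
    "2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 02:00",
    "2015-01-01 05:00",
])
df = pd.DataFrame({"value": [52, 5, 12, 54]}, index=idx)

# An index built from arbitrary timestamps carries no frequency,
# and none can be inferred because the spacing is not uniform.
print(df.index.freq)
print(df.index.inferred_freq)

# Resampling to 60 minutes produces a regular index; bins with no
# source rows come out as NaN.
regular = df.resample("60min").mean()
print(regular.index.freq)
print(regular)
```

The resampled frame has one row per hour between the first and last timestamp, so the two missing hours show up as NaN rows.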

EDIT1
You can use the function asfreq instead of resample - doc , resample vs asfreq .
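
As a rough comparison, the sketch below runs asfreq, resample and reindex on the same small frame. The three timestamps and values are invented for illustration. asfreq only conforms the index to the new frequency and does no aggregation, so it fits data that already has at most one row per hour.

```python
import pandas as pd

# Hypothetical data: 2015-01-02 00:00 and 01:00 are missing.
idx = pd.to_datetime(["2015-01-01 22:00", "2015-01-01 23:00", "2015-01-02 02:00"])
df = pd.DataFrame({"value": [39, 17, 28]}, index=idx)

# asfreq: conform the index to an hourly grid, new rows become NaN.
via_asfreq = df.asfreq("60min")

# resample: group into hourly bins and aggregate; empty bins become NaN.
via_resample = df.resample("60min").mean()

# reindex: the same result, with the target index built explicitly.
full_index = pd.date_range(df.index.min(), df.index.max(), freq="60min")
via_reindex = df.reindex(full_index)

print(via_asfreq)
```

On this data all three approaches give the same five rows, with NaN at the two missing hours.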

EDIT2
At first I thought that resample hadn't worked, because after resampling Result looked the same as df1 . But df1.info() and Result.info() report different sizes: 34857 entries vs. 34920 entries . So I looked for the rows with NaN values, and there are 63 rows .

So I think resample works well.

import pandas as pd

df1 = pd.read_csv('test/GapInTimestamps.csv', sep=",", index_col=[0], parse_dates=[0])
print(df1.head())

#                     value
#Date/Time                 
#2015-01-01 00:00:00     52
#2015-01-01 01:00:00      5
#2015-01-01 02:00:00     12
#2015-01-01 03:00:00     54
#2015-01-01 04:00:00     47
print(df1.info())

#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34857 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Data columns (total 1 columns):
#value    34857 non-null int64
#dtypes: int64(1)
#memory usage: 544.6 KB
#None

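# since pandas 0.18 resample() returns a Resampler, so an aggregation is needed;
# mean() matches the old default and leaves empty bins as NaN
Result = df1.resample('60min').mean()
print(Result.head())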

#                     value
#Date/Time                 
#2015-01-01 00:00:00     52
#2015-01-01 01:00:00      5
#2015-01-01 02:00:00     12
#2015-01-01 03:00:00     54
#2015-01-01 04:00:00     47
print(Result.info())

#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34920 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Freq: 60T
#Data columns (total 1 columns):
#value    34857 non-null float64
#dtypes: float64(1)
#memory usage: 545.6 KB
#None

# find rows that contain NaN
resultnan = Result[Result.isnull().any(axis=1)]
# temporarily display up to 999 rows and 15 columns
with pd.option_context('display.max_rows', 999, 'display.max_columns', 15):
    print(resultnan)

#                     value
#Date/Time                 
#2015-01-13 19:00:00    NaN
#2015-01-13 20:00:00    NaN
#2015-01-13 21:00:00    NaN
#2015-01-13 22:00:00    NaN
#2015-01-13 23:00:00    NaN
#2015-01-14 00:00:00    NaN
#2015-01-14 01:00:00    NaN
#2015-01-14 02:00:00    NaN
#2015-01-14 03:00:00    NaN
#2015-01-14 04:00:00    NaN
#2015-01-14 05:00:00    NaN
#2015-01-14 06:00:00    NaN
#2015-01-14 07:00:00    NaN
#2015-01-14 08:00:00    NaN
#2015-01-14 09:00:00    NaN
#2015-02-01 00:00:00    NaN
#2015-02-01 01:00:00    NaN
#2015-02-01 02:00:00    NaN
#2015-02-01 03:00:00    NaN
#2015-02-01 04:00:00    NaN
#2015-02-01 05:00:00    NaN
#2015-02-01 06:00:00    NaN
#2015-02-01 07:00:00    NaN
#2015-02-01 08:00:00    NaN
#2015-02-01 09:00:00    NaN
#2015-02-01 10:00:00    NaN
#2015-02-01 11:00:00    NaN
#2015-02-01 12:00:00    NaN
#2015-02-01 13:00:00    NaN
#2015-02-01 14:00:00    NaN
#2015-02-01 15:00:00    NaN
#2015-02-01 16:00:00    NaN
#2015-02-01 17:00:00    NaN
#2015-02-01 18:00:00    NaN
#2015-02-01 19:00:00    NaN
#2015-02-01 20:00:00    NaN
#2015-02-01 21:00:00    NaN
#2015-02-01 22:00:00    NaN
#2015-02-01 23:00:00    NaN
#2015-11-01 00:00:00    NaN
#2015-11-01 01:00:00    NaN
#2015-11-01 02:00:00    NaN
#2015-11-01 03:00:00    NaN
#2015-11-01 04:00:00    NaN
#2015-11-01 05:00:00    NaN
#2015-11-01 06:00:00    NaN
#2015-11-01 07:00:00    NaN
#2015-11-01 08:00:00    NaN
#2015-11-01 09:00:00    NaN
#2015-11-01 10:00:00    NaN
#2015-11-01 11:00:00    NaN
#2015-11-01 12:00:00    NaN
#2015-11-01 13:00:00    NaN
#2015-11-01 14:00:00    NaN
#2015-11-01 15:00:00    NaN
#2015-11-01 16:00:00    NaN
#2015-11-01 17:00:00    NaN
#2015-11-01 18:00:00    NaN
#2015-11-01 19:00:00    NaN
#2015-11-01 20:00:00    NaN
#2015-11-01 21:00:00    NaN
#2015-11-01 22:00:00    NaN
#2015-11-01 23:00:00    NaN
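
The row counts above are consistent with each other: 34920 - 34857 = 63, which is the number of NaN rows listed. The same check can be sketched on synthetic data. The three-day range and the six-hour gap below are assumptions made up for the example, not taken from the CSV file.

```python
import pandas as pd

# A complete hourly grid over three days: 72 timestamps.
full = pd.date_range("2015-01-01 00:00", "2015-01-03 23:00", freq="60min")

# Hypothetical gap of six hours, removed to imitate missing data.
gap = pd.date_range("2015-01-02 00:00", "2015-01-02 05:00", freq="60min")
df = pd.DataFrame({"value": range(len(full))}, index=full).drop(gap)

# Restore the regular hourly index; the gap comes back as NaN rows.
result = df.asfreq("60min")

# Timestamps present after regularizing but absent from the source.
missing = result.index.difference(df.index)
n_nan = int(result["value"].isna().sum())

# The increase in row count equals the number of NaN rows.
print(len(df), len(result), n_nan)
print(missing)
```

If the two numbers did not match, the source would already contain NaN values or duplicate timestamps, which would be worth checking before regularizing the index.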
