Python: upsampling dataframe from daily to hourly data using ffill()
I'm trying to upsample my data from daily to hourly frequency and forward fill missing data.
I start with the following code:
import pandas as pd

df1 = pd.read_csv("DATA.csv")
df1.head(5)
I then used the following to convert the column to a datetime string and set the date/time as the index:
df1['DT'] = pd.to_datetime(df1['DT']).dt.strftime('%Y-%m-%d %H:%M:%S')
df1.set_index('DT')
I try to resample hourly as follows:
df1['DT'] = df1.resample('H').ffill()
But I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I thought my dtype was already datetime, as instructed by the pd.to_datetime code above. Nothing I try seems to be working. Can anyone please help me?
My expected output is as follows:
DT VALUE
2016-08-01 00:00:00 0.000000
2016-08-01 01:00:00 0.000000
2016-08-01 02:00:00 0.000000
etc.
The file itself has approximately 1000 rows. The first 50 rows or so are zero, so to clarify, here is a part where there's actual data:
DT VALUE
2018-12-13 00:00:00 24000.000000
2018-12-13 01:00:00 24000.000000
2018-12-13 02:00:00 24000.000000
...
2018-12-13 23:00:00 24000.000000
2018-12-14 00:00:00 26000.000000
2018-12-14 01:00:00 26000.000000
etc.
Try assigning the result back:
df1=df1.set_index('DT')
Or:
df1.set_index('DT',inplace=True)
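To see why the original call had no effect, here is a minimal sketch. The three-row frame is made up for illustration and stands in for DATA.csv. set_index() returns a new DataFrame and leaves the original untouched unless the result is assigned back (or inplace=True is passed).

```python
import pandas as pd

# Hypothetical stand-in for DATA.csv (values are assumptions, not the real file)
df1 = pd.DataFrame({
    "DT": ["2016-08-01", "2016-08-02", "2016-08-03"],
    "VALUE": [0.0, 0.0, 0.0],
})
df1["DT"] = pd.to_datetime(df1["DT"])

# The result is discarded, so df1 keeps its default integer index
df1.set_index("DT")
index_before = type(df1.index).__name__

# The result is kept, so df1 now has a datetime index
df1 = df1.set_index("DT")
index_after = type(df1.index).__name__

print(index_before, index_after)
```

The first name printed is the index type that resample() complained about in the error message; the second is the type it accepts.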
I am assuming the initial rows of your dataset look like this, as you mentioned:
DT VALUE
0 2016-08-01 0
1 2016-08-02 0
2 2016-08-03 0
3 2016-08-04 0
4 2016-08-05 0
5 2016-08-06 0
6 2016-08-07 0
7 2016-08-08 0
8 2016-08-09 0
Then set DT as the index, like this:
df = df.set_index('DT')
df
Output:
VALUE
DT
2016-08-01 0
2016-08-02 0
2016-08-03 0
2016-08-04 0
2016-08-05 0
2016-08-06 0
2016-08-07 0
2016-08-08 0
2016-08-09 0
Now, resample your dataframe:
df = df.resample('H').ffill()
df
Output (showing only the first few rows):
VALUE
DT
2016-08-01 00:00:00 0
2016-08-01 01:00:00 0
2016-08-01 02:00:00 0
2016-08-01 03:00:00 0
2016-08-01 04:00:00 0
2016-08-01 05:00:00 0
2016-08-01 06:00:00 0
2016-08-01 07:00:00 0
2016-08-01 08:00:00 0
2016-08-01 09:00:00 0
2016-08-01 10:00:00 0
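One thing worth checking against the expected output in the question: resample().ffill() only produces rows up to the last timestamp in the index, so the final day ends up with a single 00:00 row. The sketch below uses made-up values and shows one possible way to cover the last day as well, by reindexing onto an hourly range that runs to 23:00. It uses the lowercase "h" alias because recent pandas releases deprecate "H"; the frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily data; the values mirror the shape of the question's sample
daily = pd.DataFrame(
    {"VALUE": [0, 24000, 26000]},
    index=pd.DatetimeIndex(["2018-12-12", "2018-12-13", "2018-12-14"], name="DT"),
)

# Plain upsampling: stops at 2018-12-14 00:00:00
hourly = daily.resample("h").ffill()

# Build an hourly index that also covers the rest of the last day
full_index = pd.date_range(
    start=daily.index.min(),
    end=daily.index.max() + pd.Timedelta(hours=23),
    freq="h",
    name="DT",
)

# Reindex onto it, carrying each daily value forward
hourly_full = daily.reindex(full_index, method="ffill")

print(len(hourly), len(hourly_full))
print(hourly_full.tail(3))
```

Whether the last day should be filled at all depends on what the daily value means, so treat the reindex step as optional.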
You could convert the index to a pd.DatetimeIndex and then resample that. I also don't think you need (or want) the strftime() call:
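df1 = pd.read_csv("DATA.csv")
df1['DT'] = pd.to_datetime(df1['DT'])  # keep real datetimes; no strftime()
df1 = df1.set_index('DT')  # set_index returns a new frame, so assign it back
df1.index = pd.DatetimeIndex(df1.index)  # already a DatetimeIndex here; kept as an explicit guarantee
df1 = df1.resample('H').ffill()  # replace the whole frame with the hourly version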
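As a small check of the strftime() remark above, the following sketch (with a made-up two-element Series) shows that .dt.strftime() turns datetimes back into plain strings, which is why the column in the question was no longer a datetime type after that call.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical raw date strings, as they might come out of a CSV
raw = pd.Series(["2016-08-01", "2016-08-02"])

as_datetime = pd.to_datetime(raw)
as_string = as_datetime.dt.strftime("%Y-%m-%d %H:%M:%S")

print(is_datetime64_any_dtype(as_datetime))  # datetime values
print(is_datetime64_any_dtype(as_string))    # formatted text, not datetimes
print(as_string.iloc[0])
```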
NOTE: You could probably combine a bunch of this and it would still be quite clear, like:
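df1 = pd.read_csv("DATA.csv")
df1.index = pd.DatetimeIndex(pd.to_datetime(df1.pop('DT')))  # pop() moves DT out of the columns and into the index
df1 = df1.resample('H').ffill()  # assign the resampled frame back, not into a column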
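Another compact variant, not from the answers above, is to let read_csv() parse the dates and set the index in one step. The sketch below reads from an in-memory string so that it runs on its own; it assumes the file is comma-separated with a header row of DT,VALUE, which the question does not confirm.

```python
import io

import pandas as pd

# Hypothetical file contents; the real DATA.csv layout is an assumption
csv_text = "DT,VALUE\n2016-08-01,0\n2016-08-02,0\n2016-08-03,5\n"

# parse_dates converts the DT column, index_col makes it the index
df1 = pd.read_csv(io.StringIO(csv_text), parse_dates=["DT"], index_col="DT")

# "h" is used instead of "H", which recent pandas releases deprecate
hourly = df1.resample("h").ffill()

print(type(hourly.index).__name__, len(hourly))
```

With a real file, io.StringIO(csv_text) would be replaced by the path "DATA.csv".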