Pandas: rearrange and interpolate time series with a datetime index
I have a recurring problem that I solve inelegantly every time, and I am unable to find a good way to do it. Let's say I have a dataframe with a datetime index, with one row every 3 hours (df1). I have another dataframe with one row per day (df2), with a few days missing.
I want to do 2 things:
1. Calculate the daily mean of df1.
2. Fill in the missing days in df2.
Issues: I use for loops (and want to avoid that), and the resampling of the missing days is incomplete (I can only assign a single value).
This is how I was doing it:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Create df1: one year of 3-hourly values (8 per day)
rng = pd.date_range('2000-01-01', periods=365 * 8, freq='3h')
df1 = pd.DataFrame({'Val': np.random.randn(len(rng))}, index=rng)

# Create df2 (daily values) and drop a few rows
rng2 = pd.date_range('2000-01-01', periods=365, freq='D')
df2 = pd.DataFrame({'Val': np.random.randn(len(rng2))}, index=rng2)
df2 = df2.drop([datetime(2000, 1, 5), datetime(2000, 1, 24)])

# Create reference time list (one entry per day)
date_list = [datetime(2000, 1, 1) + timedelta(days=x) for x in range(365)]

# Calculate the daily mean of df1:
# arr hosts the resampled values of df1
arr = []
c = 1
# Every time we hit a new day, append the mean of the day that just passed
for i in range(1, len(df1)):
    if c < 365 and df1.index[i] == date_list[c]:
        arr.append(df1['Val'].iloc[i-8:i].mean())
        c = c + 1
# Calculate the last value of the array (the final day)
arr.append(df1['Val'].iloc[i-7:i+1].mean())

# Create a new dataframe hosting the daily values from df1
df3 = pd.DataFrame({'Val': arr}, index=rng2)

# Replace missing days in df2
df2 = df2.reindex(date_list, fill_value=0)
df2 = df2.resample('D').interpolate(method='linear')  # but this does not work
I think there are simple fixes for both of these issues; you just need to update your use of resample for both.
Your first point is precisely a case of downsampling with resample. You can replace your whole creation of df3 with:
df1.resample('D').mean()
This averages all the 3-hour periods within each day. To confirm, we can check that your results are the same as what I am proposing:
>>> bool((df1.resample('D').mean().round(8) == df3.round(8)).all().all())
True
Note that I have to round because there are small floating point differences between your code and resample, but they are extremely close. (Also note that calling the built-in all() directly on the DataFrame would only iterate over its column labels, so the comparison is reduced with .all().all() instead.)
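As a rough, self-contained sketch of the same idea (using made-up sequential values instead of the random data above, so the numbers are easy to check by hand), the daily mean from resample can be compared against a manual reshape. The variable names `daily` and `manual` are only illustrative, and the frequency is passed as a `pd.Timedelta` simply to avoid depending on a particular frequency-alias spelling:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not the asker's values):
# one year of 3-hourly samples holding 0.0, 1.0, 2.0, ...
rng = pd.date_range("2000-01-01", periods=365 * 8, freq=pd.Timedelta(hours=3))
df1 = pd.DataFrame({"Val": np.arange(len(rng), dtype=float)}, index=rng)

# Downsample to one row per day, averaging the 8 samples of each day.
daily = df1.resample("D").mean()

# Cross-check without resample: every day has exactly 8 consecutive rows,
# so reshaping to (365, 8) and averaging each row gives the daily means.
manual = df1["Val"].to_numpy().reshape(-1, 8).mean(axis=1)

print(len(daily))
print(np.allclose(daily["Val"], manual))
```

The reshape check only works here because the sample data is perfectly regular; resample does not need that assumption, since it groups rows by their timestamps rather than by position.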
When you interpolate in the second case to fill the missing days, you still need the missing days to be missing! That is, if you reindex first and fill the values with 0, the interpolation "fails" because it doesn't find anything to interpolate. So if I understand your issue correctly, you just want to remove the reindex line:
# df2 = df2.reindex(date_list, fill_value=0)
df2 = df2.resample('D').interpolate(method='linear')
So if you start with df2 like this:
>>> df2.head(10)
                 Val
2000-01-01  0.235151
2000-01-02  1.279017
2000-01-03 -1.267074
2000-01-04 -0.270182  # the fifth is missing
2000-01-06  0.382649
2000-01-07  0.120253
2000-01-08 -0.223690
2000-01-09  1.379003
2000-01-10 -0.477681
2000-01-11  0.619466
You end with this:
>>> df2.head(10)
                 Val
2000-01-01  0.235151
2000-01-02  1.279017
2000-01-03 -1.267074
2000-01-04 -0.270182
2000-01-05  0.056233  # the fifth is here, halfway between 4th and 6th
2000-01-06  0.382649
2000-01-07  0.120253
2000-01-08 -0.223690
2000-01-09  1.379003
2000-01-10 -0.477681
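To make the difference concrete, here is a small sketch on a 10-day toy frame (the values 0.0 to 9.0 are an assumption chosen so the interpolated result is obvious). It compares the zero-filled reindex with the resample approach, and also shows what should be an equivalent alternative if you prefer to keep reindex: leave out fill_value so the gap stays NaN.

```python
import numpy as np
import pandas as pd

# Toy daily data (illustrative values only): 0.0 .. 9.0, with Jan 5 dropped.
idx = pd.date_range("2000-01-01", periods=10, freq="D")
df2 = pd.DataFrame({"Val": np.arange(10, dtype=float)}, index=idx)
missing_day = pd.Timestamp("2000-01-05")
df2 = df2.drop([missing_day])

# Reindexing with fill_value=0 leaves no NaN behind, so interpolate()
# has nothing to fill and the 0 stays in place.
zero_filled = df2.reindex(idx, fill_value=0).interpolate(method="linear")

# resample('D') re-creates the missing day as NaN, which interpolate() fills.
filled = df2.resample("D").interpolate(method="linear")

# Alternative: reindex without fill_value, so the gap is NaN, then interpolate.
alt = df2.reindex(idx).interpolate(method="linear")

print(zero_filled.loc[missing_day, "Val"])
print(filled.loc[missing_day, "Val"])
print(alt.loc[missing_day, "Val"])
```

With these toy values the zero-filled version keeps 0.0 on the missing day, while the other two give 4.0, halfway between the 4th and the 6th.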