Pandas: resample and interpolate a time series based on a datetime index

Question

I have a recurring problem that I solve inelegantly every time, and I have not been able to find a good way to handle it. Let's say I have a dataframe with a datetime index at 3-hour intervals (df1). I have another dataframe at daily intervals (df2).

I want to do 2 things:

Resample df1 to a daily frequency instead of every 3 hours, by taking the mean of the 3-hour periods within each day.
Interpolate df2 for any day that is missing, and insert that day where it belongs.

Issues: I use for loops (and want to avoid that), and the filling of missing days is incomplete (I can only assign a single value).

This is how I was doing it:

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Create df1
rng = pd.date_range('2000-01-01', periods=365 * 8, freq='3h')  # 8 three-hour periods per day; periods must be an integer
df1 = pd.DataFrame({'Val': np.random.randn(len(rng)) }, index = rng)

# Create df2 and drop a few rows
rng2 = pd.date_range('2000-01-01', periods=365, freq='D')
df2 = pd.DataFrame({'Val': np.random.randn(len(rng2)) },index = rng2)
df2 = df2.drop([datetime(2000,1,5),datetime(2000,1,24)])

# Create reference timelist 
date_list = [datetime(2000,1,1) + timedelta(days=x) for x in range(365)]


# Calculate the daily mean of df1:
# We create an array hosting the resampled values of df1
arr = []
c = 1

# Loop that appends to the array every time we hit a new day, taking the mean of the day that just passed
for i in range(1,len(df1)):

    if c < 365 and df1.index[i] == date_list[c]:
        arr.append(df1['Val'].iloc[i-8:i].mean())
        c = c + 1

# Calculate the last value of the array
arr.append(df1['Val'].iloc[i-7:i+1].mean())

# Create a new dataframe hosting the daily values from df1
df3 = pd.DataFrame({'Val': arr}, index = rng2)


# Replace missing days in df2
df2 = df2.reindex(date_list, fill_value=0)
df2 = df2.resample('D').interpolate(method='linear') # but this does not work

Answer 1 (accepted, 2021-10-22 18:37:12)

I think there are simple fixes for both of these issues; you just need to update your use of resample in each case.

First point: just resample

Your first point is precisely a case of downsampling with resample. You can replace your whole creation of df3 with:

df1.resample('D').mean()

This averages all the 3-hour periods within each day. To confirm, we can check that your results match what I am proposing:

>>> np.allclose(df1.resample('D').mean(), df3)
True

Note that I compare with a tolerance rather than testing exact equality, because there are tiny floating-point differences between your code and resample; the values are extremely close.
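As a small self-contained check of the downsampling step, here is a sketch on deterministic data rather than random values. The three-day range and the manual reshape comparison are my own assumptions for illustration, not part of the original question. The lowercase '3h' alias is used because recent pandas versions deprecate the uppercase 'H' spelling.

```python
import numpy as np
import pandas as pd

# Deterministic 3-hourly data: 3 days x 8 readings per day (values 0.0 .. 23.0).
rng = pd.date_range("2000-01-01", periods=3 * 8, freq="3h")
df1 = pd.DataFrame({"Val": np.arange(len(rng), dtype=float)}, index=rng)

# Downsample to one row per calendar day, averaging the 8 readings of each day.
daily = df1.resample("D").mean()

# Cross-check with a manual reshape: each row of the reshaped array is one day.
manual = df1["Val"].to_numpy().reshape(-1, 8).mean(axis=1)
assert np.allclose(daily["Val"].to_numpy(), manual)

print(daily)
```

The reshape comparison only works here because every day has exactly 8 readings; resample groups by calendar day, so it does not rely on that.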

Second point: don't reindex first

When you interpolate in the second case to fill the missing days, the missing days still need to be missing! In other words, if you reindex first and fill the gaps with 0, the interpolation "fails" because the gaps already hold a real value (0), so there is no NaN left to interpolate. So if I understand your issue correctly, you just want to remove the reindex line:

# df2 = df2.reindex(date_list, fill_value=0)
df2 = df2.resample('D').interpolate(method='linear')

So if you start with df2 like this:

>>> df2.head(10)
                 Val
2000-01-01  0.235151
2000-01-02  1.279017
2000-01-03 -1.267074
2000-01-04 -0.270182 # the fifth is missing
2000-01-06  0.382649
2000-01-07  0.120253
2000-01-08 -0.223690
2000-01-09  1.379003
2000-01-10 -0.477681
2000-01-11  0.619466

You end up with this:

>>> df2.head(10)
                 Val
2000-01-01  0.235151
2000-01-02  1.279017
2000-01-03 -1.267074
2000-01-04 -0.270182
2000-01-05  0.056233 # the fifth is here, halfway between 4th and 6th
2000-01-06  0.382649
2000-01-07  0.120253
2000-01-08 -0.223690
2000-01-09  1.379003
2000-01-10 -0.477681
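The same behaviour can be reproduced on a small deterministic series, including a gap of two consecutive days. This is only a sketch: the ten-day range, the names full, gappy, zero_filled, filled and reindexed, and the choice of dropped days are assumptions made for illustration. It also shows that reindexing without fill_value leaves NaN in the gaps, which interpolate can then fill.

```python
import numpy as np
import pandas as pd

# Ten daily values 0.0 .. 9.0, so the correct value for each day equals its position.
rng2 = pd.date_range("2000-01-01", periods=10, freq="D")
full = pd.DataFrame({"Val": np.arange(10, dtype=float)}, index=rng2)

# Drop one isolated day (Jan 5) and two consecutive days (Jan 7 and Jan 8).
gappy = full.drop(pd.to_datetime(["2000-01-05", "2000-01-07", "2000-01-08"]))

# Reindexing with fill_value=0 writes real zeros, so nothing is left to interpolate.
zero_filled = gappy.reindex(rng2, fill_value=0).interpolate(method="linear")
assert zero_filled.loc[pd.Timestamp("2000-01-05"), "Val"] == 0.0

# Resampling re-creates the missing days as NaN, which interpolate then fills.
filled = gappy.resample("D").interpolate(method="linear")
assert np.allclose(filled["Val"].to_numpy(), full["Val"].to_numpy())

# Reindexing WITHOUT fill_value also leaves NaN in the gaps, so interpolation works.
reindexed = gappy.reindex(rng2).interpolate(method="linear")
assert np.allclose(reindexed["Val"].to_numpy(), full["Val"].to_numpy())

print(filled)
```

Note that resample only spans the range between the first and last existing dates, so the reindex variant may be the better fit when the reference calendar extends beyond the data; in that case the leading or trailing days need separate handling, since linear interpolation has no value on one side.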
