Pandas: resample and interpolate a time series based on a datetime index

Question

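I have a recurring problem that I never manage to solve elegantly, and I can't find a good way to handle it. Suppose I have a DataFrame with a datetime index at 3-hour intervals (df1), and another DataFrame at daily intervals (df2).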

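I want to do two things:

1. Resample df1 to daily instead of 3-hourly by computing the mean of the 3-hour periods within each day.
2. Interpolate df2 for any missing days, and insert each of those days where it belongs.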

The problem: I use for loops (which I would like to avoid), and the resampling for missing days is incomplete (it can only assign a single value).

This is how I do it:

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Create df1
rng = pd.date_range('2000-01-01', periods=365 * 8, freq='3h')  # 8 three-hour periods per day; periods must be an int
df1 = pd.DataFrame({'Val': np.random.randn(len(rng)) }, index = rng)

# Create df2 and drop a few rows
rng2 = pd.date_range('2000-01-01', periods=365, freq='D')
df2 = pd.DataFrame({'Val': np.random.randn(len(rng2)) },index = rng2)
df2 = df2.drop([datetime(2000,1,5),datetime(2000,1,24)])

# Create reference timelist 
date_list = [datetime(2000,1,1) + timedelta(days=x) for x in range(365)]


# Calculate the daily mean of df1:
# We create an array hosting the resampled values of df1
arr = []
c = 1

# Loop over df1; each time we reach the start of a new day, append the mean of the day that just ended
for i in range(1, len(df1)):

    if c < 365 and df1.index[i] == date_list[c]:
        arr.append(df1['Val'].iloc[i-8:i].mean())
        c = c + 1

# Calculate the mean of the last day
arr.append(df1['Val'].iloc[i-7:i+1].mean())

# Create a new dataframe hosting the daily values from df1
df3 = pd.DataFrame({'Val': arr}, index = rng2)


# Replace missing days in df2
df2 = df2.reindex(date_list, fill_value=0)
df2 = df2.resample('D').interpolate(method='linear') # but this does not work

Answer 1

I think both problems have simple fixes; you just need to change how you use resample in each case.

Point 1: just resample

Your first point is exactly a case of downsampling with resample. You can replace the entire creation of df3 with:

df1.resample('D').mean()

This averages all the 3-hour periods within each day. To confirm, we can check that your result matches mine:

>>> bool((df1.resample('D').mean().round(8) == df3.round(8)).all().all())
True

Note that I had to round because there are small floating-point differences between your code and resample, but the results are very close.
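
As a minimal sketch of what the daily downsampling does, the following builds a tiny two-day, 3-hourly frame and compares resample('D').mean() with a manual group-by on the calendar day. The names small, daily and manual are illustrative and do not appear in the question; the group-by is only a cross-check, not a suggested replacement.

```python
import numpy as np
import pandas as pd

# Two days of 3-hourly values: 0..7 on the first day, 8..15 on the second
rng = pd.date_range('2000-01-01', periods=16, freq='3h')
small = pd.DataFrame({'Val': np.arange(16, dtype=float)}, index=rng)

# Downsample to one row per day, averaging the eight 3-hour periods
daily = small.resample('D').mean()

# Cross-check: group rows by their calendar day and average them
manual = small.groupby(small.index.normalize()).mean()

print(daily)
# Val is 3.5 for 2000-01-01 and 11.5 for 2000-01-02
```

Because resample bins on the index itself, no reference date list or counter is needed.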

Point 2: don't reindex first

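When you interpolate to fill in the missing days in the second case, you still need those days to be missing so there is something to fill. In other words, if you reindex first and fill the new rows with 0, the interpolation "fails" because no NaN values are left for it to interpolate. So, if I understand your problem correctly, you just want to remove the reindex line: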

# df2 = df2.reindex(date_list, fill_value=0)
df2 = df2.resample('D').interpolate(method='linear')

So if you start with df2 like this:

>>> df2.head(10)
                 Val
2000-01-01  0.235151
2000-01-02  1.279017
2000-01-03 -1.267074
2000-01-04 -0.270182 # the fifth is missing
2000-01-06  0.382649
2000-01-07  0.120253
2000-01-08 -0.223690
2000-01-09  1.379003
2000-01-10 -0.477681
2000-01-11  0.619466

you end up with this:

>>> df2.head(10)
                 Val
2000-01-01  0.235151
2000-01-02  1.279017
2000-01-03 -1.267074
2000-01-04 -0.270182
2000-01-05  0.056233 # the fifth is here, halfway between 4th and 6th
2000-01-06  0.382649
2000-01-07  0.120253
2000-01-08 -0.223690
2000-01-09  1.379003
2000-01-10 -0.477681
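
The difference between the two approaches can be sketched on a five-day toy frame. This is an illustration under assumptions, not the asker's data: demo, zero_filled, via_resample and via_reindex are made-up names. The last variant (reindex without fill_value, then interpolate) is not part of the answer above; it is included only to show that the issue lies in filling with 0, not in reindexing as such.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=5, freq='D')
demo = pd.DataFrame({'Val': [1.0, 2.0, 3.0, 4.0, 5.0]}, index=rng)
demo = demo.drop(pd.Timestamp('2000-01-03'))  # simulate a missing day

# Filling the gap with 0 first leaves no NaN, so interpolate() changes nothing
zero_filled = demo.reindex(rng, fill_value=0).interpolate(method='linear')

# Upsampling to daily frequency inserts the missing day as NaN, then fills it
via_resample = demo.resample('D').interpolate(method='linear')

# Reindexing without fill_value also leaves NaN for interpolate() to fill
via_reindex = demo.reindex(rng).interpolate(method='linear')

print(zero_filled.loc[pd.Timestamp('2000-01-03'), 'Val'])   # 0.0
print(via_resample.loc[pd.Timestamp('2000-01-03'), 'Val'])  # 3.0
```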

Answer 1
2 Accepted 2021-10-22 18:37:12
