How to resample a df with datetime index to exactly n equally sized periods?

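I have a large dataframe with a datetime index and need to resample the data into exactly 10 equally sized periods.

So far, I have tried finding the first and last dates to determine the total number of days in the data, dividing that by 10 to determine the size of each period, and then resampling using that number of days. For example: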

first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
periodsize = str((last - first).days // 10) + 'D'

df.resample(periodsize).sum()

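Since periodsize is an int that has been rounded down, this does not guarantee exactly 10 periods in the resampled df. Using a float does not work in resample. It seems that I am either missing something simple here or attacking the problem the wrong way.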
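To make the problem concrete, here is a minimal sketch using made-up data (a hypothetical 34-row daily frame named `demo`, not the data from the question). The span is 33 days, so the whole-day period size drops the fractional 0.3 day and resample ends up with more than 10 groups:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 34 daily rows spanning exactly 33 days.
idx = pd.date_range('2015-01-01', periods=34, freq='D')
demo = pd.DataFrame({'value': np.ones(len(idx))}, index=idx)

n = 10
span_days = (demo.index.max() - demo.index.min()).days   # 33
periodsize = str(span_days // n) + 'D'                    # '3D': the 0.3 day is lost

groups = demo.resample(periodsize).sum()
print(periodsize, len(groups))   # the group count is larger than n
```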

Here is one way to ensure equal-size sub-periods: use np.linspace() on pd.Timedelta, then classify each observation into a bin with pd.cut.

import pandas as pd
import numpy as np

# generate artificial data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8H'))

Out[87]: 
                          A       B
2015-01-01 00:00:00  1.7641  0.4002
2015-01-01 08:00:00  0.9787  2.2409
2015-01-01 16:00:00  1.8676 -0.9773
2015-01-02 00:00:00  0.9501 -0.1514
2015-01-02 08:00:00 -0.1032  0.4106
2015-01-02 16:00:00  0.1440  1.4543
2015-01-03 00:00:00  0.7610  0.1217
2015-01-03 08:00:00  0.4439  0.3337
2015-01-03 16:00:00  1.4941 -0.2052
2015-01-04 00:00:00  0.3131 -0.8541
2015-01-04 08:00:00 -2.5530  0.6536
2015-01-04 16:00:00  0.8644 -0.7422
2015-01-05 00:00:00  2.2698 -1.4544
2015-01-05 08:00:00  0.0458 -0.1872
2015-01-05 16:00:00  1.5328  1.4694
...                     ...     ...
2015-01-29 08:00:00  0.9209  0.3187
2015-01-29 16:00:00  0.8568 -0.6510
2015-01-30 00:00:00 -1.0342  0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555  0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00  0.6252 -1.6021
2015-02-01 00:00:00 -1.1044  0.0522
2015-02-01 08:00:00 -0.7396  1.5430
2015-02-01 16:00:00 -1.2929  0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00  0.5233 -0.1715
2015-02-02 16:00:00  0.7718  0.8235
2015-02-03 00:00:00  2.1632  1.3365

[100 rows x 2 columns]


# cut-off points: n equal-size groups require n + 1 points
# offsets are measured in hours from the first timestamp
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)
# labels: the cut-off points converted back to timestamps
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])

# create categorical reference variables, labelled by the start of each period
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=n, labels=time_index[:-1])
# for clarity, also label each bin by the end of its period
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=n, labels=time_index[1:])

Out[89]: 
                          A       B    start_time_index      end_time_index
2015-01-01 00:00:00  1.7641  0.4002 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 08:00:00  0.9787  2.2409 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 16:00:00  1.8676 -0.9773 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 00:00:00  0.9501 -0.1514 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032  0.4106 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 16:00:00  0.1440  1.4543 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 00:00:00  0.7610  0.1217 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 08:00:00  0.4439  0.3337 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 16:00:00  1.4941 -0.2052 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 00:00:00  0.3131 -0.8541 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530  0.6536 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-04 16:00:00  0.8644 -0.7422 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 00:00:00  2.2698 -1.4544 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 08:00:00  0.0458 -0.1872 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 16:00:00  1.5328  1.4694 2015-01-04 07:12:00 2015-01-07 14:24:00
...                     ...     ...                 ...                 ...
2015-01-29 08:00:00  0.9209  0.3187 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-29 16:00:00  0.8568 -0.6510 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342  0.6816 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555  0.0175 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 16:00:00  0.6252 -1.6021 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044  0.0522 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396  1.5430 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929  0.2671 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 08:00:00  0.5233 -0.1715 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 16:00:00  0.7718  0.8235 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-03 00:00:00  2.1632  1.3365 2015-01-30 16:48:00 2015-02-03 00:00:00

[100 rows x 4 columns]

# select the numeric columns explicitly; the categorical label column cannot be summed
df.groupby('start_time_index')[['A', 'B']].sum()

Out[90]: 
                          A       B
start_time_index                   
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -0.0902 -2.5178

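Another, possibly shorter, way to do this is to specify the sampling frequency as a time delta. The problem, as shown below, is that it delivers 11 sub-samples instead of 10. I believe the reason is that resample implements a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) sub-sampling scheme, so that the very last observation at '2015-02-03 00:00:00' is treated as a separate group. If we use pd.cut to do it ourselves, we can specify include_lowest=True so that it gives us exactly 10 sub-samples rather than 11.

n = 10
# width of one period as a Timedelta: total span divided by n
period = (df.index[-1] - df.index[0]) / n
df[['A', 'B']].resample(period).sum()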

Out[114]: 
                          A       B
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -2.2534 -3.8543
2015-02-03 00:00:00  2.1632  1.3365
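The pd.cut variant mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the names `data`, `hours`, `edges` and `bin_id` are made up for the example, the bin edges are passed explicitly instead of as an integer bin count, and the groups are identified by bin number rather than by timestamp labels.

```python
import numpy as np
import pandas as pd

# Same kind of artificial data as above: 100 rows, 8 hours apart.
np.random.seed(0)
idx = pd.date_range('2015-01-01', periods=100, freq=pd.Timedelta(hours=8))
data = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=idx)

n = 10
# n + 1 evenly spaced edges, in hours since the first timestamp
hours = np.asarray((data.index - data.index[0]) / pd.Timedelta('1h'))
edges = np.linspace(0, hours[-1], n + 1)

# Bins are right-closed (a, b] by default, so the last observation sits
# inside the last bin; include_lowest=True also closes the first bin on
# the left so that the very first observation is not dropped.
bin_id = pd.cut(hours, bins=edges, include_lowest=True, labels=False)

sums = data.groupby(bin_id).sum()
print(len(sums))   # 10
```

Without include_lowest=True the first row falls outside every bin and gets a missing bin number, so it would silently drop out of the groupby.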

import numpy as np
import pandas as pd

n = 10
nrows = 33
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
print(df)
#             0
# 2000-01-01  1
# 2000-01-02  1
# ...
# 2000-02-01  1
# 2000-02-02  1

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
secs = int((last-first).total_seconds()//n)
periodsize = '{:d}s'.format(secs)

result = df.resample(periodsize).sum()
print('\n{}'.format(result))
assert len(result) == n

yields

                     0
2000-01-01 00:00:00  4
2000-01-04 07:12:00  3
2000-01-07 14:24:00  3
2000-01-10 21:36:00  4
2000-01-14 04:48:00  3
2000-01-17 12:00:00  3
2000-01-20 19:12:00  4
2000-01-24 02:24:00  3
2000-01-27 09:36:00  3
2000-01-30 16:48:00  3

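The values in the 0 column indicate the number of rows that were aggregated, since the original DataFrame was filled with values of 1. The pattern of 4s and 3s is about as even as we can get, since 33 rows cannot be evenly grouped into 10 groups.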


Explanation: Consider this simpler DataFrame:

n = 2
nrows = 5
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
#             0
# 2000-01-01  1
# 2000-01-02  1
# 2000-01-03  1
# 2000-01-04  1
# 2000-01-05  1

Using df.resample('2D').sum() gives the wrong number of groups:

In [366]: df.resample('2D').sum()
Out[366]: 
            0
2000-01-01  2
2000-01-03  2
2000-01-05  1

Using df.resample('3D').sum() gives the right number of groups, but the second group starts at 2000-01-04, which does not evenly divide the DataFrame into two equally spaced groups:

In [367]: df.resample('3D').sum()
Out[367]: 
            0
2000-01-01  3
2000-01-04  2

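To do better, we need to work at a finer time resolution than days. Since Timedeltas have a total_seconds method, let's work in seconds. So for the example above, the desired frequency string would be

In [374]: df.resample('216000s').sum()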
Out[374]: 
                     0
2000-01-01 00:00:00  3
2000-01-03 12:00:00  2

since there are 216000 * 2 seconds in 5 days:

In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1s'))/2
Out[373]: 216000.0
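As a side note, the same bin width can also be handed to resample as a Timedelta instead of a frequency string. A small sketch, assuming the 5-row frame from above (rebuilt here under the made-up name `df5` so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

index = pd.date_range('2000-01-01', periods=5, freq='D')
df5 = pd.DataFrame(np.ones(5), index=index)

# Half of five days: 2 days 12:00:00, which is 216000 seconds.
half = pd.Timedelta(days=5) / 2
print(half.total_seconds())   # 216000.0

result = df5.resample(half).sum()
print(result)
```

This avoids building the frequency string by hand, at the cost of hiding the number of seconds that the explanation here is built around.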

Okay, so now all we need is a way to generalize this. We'll want the minimum and maximum dates in the index:

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')

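We add an extra day because it makes the difference in days come out right. In the example above, there are only 4 days between the Timestamps for 2000-01-05 and 2000-01-01,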

In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
Out[378]: 4

But as we can see in the worked example, the DataFrame has 5 rows which represent 5 days. So it makes sense that we need to add an extra day.
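The effect of the extra day can be checked directly. The following is a small sketch on the same 5-row frame; `n_groups` is a hypothetical helper written for this illustration, not part of pandas:

```python
import numpy as np
import pandas as pd

index = pd.date_range('2000-01-01', periods=5, freq='D')
df5 = pd.DataFrame(np.ones(5), index=index)
n = 2

def n_groups(first, last):
    # Split the span last - first into n equal parts (in whole seconds)
    # and count how many groups resample actually produces.
    secs = int((last - first).total_seconds() // n)
    return len(df5.resample(pd.Timedelta(seconds=secs)).sum())

unpadded = n_groups(index.min(), index.max())
padded = n_groups(index.min(), index.max() + pd.Timedelta(days=1))
print(unpadded, padded)   # 3 2
```

Without the padding the bins are exactly 2 days wide, so the last row lands on a bin edge and opens a third group.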

Now we can compute the correct number of seconds in each equally spaced group with:

secs = int((last-first).total_seconds()//n)
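Putting the steps together, the whole recipe might be wrapped up roughly like this. It is a sketch under assumptions rather than a definitive implementation: the function name `resample_sum_n` and its `step` parameter are made up for illustration, `step` is assumed to be the spacing between observations (one day in the examples here), and `origin='start'` requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

def resample_sum_n(df, n, step=pd.Timedelta(days=1)):
    """Sum df into n equally long bins (hypothetical helper).

    `step` is the assumed spacing between observations; it pads the end
    of the range so that the last row falls inside the last bin.
    """
    first = df.index.min()
    last = df.index.max() + step
    secs = int((last - first).total_seconds() // n)
    # origin='start' anchors the first bin at the first timestamp, which
    # matters when the data does not begin at midnight.
    return df.resample(pd.Timedelta(seconds=secs), origin='start').sum()

# Usage with the 33-row example from the top of this answer.
index = pd.date_range('2000-01-01', periods=33, freq='D')
demo = pd.DataFrame({'rows': np.ones(33)}, index=index)
result = resample_sum_n(demo, 10)
print(len(result))   # 10
```

Because the bin width is rounded down to whole seconds, this relies on `step` being comfortably larger than n seconds; for very finely sampled data the padding would need more care.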
