Calculate TimeDelta within certain frequency in pandas DataFrame
I have a pandas DataFrame like this (the actual DataFrame has thousands of rows):
td
2011-08-14 09:09:14 00:00:13
2011-08-14 09:09:27 00:02:25
2011-08-14 09:11:52 00:00:05
2011-08-14 09:11:57 00:20:41
2011-08-14 09:32:38 00:03:05
2011-08-14 09:35:43 00:05:44
2011-08-14 09:41:27 00:07:07
2011-08-14 09:48:34 00:01:51
2011-08-14 09:50:25 00:06:08
2011-08-14 09:56:33 01:08:39
2011-08-14 11:05:12 00:04:51
2011-08-14 11:10:03 00:06:36
2011-08-14 11:16:39 00:00:13
2011-08-14 11:16:52 00:18:25
2011-08-14 11:35:17 00:00:05
2011-08-14 11:35:22 00:24:24
2011-08-14 11:59:46 00:27:44
Now I want to resample the index to hours and sum the time deltas, which gives:
2011-08-14 09:00:00 01:55:58
2011-08-14 10:00:00 00:00:00
2011-08-14 11:00:00 01:22:18
Freq: H, Name: td, dtype: timedelta64[ns]
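However, I need the resulting time deltas to be aligned with the frequency, so in this example no bucket may exceed one hour. The desired result should look like this: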
2011-08-14 09:00:00 01:00:00
2011-08-14 10:00:00 00:55:58 # <- carryover from previous row
2011-08-14 11:00:00 01:00:00
2011-08-14 12:00:00 00:22:18 # <- carryover from previous row
Freq: H, Name: td, dtype: timedelta64[ns]
Here is a simple code snippet:
import pandas as pd

index = [
    '2011-08-14 09:09:14',
    '2011-08-14 09:09:27',
    '2011-08-14 09:11:52',
    '2011-08-14 09:11:57',
    '2011-08-14 09:32:38',
    '2011-08-14 09:35:43',
    '2011-08-14 09:41:27',
    '2011-08-14 09:48:34',
    '2011-08-14 09:50:25',
    '2011-08-14 09:56:33',
    '2011-08-14 11:05:12',
    '2011-08-14 11:10:03',
    '2011-08-14 11:16:39',
    '2011-08-14 11:16:52',
    '2011-08-14 11:35:17',
    '2011-08-14 11:35:22',
    '2011-08-14 11:59:46',
    '2011-08-14 11:59:46'
]
data = [
    13000000000,
    145000000000,
    5000000000,
    1241000000000,
    185000000000,
    344000000000,
    427000000000,
    111000000000,
    368000000000,
    4119000000000,
    291000000000,
    396000000000,
    13000000000,
    1105000000000,
    5000000000,
    1464000000000,
    1664000000000,
    0
]
df = pd.DataFrame(data, columns=['td'], index=pd.DatetimeIndex(index), dtype='timedelta64[ns]')
print(df)
print(df.resample('H').td.sum())
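Here is my solution. Essentially, for each row you add the carry-over from the previous row (its time delta minus 1 hour), and then cap the previous row's time delta at 1 hour.
Finally, you may also need to extend the list if the last time delta is more than 1 hour.
The code could be DRYer, but this should set you on the right path: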
resampled = df.resample('H').td.sum()
one_hour = pd.Timedelta('1 hours')

# Initialise output as plain lists of pd.Timedelta / pd.Timestamp,
# since the values are modified in place below
out = resampled.tolist()
extended_idx = resampled.index.tolist()

def carry_over(td):
    # Carry-over is whatever exceeds 1 hour (zero if nothing does).
    # Plain subtraction also handles time deltas of a day or more.
    return max(td - one_hour, pd.Timedelta(0))

# Carry over: push each row's excess into the next row, then cap the row
for idx in range(1, len(out)):
    prev = out[idx-1]
    out[idx] += carry_over(prev)
    out[idx-1] = min(prev, one_hour)

# Extend the list while the last time delta is more than 1 hour
while out[-1] > one_hour:
    extended_idx.append(extended_idx[-1] + one_hour)
    out.append(carry_over(out[-1]))
    out[-2] = min(out[-2], one_hour)

out = pd.Series(out, index=extended_idx)
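The carry-over rule described above can be checked on plain integers before involving pandas. The following is only a minimal sketch: the function name `spread_with_carry` is made up for illustration, and the input is the hourly totals from the question converted to seconds (01:55:58, 00:00:00, 01:22:18).

```python
def spread_with_carry(totals, limit):
    """Cap each bucket at `limit`, pushing the excess into the next bucket."""
    out = []
    carry = 0
    for total in totals:
        total += carry                  # add what the previous bucket could not hold
        carry = max(total - limit, 0)   # excess to hand on to the next bucket
        out.append(min(total, limit))
    # Keep appending buckets until the leftover is used up
    while carry > 0:
        out.append(min(carry, limit))
        carry = max(carry - limit, 0)
    return out

hourly_seconds = [6958, 0, 4938]        # 01:55:58, 00:00:00, 01:22:18
result = spread_with_carry(hourly_seconds, 3600)
print(result)  # [3600, 3358, 3600, 1338]
```

3358 seconds is 00:55:58 and 1338 seconds is 00:22:18, which matches the desired output in the question.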
A semi-vectorized approach:
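df2 = df.resample('H').td.sum().fillna(pd.Timedelta(0))
limit = pd.Timedelta(hours=1)
while (df2 > limit).any():
    # df3 holds, for each row, the value of the previous row
    df3 = df2.shift()
    last = df2.index[-1]
    if df2[last] > limit:
        # The last row overflows: append a new row holding its excess
        df2[last + limit] = df2[last] - limit
        df3[last + limit] = df2[last] - limit
    # Rows whose predecessor exceeded the limit receive a carry-over
    carry_over = df3 > limit
    df2.loc[df2 > limit] = limit
    df2[carry_over] = df2[carry_over] + df3.loc[carry_over] - limit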
2011-08-14 09:00:00 01:00:00
2011-08-14 10:00:00 00:55:58
2011-08-14 11:00:00 01:00:00
2011-08-14 12:00:00 00:22:18
Freq: H, Name: td, dtype: timedelta64[ns]
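Both answers hard-code the 1-hour limit. One possible generalisation is a small helper that takes the limit as a parameter, sketched below. The name `cap_and_carry` is hypothetical, and the sketch assumes the input is an already-resampled timedelta Series without NaT values whose index spacing equals the limit. It is a plain loop, not a vectorized solution.

```python
import pandas as pd

def cap_and_carry(resampled, limit):
    """Cap each bucket of a timedelta Series at `limit`, carrying the excess forward."""
    zero = pd.Timedelta(0)
    values, index = [], list(resampled.index)
    carry = zero
    for td in resampled:                 # iterating yields pd.Timedelta objects
        total = td + carry
        values.append(min(total, limit))
        carry = max(total - limit, zero)
    # Append extra buckets while there is still excess left over
    while carry > zero:
        index.append(index[-1] + limit)
        values.append(min(carry, limit))
        carry = max(carry - limit, zero)
    return pd.Series(values, index=pd.DatetimeIndex(index), name=resampled.name)

# The hourly sums from the question, built directly rather than via resample
idx = pd.date_range('2011-08-14 09:00:00', periods=3, freq=pd.Timedelta(hours=1))
hourly = pd.Series(
    [pd.Timedelta('01:55:58'), pd.Timedelta('00:00:00'), pd.Timedelta('01:22:18')],
    index=idx, name='td')
result = cap_and_carry(hourly, pd.Timedelta(hours=1))
print(result)
```

Because the limit is passed in, the same helper could be used with, for example, a 30-minute resample and `pd.Timedelta(minutes=30)`. The total duration is preserved, so comparing `result.sum()` with `hourly.sum()` is a cheap sanity check.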