Filtering a pandas DataFrame based on value and time
I have a pandas DataFrame like this:
2011-5-5 12:43 noEvent CarA otherColumns...
2011-5-5 12:45 noEvent CarA ...
2011-5-5 12:49 EVENT CarA ...
2011-5-5 12:51 noEvent CarA ...
(no data - jumps in time)
2011-5-6 12:52 EVENT CarA ...
2011-5-6 12:59 noEvent CarA ...
2011-5-6 13:00 noEvent CarA ...
2011-5-5 12:43 noEvent CarB ...
2011-5-5 12:45 noEvent CarB ...
2011-5-5 12:49 noEvent CarB ...
2011-5-5 12:51 noEvent CarB ...
(no data - jumps in time)
2011-5-6 12:52 noEvent CarB ...
2011-5-6 12:52 EVENT CarB ...
2011-5-6 13:00 noEvent CarB ...
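Explanation:
For each car, I need to do some calculations on the rows within +/- 2 minutes of each event.
This is where I am stuck: how do I filter this DataFrame?
The desired result looks like this: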
-2min
2011-5-5 12:49 EVENT CarA ...
+2min
-2min
2011-5-6 12:52 EVENT CarA ...
+2min
-2min
2011-5-6 12:52 EVENT CarB ...
+2min
Some information:
I don't know where to start.
First group by the Car column, then process each group as follows.
First, create some test data:
import pandas as pd
import numpy as np
np.random.seed(1)
idx = pd.date_range("2016-03-01 10:00:00", "2016-03-01 20:00:00", freq="s")
idx = idx[np.random.randint(0, len(idx), 10000)].sort_values()
evt = np.array(["no event", "event"])[(np.random.rand(len(idx)) < 0.0005).astype(int)]
df = pd.DataFrame({"event":evt, "value":np.random.randint(0, 10, len(evt))}, index=idx)
Find the event rows and the positions in the index that bound the +/- 10 second window around each one:
event_time = df.index[df.event == "event"]
delta = pd.Timedelta(10, unit="s")
start_idx = df.index.searchsorted(event_time - delta).tolist()
end_idx = df.index.searchsorted(event_time + delta).tolist()
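As a side note on the window boundaries: searchsorted uses side="left" by default, so the end position computed above excludes a row that falls exactly at event_time + delta. The following is a minimal sketch on a hypothetical four-row index (not the test data above) showing the difference; whether the upper bound should be inclusive is an assumption to settle for your own data.

```python
import pandas as pd

# Hypothetical toy index, used only to illustrate the boundary behaviour.
idx = pd.DatetimeIndex([
    "2016-03-01 10:00:00",
    "2016-03-01 10:00:05",
    "2016-03-01 10:00:10",
    "2016-03-01 10:00:20",
])
event_time = pd.Timestamp("2016-03-01 10:00:10")
delta = pd.Timedelta(10, unit="s")

# side="left" is the default: first position whose value is >= the target.
start = idx.searchsorted(event_time - delta)
end_exclusive = idx.searchsorted(event_time + delta)

# side="right": first position whose value is > the target, so a row
# exactly at event_time + delta is kept inside the slice.
end_inclusive = idx.searchsorted(event_time + delta, side="right")

window_exclusive = idx[start:end_exclusive]   # drops the 10:00:20 row
window_inclusive = idx[start:end_inclusive]   # keeps the 10:00:20 row
```

With side="right" on the end position, the slice covers the closed interval [event_time - delta, event_time + delta].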
Create the mask arrays:
mask = np.zeros(df.shape[0], dtype=bool)
evt_id = np.zeros(df.shape[0], dtype=int)
for i, (s, e) in enumerate(zip(start_idx, end_idx)):
    mask[s:e] = True
    evt_id[s:e] = i
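Use the mask array to filter the rows; here I also create an event_id column so the rows can be grouped by event:
df_event = df[mask].copy()  # copy so that adding a column does not raise SettingWithCopyWarning
df_event["event_id"] = evt_id[mask]
Output: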
event value event_id
2016-03-01 13:51:48 no event 0 0
2016-03-01 13:51:51 event 8 0
2016-03-01 13:51:53 no event 3 0
2016-03-01 13:52:00 no event 1 0
2016-03-01 14:21:00 no event 2 1
2016-03-01 14:21:00 no event 5 1
2016-03-01 14:21:00 no event 0 1
2016-03-01 14:21:02 no event 1 1
2016-03-01 14:21:04 no event 2 1
2016-03-01 14:21:06 no event 0 1
2016-03-01 14:21:07 event 1 1
2016-03-01 14:21:16 no event 1 1
2016-03-01 14:21:16 no event 9 1
2016-03-01 15:09:42 no event 1 2
2016-03-01 15:09:49 event 7 2
2016-03-01 15:09:54 no event 3 2
2016-03-01 15:09:55 no event 3 2
2016-03-01 15:09:58 no event 5 2
2016-03-01 15:09:58 no event 9 2
2016-03-01 17:36:44 no event 8 3
2016-03-01 17:36:44 no event 2 3
2016-03-01 17:36:44 no event 9 3
2016-03-01 17:36:45 no event 2 3
2016-03-01 17:36:49 event 9 3
2016-03-01 17:36:50 no event 6 3
2016-03-01 17:36:54 no event 1 3
2016-03-01 17:36:56 no event 1 3
2016-03-01 18:51:37 no event 5 4
2016-03-01 18:51:37 no event 3 4
2016-03-01 18:51:42 no event 0 4
2016-03-01 18:51:47 event 9 4
2016-03-01 18:51:55 no event 4 4
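The steps above work on a single time series, while the answer opens by suggesting a group-by on the car column. A minimal sketch of how the two might be combined is shown below. It assumes a DataFrame indexed by timestamp with Event and Car columns as in the question; the helper name rows_near_events is hypothetical, and the inclusive upper bound (side="right") is a choice made here, not part of the original answer.

```python
import numpy as np
import pandas as pd


def rows_near_events(group, delta, event_label="EVENT"):
    """Return the rows of one car that lie within +/- delta of any event."""
    # searchsorted needs a sorted index; mergesort keeps ties in input order.
    group = group.sort_index(kind="mergesort")
    event_time = group.index[group["Event"] == event_label]
    start_idx = group.index.searchsorted(event_time - delta, side="left")
    end_idx = group.index.searchsorted(event_time + delta, side="right")

    mask = np.zeros(len(group), dtype=bool)
    evt_id = np.zeros(len(group), dtype=int)
    for i, (s, e) in enumerate(zip(start_idx, end_idx)):
        mask[s:e] = True
        evt_id[s:e] = i  # if windows overlap, the later event's id wins

    out = group[mask].copy()
    out["event_id"] = evt_id[mask]
    return out


# A small subset of the question's data, indexed by timestamp.
df = pd.DataFrame(
    {
        "Event": ["noEvent", "EVENT", "noEvent", "noEvent", "EVENT", "noEvent"],
        "Car": ["CarA", "CarA", "CarA", "CarB", "CarB", "CarB"],
    },
    index=pd.to_datetime(
        [
            "2011-05-05 12:45", "2011-05-05 12:49", "2011-05-05 12:51",
            "2011-05-06 12:52", "2011-05-06 12:52", "2011-05-06 13:00",
        ]
    ),
)

delta = pd.Timedelta(2, unit="min")
# Process each car separately, then stitch the pieces back together.
pieces = [rows_near_events(g, delta) for _, g in df.groupby("Car")]
result = pd.concat(pieces)
```

Note that event_id restarts at 0 for every car, so grouping the result by both Car and event_id identifies a single event window.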
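Consider a cross join merge that pairs the DataFrame filtered to EVENT rows with the full DataFrame. Then subset the records that fall within +/- 2 minutes of an event for the same car:
DataFrame setup (using the posted example data)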
import pandas as pd
import datetime
df = pd.DataFrame({'Date': ['5/5/2011 12:43', '5/5/2011 12:45', '5/5/2011 12:49',
'5/5/2011 12:51', '5/6/2011 12:52', '5/6/2011 12:59',
'5/6/2011 13:00', '5/5/2011 12:43', '5/5/2011 12:45',
'5/5/2011 12:49', '5/5/2011 12:51', '5/6/2011 12:52',
'5/6/2011 12:52', '5/6/2011 13:00'],
'Event': ['noEvent', 'noEvent', 'EVENT', 'noEvent','EVENT',
'noEvent', 'noEvent', 'noEvent', 'noEvent', 'noEvent',
'noEvent', 'noEvent', 'EVENT', 'noEvent'],
'Car': ['CarA', 'CarA', 'CarA', 'CarA', 'CarA',
'CarA', 'CarA', 'CarB', 'CarB','CarB',
'CarB', 'CarB', 'CarB', 'CarB']})
df['Date'] = pd.to_datetime(df['Date'])
# Car Date Event
# 0 CarA 2011-05-05 12:43:00 noEvent
# 1 CarA 2011-05-05 12:45:00 noEvent
# 2 CarA 2011-05-05 12:49:00 EVENT
# 3 CarA 2011-05-05 12:51:00 noEvent
# 4 CarA 2011-05-06 12:52:00 EVENT
# 5 CarA 2011-05-06 12:59:00 noEvent
# 6 CarA 2011-05-06 13:00:00 noEvent
# 7 CarB 2011-05-05 12:43:00 noEvent
# 8 CarB 2011-05-05 12:45:00 noEvent
# 9 CarB 2011-05-05 12:49:00 noEvent
# 10 CarB 2011-05-05 12:51:00 noEvent
# 11 CarB 2011-05-06 12:52:00 noEvent
# 12 CarB 2011-05-06 12:52:00 EVENT
# 13 CarB 2011-05-06 13:00:00 noEvent
Cross join (returns the full set of M x N pairings between the two DataFrames)
df['key'] = 1
# EVENTS DF
eventsdf = df[df['Event']=='EVENT']
# CROSS JOIN DF
crossdf = pd.merge(df, eventsdf, on='key')
crossdf = crossdf[((crossdf['Date_x'] <= crossdf['Date_y']
+ datetime.timedelta(minutes=2)) &
(crossdf['Date_x'] >= crossdf['Date_y']
- datetime.timedelta(minutes=2))) &
(crossdf['Car_x'] == crossdf['Car_y'])].sort_values('Date_x')
finaldf = crossdf[['Car_x', 'Date_x', 'Event_x']].drop_duplicates().sort_values('Car_x')
finaldf.columns = ['Car', 'Date', 'Event']
# Car Date Event
# 6 CarA 2011-05-05 12:49:00 EVENT
# 9 CarA 2011-05-05 12:51:00 noEvent
# 13 CarA 2011-05-06 12:52:00 EVENT
# 35 CarB 2011-05-06 12:52:00 noEvent
# 38 CarB 2011-05-06 12:52:00 EVENT
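Because the filter above only keeps pairs from the same car, a possible variation is to merge on the Car column directly, so rows from different cars are never paired in the first place. This is a sketch under that assumption, using a small subset of the posted data; the names events, paired, near and EventDate are illustrative and not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2011-05-05 12:45", "2011-05-05 12:49", "2011-05-05 12:51",
        "2011-05-06 12:52", "2011-05-06 12:52", "2011-05-06 13:00",
    ]),
    "Event": ["noEvent", "EVENT", "noEvent", "noEvent", "EVENT", "noEvent"],
    "Car": ["CarA", "CarA", "CarA", "CarB", "CarB", "CarB"],
})
window = pd.Timedelta(minutes=2)

# One row per event, keeping only the join key and the event timestamp.
events = (
    df.loc[df["Event"] == "EVENT", ["Car", "Date"]]
    .rename(columns={"Date": "EventDate"})
)

# Pair every row with the events of the same car only.
paired = df.merge(events, on="Car")

# Keep rows whose timestamp is within the window of some event.
near = paired[(paired["Date"] - paired["EventDate"]).abs() <= window]

# A row close to two events appears twice after the merge, so deduplicate.
finaldf = near[["Car", "Date", "Event"]].drop_duplicates().reset_index(drop=True)
```

If a true cross join is still wanted, pandas 1.2 and later accept how="cross" in merge, which removes the need for the helper key column.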