Construct non-overlapping datetime record (start, end datetime) dataframe
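I need to create a dataframe with the overlapping start and end datetimes removed for multiple ids. I will use the start and end datetimes to aggregate values from a high-frequency pandas dataframe, so I need to remove the overlapping datetimes in mst_df.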
import pandas as pd

# Proxy reference dataframe
master = [
    ['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
    ['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'],  # slightly overlapping
    ['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
    ['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'],  # overlapping
    ['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'],  # overlapping
    ['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
    ['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'],  # overlapping
    ['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
    ['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'],  # overlapping
    ['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00'],  # slightly overlapping
]
mst_df = pd.DataFrame(master, columns=['id', 'start', 'end'])
# infer_datetime_format is deprecated since pandas 2.0, so it is omitted here
mst_df['start'] = pd.to_datetime(mst_df['start'])
mst_df['end'] = pd.to_datetime(mst_df['end'])
Desired DataFrame:
id      start                end
site a  2021-07-08 00:00:00  2021-07-08 12:00:00
site a  2021-07-08 17:36:00  2021-07-09 11:40:00
site b  2021-07-08 00:00:00  2021-07-08 10:24:00
site b  2021-07-08 17:32:00  2021-07-09 13:00:00
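I don't know whether pandas has a dedicated function for this. It has Interval.overlaps() to check whether two ranges overlap (it even works with datetime), but I don't see a function that merges two ranges, so the merging still needs its own code. Fortunately that is easy.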
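To illustrate the check mentioned above, here is a minimal sketch using pd.Interval with Timestamp endpoints taken from the sample data. The variable names are illustrative only, and the intervals use the default closed='right' setting.

```python
import pandas as pd

# Three ranges from 'site a' in the sample data
first = pd.Interval(pd.Timestamp('2021-07-08 00:00:00'),
                    pd.Timestamp('2021-07-08 10:56:00'))
second = pd.Interval(pd.Timestamp('2021-07-08 06:00:00'),
                     pd.Timestamp('2021-07-08 12:00:00'))
third = pd.Interval(pd.Timestamp('2021-07-08 17:36:00'),
                    pd.Timestamp('2021-07-09 11:40:00'))

# overlaps() only reports whether two intervals share a point;
# it does not merge them
first_second = first.overlaps(second)
first_third = first.overlaps(third)
print(first_second, first_third)
```

This only answers the yes/no question for one pair of ranges, which is why the merging itself is still written by hand.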
The rows are sorted by start, so two consecutive rows do not overlap when previous_end < next_start. I use this condition in a for loop.
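The condition can be tried in isolation. The helper name starts_after below is hypothetical and only wraps the comparison; the timestamps come from the sample data.

```python
import pandas as pd

def starts_after(previous_end, next_start):
    """Return True when the next row begins strictly after the previous row ends."""
    return previous_end < next_start

first_end = pd.Timestamp('2021-07-08 10:56:00')
overlapping_start = pd.Timestamp('2021-07-08 06:00:00')
separate_start = pd.Timestamp('2021-07-08 17:36:00')

overlap_check = starts_after(first_end, overlapping_start)
separate_check = starts_after(first_end, separate_start)
# Equal timestamps: the strict '<' treats touching rows as overlapping
touching_check = starts_after(first_end, first_end)
print(overlap_check, separate_check, touching_check)
```

Because the comparison is strict, a row that starts exactly when the previous one ends would be merged with it. Using <= instead would keep such rows separate, if that is the behaviour wanted.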
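But first I group by id so that every site is processed separately.
Next I take the first row (previous) and run a loop over the remaining rows (each one called next below), checking previous_end < next_start.
If it is True, then I can put previous in the result list and take next as the new previous to process the rest of the rows.
If it is False, then I create a new range from both rows and use it as previous to process the rest of the rows.
At the end of each group, I add the contents of previous to the list.
After processing all groups, I convert the list to a DataFrame.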
import pandas as pd

# Proxy reference dataframe
master = [
    ['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
    ['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'],  # slightly overlapping
    ['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
    ['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'],  # overlapping
    ['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'],  # overlapping
    ['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
    ['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'],  # overlapping
    ['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
    ['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'],  # overlapping
    ['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00'],  # slightly overlapping
]
mst_df = pd.DataFrame(master, columns=['id', 'start', 'end'])
# infer_datetime_format is deprecated since pandas 2.0, so it is omitted here
mst_df['start'] = pd.to_datetime(mst_df['start'])
mst_df['end'] = pd.to_datetime(mst_df['end'])

# The loop relies on rows being ordered by start within each id
mst_df = mst_df.sort_values(['id', 'start'])

result = []
for val, group in mst_df.groupby('id'):
    # get first row; copy so the edits below do not touch mst_df
    prev = group.iloc[0].copy()
    for idx, item in group.iloc[1:].iterrows():
        if prev['end'] < item['start']:
            # not overlapping - put previous in results and use next as previous
            result.append(prev)
            prev = item
        else:
            # overlapping - create one range (start, end) from both rows
            prev['start'] = min(prev['start'], item['start'])
            prev['end'] = max(prev['end'], item['end'])
    # add the last range when there is no next item
    result.append(prev)

print(pd.DataFrame(result))
Result:
       id               start                 end
0  site a 2021-07-08 00:00:00 2021-07-08 12:00:00
2  site a 2021-07-08 17:36:00 2021-07-09 11:40:00
5  site b 2021-07-08 00:00:00 2021-07-08 10:24:00
7  site b 2021-07-08 17:32:00 2021-07-09 13:00:00
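As an alternative to the explicit loop, the same merge can probably be expressed with groupby, cummax and cumsum. This is only a sketch of a different technique, not the method used in the answer above; the function name merge_overlaps and the helper column names are made up for illustration, and a shortened sample is used.

```python
import pandas as pd

def merge_overlaps(df):
    """Merge overlapping (start, end) rows per id without a Python loop."""
    df = df.sort_values(['id', 'start']).reset_index(drop=True)
    # Latest end seen among the earlier rows of the same id (NaT for the first row)
    prev_max_end = df.groupby('id')['end'].transform(lambda s: s.cummax().shift())
    # A new block starts on the first row of an id, or when the row
    # begins strictly after every earlier row of that id has ended
    new_block = prev_max_end.isna() | (prev_max_end < df['start'])
    block = new_block.cumsum().rename('block')
    return (df.groupby(block)
              .agg(id=('id', 'first'), start=('start', 'min'), end=('end', 'max'))
              .reset_index(drop=True))

sample = pd.DataFrame(
    [['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
     ['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'],
     ['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
     ['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
     ['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00']],
    columns=['id', 'start', 'end'])
sample['start'] = pd.to_datetime(sample['start'])
sample['end'] = pd.to_datetime(sample['end'])

merged = merge_overlaps(sample)
print(merged)
```

The running maximum matters here: comparing each row only with the row directly before it would miss a short range nested inside an earlier, longer one. Unlike the loop version, this sketch resets the index, so the row labels differ from the result shown above.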