簡體   English   中英

構造非重疊日期時間記錄(開始、結束日期時間) dataframe

[英]Construct non-overlapping datetime record (start, end datetime) dataframe

我需要創建一個 dataframe 來刪除多個ids的重疊startend日期時間。 我將使用startend日期時間來聚合高頻值 pandas dataframe,因此我需要刪除mst_df中那些重疊的日期時間。

import pandas as pd
 
#Proxy reference dataframe
master = [['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'], 
        ['site a', '2021-07-08 06:00:00',   '2021-07-08 12:00:00'], #slightly overlapping
        ['site a', '2021-07-08 17:36:00',   '2021-07-09 11:40:00'],
        ['site a', '2021-07-08 18:00:00',   '2021-07-09 11:40:00'], #overlapping
        ['site a', '2021-07-09 00:00:00',   '2021-07-09 05:40:00'], #overlapping
        ['site b', '2021-07-08 00:00:00',   '2021-07-08 10:24:00'],
        ['site b', '2021-07-08 06:00:00',   '2021-07-08 10:24:00'], #overlapping
        ['site b', '2021-07-08 17:32:00',   '2021-07-09 11:12:00'],
        ['site b', '2021-07-08 18:00:00',   '2021-07-09 11:12:00'], #overlapping
        ['site b', '2021-07-09 00:00:00',   '2021-07-09 13:00:00']] #slightly overlapping

 
mst_df = pd.DataFrame(master, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)

所需 DataFrame:

    id      start               end
    site a  2021-07-08 00:00:00 2021-07-08 12:00:00
    site a  2021-07-08 17:36:00 2021-07-09 11:40:00
    site b  2021-07-08 00:00:00 2021-07-08 10:24:00
    site b  2021-07-08 17:32:00 2021-07-09 13:00:00

不知道pandas有沒有專門針對這個的function。 它有Interval.overlaping()來檢查兩個范圍是否重疊(它甚至可以與datetime一起工作)但我沒有看到 function 合並這兩個范圍所以它仍然需要自己的代碼來合並。 幸運的是這很容易。


行按start排序,因此當previous_end < next_start時行不會重疊,我在for循環中使用它。

但首先我按site分組以分別處理每個站點。

接下來我得到第一行( previous )並與其他行一起運行循環(如下next)並檢查previous_end < next_start

如果它是True ,那么我可以將previous放在結果列表中,然后像previous一個一樣獲取next一個以處理 rest 行。

如果它是False ,那么我從兩行創建新范圍並使用它來處理 rest 行。

最后,我將previous的內容添加到列表中。

處理完所有組后,我將其全部轉換為 DataFrame。

import pandas as pd
 
#Proxy reference dataframe
master = [
    ['site a', '2021-07-08 00:00:00',   '2021-07-08 10:56:00'], 
    ['site a', '2021-07-08 06:00:00',   '2021-07-08 12:00:00'], # slightly overlapping
    ['site a', '2021-07-08 17:36:00',   '2021-07-09 11:40:00'],
    ['site a', '2021-07-08 18:00:00',   '2021-07-09 11:40:00'], # overlapping
    ['site a', '2021-07-09 00:00:00',   '2021-07-09 05:40:00'], # overlapping
    ['site b', '2021-07-08 00:00:00',   '2021-07-08 10:24:00'],
    ['site b', '2021-07-08 06:00:00',   '2021-07-08 10:24:00'], # overlapping
    ['site b', '2021-07-08 17:32:00',   '2021-07-09 11:12:00'],
    ['site b', '2021-07-08 18:00:00',   '2021-07-09 11:12:00'], # overlapping
    ['site b', '2021-07-09 00:00:00',   '2021-07-09 13:00:00']  # slightly overlapping
]

mst_df = pd.DataFrame(master, columns = ['id', 'start', 'end'])

mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end']   = pd.to_datetime(mst_df['end'], infer_datetime_format=True)

result = []

for val, group in mst_df.groupby('id'):
    
    # get first
    prev = group.iloc[0]
    
    for idx, item in group[1:].iterrows():
        if prev['end'] < item['start']:
            # not overlapping - put previous to results and use next as previous
            result.append(prev)
            prev = item
        else:
            # overlappig - create on range start, end
            prev['start'] = min(prev['start'], item['start'])
            prev['end']   = max(prev['end'], item['end'])
    
    # add when there is no next item
    result.append(prev)

print(pd.DataFrame(result))

結果:

       id               start                 end
0  site a 2021-07-08 00:00:00 2021-07-08 12:00:00
2  site a 2021-07-08 17:36:00 2021-07-09 11:40:00
5  site b 2021-07-08 00:00:00 2021-07-08 10:24:00
7  site b 2021-07-08 17:32:00 2021-07-09 13:00:00

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM