Group by 2 columns and merge rows when the time difference in a third column is less than 2 seconds (Python pandas)

I have this data in my CSV:

A_PERSON,B_PERSON,DATE_TIME,DURATION
190,390,'2020-04-20 12:44:36',323
282,811,'2020-04-06 11:12:24',25
495,414,'2020-04-20 11:22:13',11
827,158,'2020-04-30 13:27:22',22
827,158,'2020-04-30 13:27:44',15

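I am trying to group rows that have the same A_PERSON, B_PERSON when the difference between one row's DATE_TIME + DURATION and the next row's DATE_TIME is less than 2 seconds. For example, the last 2 rows have the same A_PERSON, B_PERSON, and the difference between (DATE_TIME of the second-last row + DURATION of the second-last row) and the DATE_TIME of the last row is less than 2 seconds, so only those last 2 rows should be merged. All other rows should be shown as they are.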
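The merge condition can be checked by hand for the last two rows of the sample. This is only a worked arithmetic sketch; the variable names are illustrative and not part of the original code.

```python
import pandas as pd

# Last two rows of the sample: same A_PERSON (827) and B_PERSON (158).
prev_start = pd.Timestamp("2020-04-30 13:27:22")
prev_end = prev_start + pd.Timedelta(seconds=22)   # DATE_TIME + DURATION of the second-last row
next_start = pd.Timestamp("2020-04-30 13:27:44")   # DATE_TIME of the last row

# Gap between the end of one row and the start of the next, in seconds.
gap_seconds = (next_start - prev_end).total_seconds()

# The rows are merged when the gap is below 2 seconds.
should_merge = gap_seconds < 2

# The merged row keeps the first DATE_TIME and sums the durations.
merged_duration = 22 + 15

print(gap_seconds, should_merge, merged_duration)
```

The summed duration matches the 37 shown in the desired output.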

Desired output

A_PERSON,B_PERSON,DATE_TIME,DURATION
190,390,'2020-04-20 12:44:36',323
282,811,'2020-04-06 11:12:24',25
495,414,'2020-04-20 11:22:13',11
827,158,'2020-04-30 13:27:22',37

So far I have tried this:

import datetime

import pandas as pd

def merger(dataframe: pd.DataFrame) -> pd.DataFrame:
    dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
    dataframe['epoch'] = (dataframe['DATE_TIME'] - datetime.datetime(1970,1,1)).dt.total_seconds()
    mask = dataframe[((dataframe['epoch']) < dataframe['epoch'] + 1 + dataframe['DURATION'])]
    grouped = mask.groupby(["A_PERSON", "B_PERSON"]).sum("DURATION")
    print(grouped)
    return grouped

In this code the group by on A_PERSON, B_PERSON is working, but the mask condition is not.
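One likely reason the mask has no effect is that it compares each row's epoch with the same row's epoch plus a positive number, so it is true for every row. A minimal sketch, using the last two sample rows and assuming DURATION is never negative:

```python
import pandas as pd

df = pd.DataFrame({
    "A_PERSON": [827, 827],
    "B_PERSON": [158, 158],
    "DATE_TIME": pd.to_datetime(["2020-04-30 13:27:22", "2020-04-30 13:27:44"]),
    "DURATION": [22, 15],
})

# Seconds since the Unix epoch, as in the attempt above.
epoch = (df["DATE_TIME"] - pd.Timestamp("1970-01-01")).dt.total_seconds()

# Each row is compared with itself: epoch < epoch + 1 + DURATION.
# For a non-negative DURATION this holds on every row.
self_mask = epoch < epoch + 1 + df["DURATION"]

# Filtering with an all-True mask returns every row unchanged.
filtered = df[self_mask]
print(self_mask.tolist(), len(filtered))
```

Because nothing is filtered out, the following groupby sums every row of each pair regardless of the time gap.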

Sample 2

A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
96897940827,3139578158,'2020-04-30 13:27:02',32
96897940827,3139578158,'2020-04-30 13:27:34',16

Desired output for sample 2

A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
96897940827,3139578158,'2020-04-30 13:27:02',48

Sample 3

A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
96897940827,3139578158,'2020-04-30 13:27:00',33
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
96897940827,3139578158,'2020-04-30 13:27:34',16

Desired output for sample 3

A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
96897940827,3139578158,'2020-04-30 13:27:00',49
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11

In the sample data, the difference in the last group is 5 seconds (13:27:59 - 13:27:54 = 5 seconds).

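The solution is to add DURATION, in seconds, to DATE_TIME in a new column add, take the per-group differences with diff, compare with the inverted condition (greater than N), use the cumulative sum as a new group column g, and finally aggregate with first and sum: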

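import pandas as pd

# dataframe is assumed to be already loaded from the CSV.
N = 5  # maximum allowed difference in seconds
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])

# End time of each row: start time plus duration in seconds.
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'], unit='s')
# A difference greater than N starts a new sub-group; cumsum turns the breaks into labels.
f = lambda x: x.diff().dt.total_seconds().gt(N).cumsum()
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].transform(f)
print(dataframe)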
   A_PERSON  B_PERSON           DATE_TIME  DURATION                 add  g
0       190       390 2020-04-20 12:44:36       323 2020-04-20 12:49:59  0
1       282       811 2020-04-06 11:12:24        25 2020-04-06 11:12:49  0
2       495       414 2020-04-20 11:22:13        11 2020-04-20 11:22:24  0
3       827       158 2020-04-30 13:27:32        22 2020-04-30 13:27:54  0
4       827       158 2020-04-30 13:27:44        15 2020-04-30 13:27:59  0

dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME':'first', 'DURATION':'sum'})
                      .droplevel(-1)
                      .reset_index())

print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION
0       190       390 2020-04-20 12:44:36       323
1       282       811 2020-04-06 11:12:24        25
2       495       414 2020-04-20 11:22:13        11
3       827       158 2020-04-30 13:27:32        37
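The labelling step (diff, then gt(N), then cumsum) can be looked at in isolation. The sketch below uses made-up end times for a single pair; only the first two values come from the output above, the last two are hypothetical.

```python
import pandas as pd

# End times for one (A_PERSON, B_PERSON) pair; the last two are hypothetical.
ends = pd.Series(pd.to_datetime([
    "2020-04-30 13:27:54",
    "2020-04-30 13:27:59",
    "2020-04-30 13:30:00",
    "2020-04-30 13:30:03",
]))

N = 5

# Seconds between consecutive end times; the first value is NaN.
gaps = ends.diff().dt.total_seconds()

# True marks a row that starts a new block. NaN compares as False,
# so the first row never opens a new block by itself.
new_block = gaps.gt(N)

# Each True increments the running count, giving one label per block.
labels = new_block.cumsum()
print(labels.tolist())
```

Rows that share a label are then aggregated together by adding the label to the groupby keys.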

If you need to compare the DATE_TIME column with the add column shifted per group, the solution (with the new data) is:

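N = 2  # maximum allowed gap in seconds

dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])

# End time of each row: start time plus duration in seconds.
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'], unit='s')
# End time of the previous row of the same pair (NaT for the first row of a pair).
prev_end = dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()
# True where a row starts more than N seconds after the previous row ended.
dataframe['diff'] = dataframe['DATE_TIME'].sub(prev_end).dt.total_seconds().gt(N)

# Running count of breaks per pair gives the sub-group label.
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print(dataframe)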
   A_PERSON  B_PERSON           DATE_TIME  DURATION                 add   diff  g
0       190       390 2020-04-20 12:44:36       323 2020-04-20 12:49:59  False  0
1       282       811 2020-04-06 11:12:24        25 2020-04-06 11:12:49  False  0
2       495       414 2020-04-20 11:22:13        11 2020-04-20 11:22:24  False  0
3       827       158 2020-04-30 13:27:22        22 2020-04-30 13:27:44  False  0
4       827       158 2020-04-30 13:27:44        15 2020-04-30 13:27:59  False  0

dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME':'first', 'DURATION':'sum'})
                      .droplevel(-1)
                      .reset_index())

print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION
0       190       390 2020-04-20 12:44:36       323
1       282       811 2020-04-06 11:12:24        25
2       495       414 2020-04-20 11:22:13        11
3       827       158 2020-04-30 13:27:22        37
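To see what the per-group shift contributes, the sketch below takes three rows from the first sample and deliberately interleaves them so the 827/158 rows are not adjacent. The reordering is an assumption made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "A_PERSON": [827, 190, 827],
    "B_PERSON": [158, 390, 158],
    "DATE_TIME": pd.to_datetime([
        "2020-04-30 13:27:22",
        "2020-04-20 12:44:36",
        "2020-04-30 13:27:44",
    ]),
    "DURATION": [22, 323, 15],
})

df["add"] = df["DATE_TIME"] + pd.to_timedelta(df["DURATION"], unit="s")

# shift() runs inside each (A_PERSON, B_PERSON) group, so row 2 sees the end
# time of row 0 even though row 1 sits between them. The first row of each
# pair has no predecessor and gets NaT.
prev_end = df.groupby(["A_PERSON", "B_PERSON"])["add"].shift()

# Gap in seconds between a row's start and the previous end of the same pair.
gap = df["DATE_TIME"].sub(prev_end).dt.total_seconds()
print(gap.tolist())
```

A plain shift() without the groupby would compare row 2 with row 1, which belongs to a different pair.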

Tested with the third sample:

N = 2

dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])

dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'], unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)

dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print(dataframe)
       A_PERSON    B_PERSON           DATE_TIME  DURATION                 add   diff  g
0  441785807190  4299330390 2020-04-20 12:44:36       323 2020-04-20 12:49:59  False  0
1   96897940827  3139578158 2020-04-30 13:27:00        33 2020-04-30 13:27:33  False  0
2  441785808282  4238900811 2020-04-06 11:12:24        25 2020-04-06 11:12:49  False  0
3   14244012495  3104405414 2020-04-20 11:22:13        11 2020-04-20 11:22:24  False  0
4   96897940827  3139578158 2020-04-30 13:27:34        16 2020-04-30 13:27:50  False  0

dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME':'first', 'DURATION':'sum'})
                      .droplevel(-1)
                      .reset_index())

print(dataframe)
       A_PERSON    B_PERSON           DATE_TIME  DURATION
0   14244012495  3104405414 2020-04-20 11:22:13        11
1   96897940827  3139578158 2020-04-30 13:27:00        49
2  441785807190  4299330390 2020-04-20 12:44:36       323
3  441785808282  4238900811 2020-04-06 11:12:24        25
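The result above is sorted by the group keys, while the desired output for sample 3 keeps the order in which each pair first appears. A possible way to get that order is to pass sort=False to groupby. The sketch below wraps the same steps in a helper; the name merge_adjacent and its max_gap_seconds parameter are hypothetical, and the rows of each pair are assumed to be in chronological order already.

```python
import pandas as pd

def merge_adjacent(df, max_gap_seconds=2):
    """Hypothetical helper: merge consecutive rows of the same
    (A_PERSON, B_PERSON) pair when the next row starts at most
    max_gap_seconds after the previous row ended."""
    keys = ["A_PERSON", "B_PERSON"]
    out = df.copy()
    out["DATE_TIME"] = pd.to_datetime(out["DATE_TIME"])
    # End time of each row.
    out["add"] = out["DATE_TIME"] + pd.to_timedelta(out["DURATION"], unit="s")
    # End time of the previous row of the same pair.
    prev_end = out.groupby(keys)["add"].shift()
    # 1 where a new sub-group starts, 0 otherwise (cast to int before cumsum).
    gap = out["DATE_TIME"].sub(prev_end).dt.total_seconds()
    out["diff"] = gap.gt(max_gap_seconds).astype(int)
    out["g"] = out.groupby(keys)["diff"].cumsum()
    # sort=False keeps groups in order of first appearance.
    return (out.groupby(keys + ["g"], sort=False)
               .agg(DATE_TIME=("DATE_TIME", "first"), DURATION=("DURATION", "sum"))
               .droplevel("g")
               .reset_index())

sample3 = pd.DataFrame({
    "A_PERSON": [441785807190, 96897940827, 441785808282, 14244012495, 96897940827],
    "B_PERSON": [4299330390, 3139578158, 4238900811, 3104405414, 3139578158],
    "DATE_TIME": ["2020-04-20 12:44:36", "2020-04-30 13:27:00", "2020-04-06 11:12:24",
                  "2020-04-20 11:22:13", "2020-04-30 13:27:34"],
    "DURATION": [323, 33, 25, 11, 16],
})

result = merge_adjacent(sample3)
print(result)  # DURATION column: 323, 49, 25, 11
```

If the input is not guaranteed to be chronological within each pair, sorting by DATE_TIME before calling the helper would be needed, at the cost of changing the row order.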
