Group by 2 columns and merge rows if the time difference between rows of a third column is less than 2 seconds (Python pandas)
I have this data in my CSV:
A_PERSON,B_PERSON,DATE_TIME,DURATION
190,390,'2020-04-20 12:44:36',323
282,811,'2020-04-06 11:12:24',25
495,414,'2020-04-20 11:22:13',11
827,158,'2020-04-30 13:27:22',22
827,158,'2020-04-30 13:27:44',15
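I am trying to group rows that have the same A_PERSON, B_PERSON and where the difference between one row's DATE_TIME + DURATION and another row's DATE_TIME is less than 2 seconds. For example, the last 2 rows have the same A_PERSON, B_PERSON, and the difference between (DATE_TIME of the second-last row + DURATION of the second-last row) and (DATE_TIME of the last row) is less than 2 seconds, so only the last two rows should be merged, and all other rows should appear as they are.
Desired output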
A_PERSON,B_PERSON,DATE_TIME,DURATION
190,390,'2020-04-20 12:44:36',323
282,811,'2020-04-06 11:12:24',25
495,414,'2020-04-20 11:22:13',11
827,158,'2020-04-30 13:27:22',37
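To make the rule concrete, here is a small sketch of the arithmetic for the last two rows of the sample. The variable names are illustrative only; the values are taken from the data above.

```python
import pandas as pd

# Last two rows of the sample, same A_PERSON/B_PERSON pair (827, 158).
prev_start = pd.Timestamp("2020-04-30 13:27:22")
prev_duration = 22  # seconds
next_start = pd.Timestamp("2020-04-30 13:27:44")
next_duration = 15  # seconds

# End of the previous row = its start plus its duration.
prev_end = prev_start + pd.Timedelta(seconds=prev_duration)

# Gap between the previous row's end and the next row's start.
gap_seconds = (next_start - prev_end).total_seconds()

# The rows are merged when the gap is below the 2-second threshold.
should_merge = gap_seconds < 2
merged_duration = prev_duration + next_duration if should_merge else None
print(prev_end, gap_seconds, should_merge, merged_duration)
```

The gap is 0 seconds, so the two rows collapse into one row that keeps the first DATE_TIME and a DURATION of 37.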
So far I have tried this:
import datetime

import pandas as pd

def merger(dataframe:pd.core.frame.DataFrame)->pd.core.frame.DataFrame:
    dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
    dataframe['epoch'] = (dataframe['DATE_TIME'] - datetime.datetime(1970,1,1)).dt.total_seconds()
    mask = dataframe[((dataframe['epoch']) < dataframe['epoch'] + 1 + dataframe['DURATION'])]
    grouped = mask.groupby(["A_PERSON", "B_PERSON"]).sum("DURATION")
    print(grouped)
    return grouped
In this code the group by on A_PERSON, B_PERSON is working, but the where mask is not working.
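A quick check suggests why the filter has no effect: the condition compares each row's epoch with the same row's epoch plus a positive number, so it holds for every row. The sketch below assumes the first sample has been loaded with the quote characters around DATE_TIME already stripped; `condition`, `kept` and `totals` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "A_PERSON": [190, 282, 495, 827, 827],
    "B_PERSON": [390, 811, 414, 158, 158],
    "DATE_TIME": pd.to_datetime([
        "2020-04-20 12:44:36", "2020-04-06 11:12:24", "2020-04-20 11:22:13",
        "2020-04-30 13:27:22", "2020-04-30 13:27:44",
    ]),
    "DURATION": [323, 25, 11, 22, 15],
})

epoch = (df["DATE_TIME"] - pd.Timestamp("1970-01-01")).dt.total_seconds()

# Same comparison as in the attempt: a row is only ever compared with itself.
condition = epoch < epoch + 1 + df["DURATION"]
kept = df[condition]

# Because nothing is filtered out, every row of a pair is summed,
# regardless of how far apart the rows are in time.
totals = kept.groupby(["A_PERSON", "B_PERSON"])["DURATION"].sum()
print(condition.all(), len(kept) == len(df))
```

Comparing a row with the previous row of the same pair requires shifting values within each group, which this attempt does not do.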
Sample 2
A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
96897940827,3139578158,'2020-04-30 13:27:02',32
96897940827,3139578158,'2020-04-30 13:27:34',16
Desired output for sample 2
A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
96897940827,3139578158,'2020-04-30 13:27:02',48
Sample 3
A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
96897940827,3139578158,'2020-04-30 13:27:00',33
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
96897940827,3139578158,'2020-04-30 13:27:34',16
Desired output for sample 3
A_PERSON,B_PERSON,DATE_TIME,DURATION
441785807190,4299330390,'2020-04-20 12:44:36',323
96897940827,3139578158,'2020-04-30 13:27:00',49
441785808282,4238900811,'2020-04-06 11:12:24',25
14244012495,3104405414,'2020-04-20 11:22:13',11
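In the sample data, the rows of the last group differ by 5 seconds (13:27:59 - 13:27:54 = 5 seconds).
The solution is to add DURATION, in seconds, to DATE_TIME in a new column add, get the difference per group with DataFrameGroupBy.diff, compare with the inverted condition (greater than N), take the cumulative sum to build a new group column g, and finally aggregate with first and sum: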
N = 5
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
f = lambda x: x.diff().dt.total_seconds().gt(N).cumsum()
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].transform(f)
print (dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION                 add  g
0       190       390 2020-04-20 12:44:36       323 2020-04-20 12:49:59  0
1       282       811 2020-04-06 11:12:24        25 2020-04-06 11:12:49  0
2       495       414 2020-04-20 11:22:13        11 2020-04-20 11:22:24  0
3       827       158 2020-04-30 13:27:32        22 2020-04-30 13:27:54  0
4       827       158 2020-04-30 13:27:44        15 2020-04-30 13:27:59  0
dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME':'first', 'DURATION':'sum'})
                      .droplevel(-1)
                      .reset_index())
print (dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION
0       190       390 2020-04-20 12:44:36       323
1       282       811 2020-04-06 11:12:24        25
2       495       414 2020-04-20 11:22:13        11
3       827       158 2020-04-30 13:27:32        37
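The cumulative-sum step is the least obvious part of the approach above, so here it is in isolation. The gap values are made up for illustration and stand for the per-row differences within a single A_PERSON/B_PERSON pair.

```python
import pandas as pd

# Hypothetical gaps (in seconds) between consecutive rows of one pair;
# the first row has no predecessor, hence NaN.
gaps = pd.Series([float("nan"), 1.0, 7.0, 0.0, 30.0])

N = 2
# True marks a row that is too far from the previous one and therefore
# starts a new group; NaN compares as False.
starts_new_group = gaps.gt(N)

# Running count of True values: rows share a label until the next True.
group_id = starts_new_group.astype(int).cumsum()
print(group_id.tolist())
```

Rows with the same label end up in the same aggregated row, so the first two rows merge, the next two merge, and the last one stays alone.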
If you need to compare the DATE_TIME column with the add column shifted per group, the solution (with the new data) is:
N = 2
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print (dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION                 add  \
0       190       390 2020-04-20 12:44:36       323 2020-04-20 12:49:59
1       282       811 2020-04-06 11:12:24        25 2020-04-06 11:12:49
2       495       414 2020-04-20 11:22:13        11 2020-04-20 11:22:24
3       827       158 2020-04-30 13:27:22        22 2020-04-30 13:27:44
4       827       158 2020-04-30 13:27:44        15 2020-04-30 13:27:59

    diff  g
0  False  0
1  False  0
2  False  0
3  False  0
4  False  0
dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME':'first', 'DURATION':'sum'})
                      .droplevel(-1)
                      .reset_index())
print (dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION
0       190       390 2020-04-20 12:44:36       323
1       282       811 2020-04-06 11:12:24        25
2       495       414 2020-04-20 11:22:13        11
3       827       158 2020-04-30 13:27:22        37
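The steps above can be wrapped into a reusable function. This is only a sketch: `merge_close_rows`, `end` and `new` are illustrative names that do not appear in the answer, and the explicit sort is an added assumption, since the shift only compares neighbouring rows and so relies on each pair being in chronological order.

```python
import pandas as pd

def merge_close_rows(df: pd.DataFrame, n: float = 2) -> pd.DataFrame:
    """Merge consecutive rows of a pair whose gap is at most n seconds."""
    keys = ["A_PERSON", "B_PERSON"]
    out = df.copy()
    out["DATE_TIME"] = pd.to_datetime(out["DATE_TIME"])
    # Assumption: rows must be chronological within each pair.
    out = out.sort_values(keys + ["DATE_TIME"])
    out["end"] = out["DATE_TIME"] + pd.to_timedelta(out["DURATION"], unit="s")
    # Gap between this row's start and the previous row's end in the same pair.
    gap = (out["DATE_TIME"] - out.groupby(keys)["end"].shift()).dt.total_seconds()
    # 1 marks a row that starts a new group; the first row of a pair gets 0.
    out["new"] = gap.gt(n).astype(int)
    out["g"] = out.groupby(keys)["new"].cumsum()
    return (out.groupby(keys + ["g"], as_index=False)
               .agg(DATE_TIME=("DATE_TIME", "first"), DURATION=("DURATION", "sum"))
               .drop(columns="g"))

sample = pd.DataFrame({
    "A_PERSON": [190, 282, 495, 827, 827],
    "B_PERSON": [390, 811, 414, 158, 158],
    "DATE_TIME": ["2020-04-20 12:44:36", "2020-04-06 11:12:24",
                  "2020-04-20 11:22:13", "2020-04-30 13:27:22",
                  "2020-04-30 13:27:44"],
    "DURATION": [323, 25, 11, 22, 15],
})
result = merge_close_rows(sample)
print(result)
```

Like the answer, this uses `gt(n)`, so a gap of exactly n seconds still merges; switching to `ge(n)` would match a strict "less than 2 seconds" reading of the question.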
Tested with the third sample:
N = 2
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print (dataframe)
       A_PERSON    B_PERSON           DATE_TIME  DURATION                 add  \
0  441785807190  4299330390 2020-04-20 12:44:36       323 2020-04-20 12:49:59
1   96897940827  3139578158 2020-04-30 13:27:00        33 2020-04-30 13:27:33
2  441785808282  4238900811 2020-04-06 11:12:24        25 2020-04-06 11:12:49
3   14244012495  3104405414 2020-04-20 11:22:13        11 2020-04-20 11:22:24
4   96897940827  3139578158 2020-04-30 13:27:34        16 2020-04-30 13:27:50

    diff  g
0  False  0
1  False  0
2  False  0
3  False  0
4  False  0
dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME':'first', 'DURATION':'sum'})
                      .droplevel(-1)
                      .reset_index())
print (dataframe)
       A_PERSON    B_PERSON           DATE_TIME  DURATION
0   14244012495  3104405414 2020-04-20 11:22:13        11
1   96897940827  3139578158 2020-04-30 13:27:00        49
2  441785807190  4299330390 2020-04-20 12:44:36       323
3  441785808282  4238900811 2020-04-06 11:12:24        25
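The desired output for sample 3 lists the rows in their original order, whereas the result above is sorted by the group keys. If the order matters, one possible adjustment is to pass `sort=False` to the final `groupby`, which keeps groups in order of first appearance. The sketch below rebuilds sample 3 by hand; the `keys` and `new` names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A_PERSON": [441785807190, 96897940827, 441785808282,
                 14244012495, 96897940827],
    "B_PERSON": [4299330390, 3139578158, 4238900811,
                 3104405414, 3139578158],
    "DATE_TIME": pd.to_datetime([
        "2020-04-20 12:44:36", "2020-04-30 13:27:00", "2020-04-06 11:12:24",
        "2020-04-20 11:22:13", "2020-04-30 13:27:34",
    ]),
    "DURATION": [323, 33, 25, 11, 16],
})

N = 2
keys = ["A_PERSON", "B_PERSON"]
df["add"] = df["DATE_TIME"] + pd.to_timedelta(df["DURATION"], unit="s")
# Gap between each row's start and the previous end within the same pair.
gap = df["DATE_TIME"].sub(df.groupby(keys)["add"].shift()).dt.total_seconds()
df["new"] = gap.gt(N).astype(int)
df["g"] = df.groupby(keys)["new"].cumsum()

# sort=False keeps groups in order of first appearance instead of key order.
out = (df.groupby(keys + ["g"], sort=False)
         .agg({"DATE_TIME": "first", "DURATION": "sum"})
         .droplevel(-1)
         .reset_index())
print(out)
```

With this change the merged row for 96897940827 stays in second position, as in the desired output for sample 3.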