Pandas returns NaN as difference between column datetime
I have a dataframe that looks like this:
*------------------------------------------------------------*
| started act_id from_state to_state|
*------------------------------------------------------------*
|2019-11-06 05:49:39.571392 2 CREATED ENABLED |
|2019-11-25 22:20:59.150339 2 ENABLED DISABLED |
|2019-11-26 10:22:36.571392 2 DISABLED ENABLED |
|2019-11-14 14:57:02.571392 3 CREATED ENABLED |
|2019-12-06 16:03:44.255603 3 ENABLED DISABLED |
|2019-12-12 12:50:48.571392 3 DISABLED ENABLED |
*------------------------------------------------------------*
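I want to calculate, per act_id, the total time in days that the act_id stayed in each state. In other words, how long was an act_id in the ENABLED or DISABLED state before its state changed (for example, from ENABLED to DISABLED)?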
Here is my code:
import pandas as pd
import numpy as np
df = pd.read_csv('transitions.csv', index_col=0)
df['started'] = pd.to_datetime(df['started'])
df['total_time'] = 0
df['total_time'] = df.groupby(['act_id', 'from_state', 'to_state'])['started'].diff()/np.timedelta64(1, 'D')
df
But the new total_time column shows NaN instead of the number of days:
*------------------------------------------------------------------------------*
| started act_id from_state to_state total_time |
*------------------------------------------------------------------------------*
|2019-11-06 05:49:39.571392 2 CREATED ENABLED NaN |
|2019-11-25 22:20:59.150339 2 ENABLED DISABLED NaN |
|2019-11-26 10:22:36.571392 2 DISABLED ENABLED NaN |
|2019-11-14 14:57:02.571392 3 CREATED ENABLED NaN |
|2019-12-06 16:03:44.255603 3 ENABLED DISABLED NaN |
|2019-12-12 12:50:48.571392 3 DISABLED ENABLED NaN |
*------------------------------------------------------------------------------*
My expected output is:
*------------------------------------------------------------------------------*
| started act_id from_state to_state total_time |
*------------------------------------------------------------------------------*
|2019-11-06 05:49:39.571392 2 CREATED ENABLED 0 |
|2019-11-25 22:20:59.150339 2 ENABLED DISABLED 19 |
|2019-11-26 10:22:36.571392 2 DISABLED ENABLED 1 |
|2019-11-14 14:57:02.571392 3 CREATED ENABLED 0 |
|2019-12-06 16:03:44.255603 3 ENABLED DISABLED 22 |
|2019-12-12 12:50:48.571392 3 DISABLED ENABLED 6 |
*------------------------------------------------------------------------------*
What am I doing wrong?
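I think the problem is that grouping by all 3 columns leaves each group with only one row, so the difference is always NaT.
But if you group by the ID column (act_id) only: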
df['started'] = pd.to_datetime(df['started'])
df['total_time'] = (df.groupby('act_id')['started'].diff()/np.timedelta64(1, 'D')).fillna(0)
print (df)
started act_id from_state to_state total_time
0 2019-11-06 05:49:39.571392 2 CREATED ENABLED 0.000000
1 2019-11-25 22:20:59.150339 2 ENABLED DISABLED 19.688421
2 2019-11-26 10:22:36.571392 2 DISABLED ENABLED 0.501128
3 2019-11-14 14:57:02.571392 3 CREATED ENABLED 0.000000
4 2019-12-06 16:03:44.255603 3 ENABLED DISABLED 22.046316
5 2019-12-12 12:50:48.571392 3 DISABLED ENABLED 5.866022
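The diagnosis above can be checked directly. The following is a minimal sketch, assuming the six sample rows from the question; the inline CSV text stands in for transitions.csv and is not part of the original post. It compares the group sizes and the diff() results for both groupings.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question; the inline CSV is an assumption
# that stands in for transitions.csv.
csv_text = """started,act_id,from_state,to_state
2019-11-06 05:49:39.571392,2,CREATED,ENABLED
2019-11-25 22:20:59.150339,2,ENABLED,DISABLED
2019-11-26 10:22:36.571392,2,DISABLED,ENABLED
2019-11-14 14:57:02.571392,3,CREATED,ENABLED
2019-12-06 16:03:44.255603,3,ENABLED,DISABLED
2019-12-12 12:50:48.571392,3,DISABLED,ENABLED
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["started"])

# Every (act_id, from_state, to_state) combination occurs exactly once ...
sizes_all = df.groupby(["act_id", "from_state", "to_state"]).size()
print(sizes_all.max())  # 1

# ... so diff() never has a previous row to subtract within a group.
diff_all = df.groupby(["act_id", "from_state", "to_state"])["started"].diff()
print(diff_all.isna().all())  # True

# Grouping by act_id alone keeps three rows per group, so only the first
# row of each group lacks a predecessor.
diff_id = df.groupby("act_id")["started"].diff() / np.timedelta64(1, "D")
print(diff_id.isna().sum())  # 2
```

The two remaining missing values are the first rows of act_id 2 and 3, which is what the fillna(0) call in the answer replaces with 0.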
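If you also need to check the from and to states, you can shift the to_state column within each ID, replace the first value with from_state, and compare the two columns for equality. The resulting mask is then passed to the last line of code: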
df['started'] = pd.to_datetime(df['started'])
df['to_state1'] = df.groupby('act_id')['to_state'].shift().fillna(df['from_state'])
print (df)
started act_id from_state to_state to_state1
0 2019-11-06 05:49:39.571392 2 CREATED ENABLED CREATED
1 2019-11-25 22:20:59.150339 2 ENABLED DISABLED ENABLED
2 2019-11-26 10:22:36.571392 2 DISABLED ENABLED DISABLED
3 2019-11-14 14:57:02.571392 3 CREATED ENABLED CREATED
4 2019-12-06 16:03:44.255603 3 ENABLED DISABLED ENABLED
5 2019-12-12 12:50:48.571392 3 DISABLED ENABLED DISABLED
m = df['from_state'].eq(df['to_state1'])
print (m)
0 True
1 True
2 True
3 True
4 True
5 True
dtype: bool
df['total_time'] = (df[m].groupby('act_id')['started'].diff()/np.timedelta64(1, 'D')).fillna(0)
print (df)
started act_id from_state to_state to_state1 \
0 2019-11-06 05:49:39.571392 2 CREATED ENABLED CREATED
1 2019-11-25 22:20:59.150339 2 ENABLED DISABLED ENABLED
2 2019-11-26 10:22:36.571392 2 DISABLED ENABLED DISABLED
3 2019-11-14 14:57:02.571392 3 CREATED ENABLED CREATED
4 2019-12-06 16:03:44.255603 3 ENABLED DISABLED ENABLED
5 2019-12-12 12:50:48.571392 3 DISABLED ENABLED DISABLED
total_time
0 0.000000
1 19.688421
2 0.501128
3 0.000000
4 22.046316
5 5.866022
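With the sample data the mask is True for every row, so the filter changes nothing. The sketch below uses hypothetical data (not from the question) in which the last row's from_state does not match the previous to_state, to show what the mask does in that case. Note that rows excluded by the mask are absent from the diff result, so they end up as NaN after assignment even though fillna(0) was applied.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row's from_state (CREATED) does not match
# the previous to_state (DISABLED) for act_id 2.
df = pd.DataFrame({
    "started": pd.to_datetime([
        "2019-11-06 05:49:39", "2019-11-25 22:20:59", "2019-11-26 10:22:36",
    ]),
    "act_id": [2, 2, 2],
    "from_state": ["CREATED", "ENABLED", "CREATED"],
    "to_state": ["ENABLED", "DISABLED", "ENABLED"],
})

# Previous to_state within each act_id; the first row falls back to from_state.
prev_to = df.groupby("act_id")["to_state"].shift().fillna(df["from_state"])
m = df["from_state"].eq(prev_to)
print(m.tolist())  # [True, True, False]

# diff() runs only on the consistent rows; assignment aligns on the index,
# so the excluded row receives NaN rather than 0.
days = df[m].groupby("act_id")["started"].diff() / np.timedelta64(1, "D")
df["total_time"] = days.fillna(0)
print(df["total_time"].isna().tolist())  # [False, False, True]
```

If excluded rows should also show 0, one option is to call fillna(0) on the column after the assignment instead of on the intermediate result.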
df['started'] = pd.to_datetime(df['started'])
df = df.merge(df.groupby(['act_id', 'from_state', 'to_state']).count(), how='outer', on=['act_id', 'from_state', 'to_state'])
You may need to rename the columns accordingly after the merge. Hope this gives you the answer.