简体   繁体   中英

Pandas returns NaN as difference between column datetime

I have a dataframe as below

*------------------------------------------------------------*
| started                   act_id    from_state     to_state|
*------------------------------------------------------------*                      
|2019-11-06 05:49:39.571392  2         CREATED      ENABLED  |
|2019-11-25 22:20:59.150339  2         ENABLED      DISABLED |
|2019-11-26 10:22:36.571392  2         DISABLED     ENABLED  |
|2019-11-14 14:57:02.571392  3         CREATED      ENABLED  |
|2019-12-06 16:03:44.255603  3         ENABLED      DISABLED |
|2019-12-12 12:50:48.571392  3         DISABLED     ENABLED  |
*------------------------------------------------------------*

I want to calculate total time in days by act_id to show how long that act_id stayed in to_state . So how long act_id in status ENABLED or DISABLED before it changed it state from ENABLED to DISABLED?

here is my code

import pandas as pd
import numpy as np

df = pd.read_csv('transitions.csv', index_col=0)

df['started'] = pd.to_datetime(df['started'])

df['total_time'] = 0

df['total_time'] = df.groupby(['account_id', 'from_state', 'to_state'])['started'].diff()/np.timedelta64(1, 'D')

df

but when it is giving me output in my new field total_time as NaN instead of showing in days

*------------------------------------------------------------------------------*
| started                   act_id    from_state     to_state       total_time |
*------------------------------------------------------------------------------*                        
|2019-11-06 05:49:39.571392  2         CREATED      ENABLED          NaN       |
|2019-11-25 22:20:59.150339  2         ENABLED      DISABLED         NaN       |
|2019-11-26 10:22:36.571392  2         DISABLED     ENABLED          NaN       |
|2019-11-14 14:57:02.571392  3         CREATED      ENABLED          NaN       |
|2019-12-06 16:03:44.255603  3         ENABLED      DISABLED         NaN       |
|2019-12-12 12:50:48.571392  3         DISABLED     ENABLED          NaN       |
*------------------------------------------------------------------------------*

I want my expected output as

*------------------------------------------------------------------------------*
| started                   act_id    from_state     to_state       total_time |
*------------------------------------------------------------------------------*                        
|2019-11-06 05:49:39.571392  2         CREATED      ENABLED          0         |
|2019-11-25 22:20:59.150339  2         ENABLED      DISABLED         19        |
|2019-11-26 10:22:36.571392  2         DISABLED     ENABLED          1         |
|2019-11-14 14:57:02.571392  3         CREATED      ENABLED          0         |
|2019-12-06 16:03:44.255603  3         ENABLED      DISABLED         22        |
|2019-12-12 12:50:48.571392  3         DISABLED     ENABLED          6         |
*------------------------------------------------------------------------------*

where I am doing wrong?

I think problem here if grouping by all 3 columns each group contains only one row, so difference is always NaT .

But if grouping by ID only:

df['started'] = pd.to_datetime(df['started'])

df['total_time'] = (df.groupby('act_id')['started'].diff()/np.timedelta64(1, 'D')).fillna(0)
print (df)
                     started  act_id from_state  to_state  total_time
0 2019-11-06 05:49:39.571392       2    CREATED   ENABLED    0.000000
1 2019-11-25 22:20:59.150339       2    ENABLED  DISABLED   19.688421
2 2019-11-26 10:22:36.571392       2   DISABLED   ENABLED    0.501128
3 2019-11-14 14:57:02.571392       3    CREATED   ENABLED    0.000000
4 2019-12-06 16:03:44.255603       3    ENABLED  DISABLED   22.046316
5 2019-12-12 12:50:48.571392       3   DISABLED   ENABLED    5.866022

If need test also from and to state is possible shift column to_state per ID , first values replace by from_state and compare both columns if equal, then mask pass to last line of code:

df['started'] = pd.to_datetime(df['started'])

df['to_state1'] = df.groupby('act_id')['to_state'].shift().fillna(df['from_state'])
print (df)
                     started  act_id from_state  to_state to_state1
0 2019-11-06 05:49:39.571392       2    CREATED   ENABLED   CREATED
1 2019-11-25 22:20:59.150339       2    ENABLED  DISABLED   ENABLED
2 2019-11-26 10:22:36.571392       2   DISABLED   ENABLED  DISABLED
3 2019-11-14 14:57:02.571392       3    CREATED   ENABLED   CREATED
4 2019-12-06 16:03:44.255603       3    ENABLED  DISABLED   ENABLED
5 2019-12-12 12:50:48.571392       3   DISABLED   ENABLED  DISABLED

m = df['from_state'].eq(df['to_state1'])
print (m)
0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

df['total_time'] = (df[m].groupby('act_id')['started'].diff()/np.timedelta64(1, 'D')).fillna(0)
print (df)
                     started  act_id from_state  to_state to_state1  \
0 2019-11-06 05:49:39.571392       2    CREATED   ENABLED   CREATED   
1 2019-11-25 22:20:59.150339       2    ENABLED  DISABLED   ENABLED   
2 2019-11-26 10:22:36.571392       2   DISABLED   ENABLED  DISABLED   
3 2019-11-14 14:57:02.571392       3    CREATED   ENABLED   CREATED   
4 2019-12-06 16:03:44.255603       3    ENABLED  DISABLED   ENABLED   
5 2019-12-12 12:50:48.571392       3   DISABLED   ENABLED  DISABLED   

   total_time  
0    0.000000  
1   19.688421  
2    0.501128  
3    0.000000  
4   22.046316  
5    5.866022  
df['started'] = pd.to_datetime(df['started'])

df = df.merge(pd.DataFrame( pd.DataFrame( df.groupby(['act_id', 'from_state', 'to_state']).count())), how='outer', indicator=False, on=['act_id', 'from_state', 'to_state']  )

You might need to rename your dataframe after merge accordingly. Hope this would give you the answer

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM