Python - How to compare the values for row and previous row and return time difference based on condition?

I have a dataset that contains Timestamp, clientId and Session_ID columns. As you can see, the clientId and Session_ID values repeat over time.

Timestamp   clientId    Session_ID
0   6/12/2021 15:05 27255667    ab89
1   6/12/2021 19:56 118698247   684a
2   6/12/2021 23:59 99492237    a4fd
3   6/12/2021 23:59 99492237    a4fd
4   6/12/2021 23:59 99492237    a4fd
5   6/12/2021 23:59 99492237    a4fd
6   6/13/2021 0:06  99492237    a5fd
7   6/13/2021 0:41  142584462   c23000
8   6/13/2021 23:33 142584462   c23000
9   6/13/2021 23:33 142584462   c23000
10  6/13/2021 23:33 142584462   c23000
11  6/13/2021 23:34 142584462   c23000
12  6/13/2021 23:34 142584462   c23000
13  6/13/2021 23:34 142584462   7d97

I need to find the instances where a clientId gets a new Session_ID, and then I need to calculate the time difference between the previous Session_ID and the new Session_ID for the same client.

For example, clientId 99492237 had four rows with the same Session_ID and then the fifth one was different. The time difference would be 6/13/2021 0:06 - 6/12/2021 23:59.

This is what I tried so far:

# importing dependencies
import pandas as pd
import numpy as np
# importing data from SampleData.csv
df = pd.read_csv('data/SampleData.csv' ,converters={'clientId':str})
# converting the timestamp string into a datetime dtype
df["Time"] = pd.to_datetime(df["Timestamp"])
df.sort_values(["Timestamp", "clientId"], ascending = (True, True))
df["SameSession?"] = np.where((df['clientId'] == df['clientId'].shift(-1)) & (df['Session_ID'] == df['Session_ID'].shift(-1)), "YES", "NO")
minMax = df.groupby('clientId').agg(minTime=('Time', 'min'), maxTime=('Time', 'max'))
minMax['Diff'] = minMax['maxTime'] - minMax['minTime']
df = df.merge(minMax[['Diff']], on='clientId')

So I tried sorting everything by Time and clientId so that rows with the same clientId follow one another. Then I tried comparing each row to the row above on clientId and Session_ID. If both are the same, return YES; if either is different, return NO. Then I calculated the difference between the first and last instance of each clientId. But I am getting wrong values when I compare each row to the row above.

Here is the output:

        Timestamp   clientId    Session_ID  Time    SameSession?    Diff
0   6/12/2021 15:05 27255667    ab89    2021-06-12 15:05:00 NO  0 days 00:00:00
1   6/12/2021 19:56 118698247   684a    2021-06-12 19:56:00 NO  0 days 00:00:00
2   6/12/2021 23:59 99492237    a4fd    2021-06-12 23:59:00 YES 0 days 00:07:00
3   6/12/2021 23:59 99492237    a4fd    2021-06-12 23:59:00 YES 0 days 00:07:00
4   6/12/2021 23:59 99492237    a4fd    2021-06-12 23:59:00 YES 0 days 00:07:00
5   6/12/2021 23:59 99492237    a4fd    2021-06-12 23:59:00 NO  0 days 00:07:00
6   6/13/2021 0:06  99492237    a5fd    2021-06-13 00:06:00 NO  0 days 00:07:00
7   6/13/2021 0:41  142584462   c23000  2021-06-13 00:41:00 YES 0 days 22:53:00
8   6/13/2021 23:33 142584462   c23000  2021-06-13 23:33:00 YES 0 days 22:53:00
9   6/13/2021 23:33 142584462   c23000  2021-06-13 23:33:00 YES 0 days 22:53:00
10  6/13/2021 23:33 142584462   c23000  2021-06-13 23:33:00 YES 0 days 22:53:00
11  6/13/2021 23:34 142584462   c23000  2021-06-13 23:34:00 YES 0 days 22:53:00
12  6/13/2021 23:34 142584462   c23000  2021-06-13 23:34:00 NO  0 days 22:53:00
13  6/13/2021 23:34 142584462   7d97    2021-06-13 23:34:00 NO  0 days 22:53:00

So how do I need to adjust my code so that I can flag a new Session_ID for the same clientId and show the time difference between the new session and the previous session for the same client?

The output I expect to get should look like this:

            Timestamp   clientId    Session_ID  Time    SameSession?    Diff
0   6/12/2021 15:05 27255667    ab89    2021-06-12 15:05:00 NO  0 days 
1   6/12/2021 19:56 118698247   684a    2021-06-12 19:56:00 NO  0 days 
2   6/12/2021 23:59 99492237    a4fd    2021-06-12 23:59:00 YES 0 days 
3   6/12/2021 23:59 99492237    a4fd    2021-06-12 23:59:00 YES 0 days 
4   6/12/2021 23:59 99492237    a4fd    2021-06-12 23:59:00 YES 0 days 
5   6/12/2021 23:59 99492237    a4fd    2021-06-12 23:59:00 YES 0 days 
6   6/13/2021 0:06  99492237    a5fd    2021-06-13 00:06:00 NO  0 days 00:07:00
7   6/13/2021 0:41  142584462   c23000  2021-06-13 00:41:00 NO  0 days 
8   6/13/2021 23:33 142584462   c23000  2021-06-13 23:33:00 YES 0 days 
9   6/13/2021 23:33 142584462   c23000  2021-06-13 23:33:00 YES 0 days 
10  6/13/2021 23:33 142584462   c23000  2021-06-13 23:33:00 YES 0 days 
11  6/13/2021 23:34 142584462   c23000  2021-06-13 23:34:00 YES 0 days 
12  6/13/2021 23:34 142584462   c23000  2021-06-13 23:34:00 YES 0 days 
13  6/13/2021 23:34 142584462   7d97    2021-06-13 23:34:00 NO  0 days 22:53:00

To get correct values in SameSession? you need shift(1) instead of shift(-1). shift(1) compares each row with the previous row, while shift(-1) compares it with the next row.
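
To see why the direction matters, here is a small sketch on a made-up three-row frame (the data and the names prev_same and next_same are illustrative, not taken from the question). With shift(1) the flag lands on the second row of a repeated session; with shift(-1) it lands on the first row instead, which is the off-by-one visible in the question's output.

```python
import pandas as pd

# Illustrative data only: one client whose session changes on the last row
df = pd.DataFrame({
    "clientId": ["1", "1", "1"],
    "Session_ID": ["a", "a", "b"],
})

# shift(1) moves values down, so each row is compared with the row above it
prev_same = (df["clientId"] == df["clientId"].shift(1)) & (
    df["Session_ID"] == df["Session_ID"].shift(1)
)

# shift(-1) moves values up, so each row is compared with the row below it
next_same = (df["clientId"] == df["clientId"].shift(-1)) & (
    df["Session_ID"] == df["Session_ID"].shift(-1)
)

print(prev_same.tolist())  # [False, True, False]
print(next_same.tolist())  # [True, False, False]
```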


Your Diff column has type Timedelta, and you would need to convert it to a string to get it into Excel.
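
As a minimal sketch of that conversion, using the two timestamps from the question's example (the variable names are only for illustration): subtracting two timestamps gives a Timedelta, and str() turns it into the text form shown in the tables.

```python
import pandas as pd

# Subtracting two timestamps yields a pandas Timedelta
diff = pd.Timestamp("2021-06-13 00:06") - pd.Timestamp("2021-06-12 23:59")

# str() gives the readable text form that can be stored in a string column
as_text = str(diff)

print(type(diff).__name__)   # Timedelta
print(as_text)               # 0 days 00:07:00
print(diff.total_seconds())  # 420.0
```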


I would create a column with the default value 0 days:

 df['Diff'] = '0 days'

and later go through every group and put the new value only in the last row of that group:

groups = df.groupby('clientId')

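for val, group in groups:
    # time span between the first and last row of this client
    diff = group['Time'].max() - group['Time'].min()

    # index label of the last row in this group
    last_index = group.index[-1]

    if diff.total_seconds() > 0:
        df.loc[last_index, 'Diff'] = str(diff)
    #else:
    #    df.loc[last_index, 'Diff'] = '0 days'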

Minimal working code:

text = '''Timestamp,clientId,Session_ID
6/12/2021 15:05,27255667,ab89
6/12/2021 19:56,118698247,684a
6/12/2021 23:59,99492237,a4fd
6/12/2021 23:59,99492237,a4fd
6/12/2021 23:59,99492237,a4fd
6/12/2021 23:59,99492237,a4fd
6/13/2021 00:06,99492237,a5fd
6/13/2021 00:41,142584462,c23000
6/13/2021 23:33,142584462,c23000
6/13/2021 23:33,142584462,c23000
6/13/2021 23:33,142584462,c23000
6/13/2021 23:34,142584462,c23000
6/13/2021 23:34,142584462,c23000
6/13/2021 23:34,142584462,7d97'''

import pandas as pd
import numpy as np
import io

# importing data from SampleData.csv
#df = pd.read_csv('data/SampleData.csv', converters={'clientId':str})
df = pd.read_csv(io.StringIO(text), converters={'clientId':str})
#print(df)

df["Time"] = pd.to_datetime(df["Timestamp"])
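# sort_values returns a new DataFrame, so assign the result back;
# sort on the parsed "Time" column rather than the raw string
df = df.sort_values(["Time", "clientId"], ascending=(True, True))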
df["SameSession?"] = np.where((df['clientId'] == df['clientId'].shift(1)) & (df['Session_ID'] == df['Session_ID'].shift(1)), "YES", "NO")

# create column with default value
df['Diff'] = '0 days'
#print(df)

groups = df.groupby('clientId')

for val, group in groups:
    diff = group['Time'].max() - group['Time'].min()

    last_index = group.index[-1]
    
    if diff.total_seconds() > 0:
         df.loc[last_index,'Diff'] = str(diff)
    #else:        
    #    df.loc[last_index,'Diff'] = '0 days'
        
print(df[['clientId', 'Diff']])

Result:

     clientId             Diff
0    27255667           0 days
1   118698247           0 days
2    99492237           0 days
3    99492237           0 days
4    99492237           0 days
5    99492237           0 days
6    99492237  0 days 00:07:00
7   142584462           0 days
8   142584462           0 days
9   142584462           0 days
10  142584462           0 days
11  142584462           0 days
12  142584462           0 days
13  142584462  0 days 22:53:00
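
The loop above takes max minus min over all rows of a client, so it only fits clients with at most two sessions. As an alternative sketch, assuming the wanted Diff is the time from the start of a client's previous session to the start of the new one (an interpretation that happens to reproduce the 00:07:00 and 22:53:00 in the expected output), the same thing can be done without a loop. The column name Gap, the helper names, and the shortened sample are mine; the sketch also assumes rows are in time order within each client.

```python
import io
import pandas as pd

# Shortened sample taken from the question's data
text = '''Timestamp,clientId,Session_ID
6/12/2021 23:59,99492237,a4fd
6/12/2021 23:59,99492237,a4fd
6/13/2021 00:06,99492237,a5fd
6/13/2021 00:41,142584462,c23000
6/13/2021 23:34,142584462,c23000
6/13/2021 23:34,142584462,7d97'''

df = pd.read_csv(io.StringIO(text), converters={"clientId": str})
df["Time"] = pd.to_datetime(df["Timestamp"])

# Previous Session_ID within the same client (missing on a client's first row)
prev_session = df.groupby("clientId", sort=False)["Session_ID"].shift(1)

# A row starts a session when its Session_ID differs from the client's previous row
is_start = df["Session_ID"] != prev_session
starts = df[is_start]

# Time between consecutive session starts of the same client
gap = starts.groupby("clientId", sort=False)["Time"].diff()

# Assignment aligns on the index, so rows that do not start a session get NaT
df["Gap"] = gap
print(df[["clientId", "Session_ID", "Gap"]])
```

Here Gap stays a Timedelta column with NaT where no new session starts; it can be converted to strings afterwards if the '0 days' text form is needed.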
