
How to assign a unique ID for different groups in pandas dataframe?

How can I assign unique IDs to groups in a pandas DataFrame based on certain conditions? For example, I have a DataFrame named df with the following structure, where Name identifies the user and Datetime identifies the date and time at which the user accesses a resource.

Name         Datetime 
Bob          26-04-2018 12:00:00 
Claire       26-04-2018 12:00:00 
Bob          26-04-2018 12:10:00 
Bob          26-04-2018 12:30:00 
Grace        27-04-2018 08:30:00 
Bob          27-04-2018 09:30:00 
Bob          27-04-2018 09:40:00 
Bob          27-04-2018 10:00:00 
Bob          27-04-2018 10:30:00 
Bob          27-04-2018 11:30:00

I would like to create sessions for the users such that consecutive accesses by the same user that are no more than 30 minutes apart are assigned to the same session. However, if a user is inactive for more than 30 minutes, that user's next access should start a new session.

My expected output is shown below.

For example, on 27-04-2018 Bob accessed the resource at 09:30, 09:40, 10:00, and 10:30. Each gap is at most 30 minutes, so all four accesses belong to Session 4. His next access, at 11:30, comes after more than 30 minutes of inactivity, so it is assigned to the next session.

Name         Datetime                    Id
Bob          26-04-2018 12:00:00          1
Claire       26-04-2018 12:00:00          2
Bob          26-04-2018 12:10:00          1
Bob          26-04-2018 12:30:00          1
Grace        27-04-2018 08:30:00          3
Bob          27-04-2018 09:30:00          4
Bob          27-04-2018 09:40:00          4
Bob          27-04-2018 10:00:00          4
Bob          27-04-2018 10:30:00          4
Bob          27-04-2018 11:30:00          5

Thank you for your help! Link to previous question: How to compare value of second column with same values of first column in pandas dataframe?
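To reproduce the answers below, the sample frame from the question can be built as in this minimal sketch. It assumes the Datetime strings are day-first (dd-mm-yyyy), as they appear in the question, and parses them with an explicit format so that the column has a datetime dtype. The answers' diff and Timedelta comparisons rely on that.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Bob", "Claire", "Bob", "Bob", "Grace",
             "Bob", "Bob", "Bob", "Bob", "Bob"],
    "Datetime": [
        "26-04-2018 12:00:00", "26-04-2018 12:00:00", "26-04-2018 12:10:00",
        "26-04-2018 12:30:00", "27-04-2018 08:30:00", "27-04-2018 09:30:00",
        "27-04-2018 09:40:00", "27-04-2018 10:00:00", "27-04-2018 10:30:00",
        "27-04-2018 11:30:00",
    ],
})

# Assumption: dates are day-first, so state the format explicitly
# instead of letting pandas guess.
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%d-%m-%Y %H:%M:%S")
```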

Sort by Name and Datetime, then find the time difference ('td') between successive actions of the same user. Taking the cumsum of a Boolean Series (True wherever the gap exceeds 30 minutes) forms groups of successive actions that each fall within 30 minutes of the previous one. ngroup then labels the groups.

The sort_index before the groupby can be removed if you don't care which label each group gets, but keeping it ensures the labels follow the original row order.

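import pandas as pd

# Sort so that each user's actions are adjacent and in time order
df = df.sort_values(['Name', 'Datetime'])
# Gap to the previous row, masked to NaT where the Name changes,
# so the diff is only calculated within the same Name
df['td'] = df.Datetime.diff().mask(df.Name.ne(df.Name.shift()))
# A gap of more than 30 minutes starts a new group; ngroup numbers the groups
df['Id'] = (df.sort_index()
              .groupby(['Name', df['td'].gt(pd.Timedelta('30min')).cumsum()], sort=False)
              .ngroup()+1)
df = df.sort_index()  # restore the original row order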

Output:

td left in for clarity

     Name            Datetime       td  Id
0     Bob 2018-04-26 12:00:00      NaT   1
1  Claire 2018-04-26 12:00:00      NaT   2
2     Bob 2018-04-26 12:10:00 00:10:00   1
3     Bob 2018-04-26 12:30:00 00:20:00   1
4   Grace 2018-04-27 08:30:00      NaT   3
5     Bob 2018-04-27 09:30:00 21:00:00   4
6     Bob 2018-04-27 09:40:00 00:10:00   4
7     Bob 2018-04-27 10:00:00 00:20:00   4
8     Bob 2018-04-27 10:30:00 00:30:00   4
9     Bob 2018-04-27 11:30:00 01:00:00   5
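To see why the cumulative sum produces group labels, here is a small sketch that looks at Bob's rows alone. The variable names (bob, new_session, group) are illustrative and not part of the answer. Each True marks a gap of more than 30 minutes, and the running total increases by one at exactly those rows, so every row within a session shares the same number.

```python
import pandas as pd

# Bob's access times, taken from the output above.
bob = pd.Series(pd.to_datetime([
    "2018-04-26 12:00", "2018-04-26 12:10", "2018-04-26 12:30",
    "2018-04-27 09:30", "2018-04-27 09:40", "2018-04-27 10:00",
    "2018-04-27 10:30", "2018-04-27 11:30",
]))

td = bob.diff()                              # first row has no predecessor: NaT
new_session = td.gt(pd.Timedelta("30min"))   # NaT compares as False
group = new_session.cumsum()                 # running count of session breaks

print(group.tolist())  # [0, 0, 0, 1, 1, 1, 1, 2]
```

Note that the comparison is strict: the 10:30 access is exactly 30 minutes after the previous one, so it stays in the same session, which matches the expected output in the question.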

Your explanation near the bottom is really helpful for understanding it.

You need to group by Name and a groupID (don't confuse this groupID with your final Id) and call ngroup to get Id. The main question is how to define this groupID. To create it, use sort_values to put the rows in ascending order of Name and Datetime, then group by Name and take the difference in Datetime between consecutive rows within each Name. Use gt to flag differences greater than 30 minutes and cumsum to turn those flags into the groupID. Finally, sort_index restores the original order, and the result is assigned to s as follows:

s = df.sort_values(['Name','Datetime']).groupby('Name').Datetime.diff() \
      .gt(pd.Timedelta(minutes=30)).cumsum().sort_index()

Next, group by Name and s with sort=False to preserve the original order, then call ngroup and add 1.

df['Id'] = df.groupby(['Name', s], sort=False).ngroup().add(1)

Output:
     Name            Datetime  Id
0     Bob 2018-04-26 12:00:00   1
1  Claire 2018-04-26 12:00:00   2
2     Bob 2018-04-26 12:10:00   1
3     Bob 2018-04-26 12:30:00   1
4   Grace 2018-04-27 08:30:00   3
5     Bob 2018-04-27 09:30:00   4
6     Bob 2018-04-27 09:40:00   4
7     Bob 2018-04-27 10:00:00   4
8     Bob 2018-04-27 10:30:00   4
9     Bob 2018-04-27 11:30:00   5
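The two steps above can be wrapped in a small helper, sketched here under some assumptions: the function name assign_session_ids and its parameters are hypothetical, the time column already has a datetime dtype, and the frame's index is unique. It uses reindex(df.index) rather than sort_index so that the grouping key lines up with the frame even if the index is not already sorted.

```python
import pandas as pd

def assign_session_ids(df, gap="30min", name_col="Name", time_col="Datetime"):
    """Return 1-based session IDs, numbered in order of first appearance."""
    ordered = df.sort_values([name_col, time_col])
    # Gap to the same user's previous access (NaT for each user's first row)
    gaps = ordered.groupby(name_col)[time_col].diff()
    # Running count of gaps longer than `gap`, realigned to the original rows
    block = gaps.gt(pd.Timedelta(gap)).cumsum().reindex(df.index)
    return df.groupby([name_col, block], sort=False).ngroup().add(1)

# Usage with the sample data from the question
df = pd.DataFrame({
    "Name": ["Bob", "Claire", "Bob", "Bob", "Grace",
             "Bob", "Bob", "Bob", "Bob", "Bob"],
    "Datetime": pd.to_datetime([
        "2018-04-26 12:00", "2018-04-26 12:00", "2018-04-26 12:10",
        "2018-04-26 12:30", "2018-04-27 08:30", "2018-04-27 09:30",
        "2018-04-27 09:40", "2018-04-27 10:00", "2018-04-27 10:30",
        "2018-04-27 11:30",
    ]),
})
ids = assign_session_ids(df)
print(ids.tolist())  # [1, 2, 1, 1, 3, 4, 4, 4, 4, 5]
```

Keeping the threshold as a parameter makes it easy to try a different inactivity window without touching the grouping logic.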

