Find whether an event (with a start & end time) goes beyond a certain time (e.g. 6pm) in dataframe using Python (pandas, datetime)

Question

I'm creating a Python program using pandas and datetime libraries that will calculate the pay from my casual job each week, so I can cross reference my bank statement instead of looking through payslips. The data that I am analysing is from the Google Calendar API that is synced with my work schedule. It prints the events in that particular calendar to a csv file in this format:

	Start	End	Title	Hours
0	02.12.2020 07:00	02.12.2020 16:00	Shift	9.0
1	04.12.2020 18:00	04.12.2020 21:00	Shift	3.0
2	05.12.2020 07:00	05.12.2020 12:00	Shift	5.0
3	06.12.2020 09:00	06.12.2020 18:00	Shift	9.0
4	07.12.2020 19:00	07.12.2020 23:00	Shift	4.0
5	08.12.2020 19:00	08.12.2020 23:00	Shift	4.0
6	09.12.2020 10:00	09.12.2020 15:00	Shift	5.0

As I am a casual at this job, I have to take a few things into consideration, like penalty rates (the base rate, after 6pm on Monday - Friday, Saturday, and Sunday all have different rates). I'm wondering if I can analyse this CSV using datetime and calculate how many hours are before 6pm and how many are after 6pm. So, using the shift below as an example, the output would look like the second table:

	Start	End	Title	Hours
1	04.12.2020 15:00	04.12.2020 21:00	Shift	6.0

	Start	End	Title	Total Hours	Hours before 6pm	Hours after 6pm
1	04.12.2020 15:00	04.12.2020 21:00	Shift	6.0	3.0	3.0

I can use this to get the day of the week but I'm just not sure how to analyse certain bits of time for penalty rates:


df['day_of_week'] = df['Start'].dt.day_name()

I appreciate any help in Python, or even in other coding languages/techniques this can be applied to :)

Edit: This is how my dataframe is looking at the moment

	Start	End	Title	Hours	day_of_week	Pay	week_of_year
0	2020-12-02 07:00:00	2020-12-02 16:00:00	Shift	9.0	Wednesday	337.30	49

Edit: In response to David Erickson's comment:

	value	variable	bool
0	2020-12-02 07:00:00	Start	False
1	2020-12-02 08:00:00	Start	False
2	2020-12-02 09:00:00	Start	False
3	2020-12-02 10:00:00	Start	False
4	2020-12-02 11:00:00	Start	False
5	2020-12-02 12:00:00	Start	False
6	2020-12-02 13:00:00	Start	False
7	2020-12-02 14:00:00	Start	False
8	2020-12-02 15:00:00	Start	False
9	2020-12-02 16:00:00	End	False
10	2020-12-04 18:00:00	Start	False
11	2020-12-04 19:00:00	Start	True
12	2020-12-04 20:00:00	Start	True
13	2020-12-04 21:00:00	End	True
14	2020-12-05 07:00:00	Start	False
15	2020-12-05 08:00:00	Start	False
16	2020-12-05 09:00:00	Start	False
17	2020-12-05 10:00:00	Start	False
18	2020-12-05 11:00:00	Start	False
19	2020-12-05 12:00:00	End	False
20	2020-12-06 09:00:00	Start	False
21	2020-12-06 10:00:00	Start	False
22	2020-12-06 11:00:00	Start	False
23	2020-12-06 12:00:00	Start	False
24	2020-12-06 13:00:00	Start	False
25	2020-12-06 14:00:00	Start	False
26	2020-12-06 15:00:00	Start	False
27	2020-12-06 16:00:00	Start	False
28	2020-12-06 17:00:00	Start	False
29	2020-12-06 18:00:00	End	False
30	2020-12-07 19:00:00	Start	False
31	2020-12-07 20:00:00	Start	True
32	2020-12-07 21:00:00	Start	True
33	2020-12-07 22:00:00	Start	True
34	2020-12-07 23:00:00	End	True
35	2020-12-08 19:00:00	Start	False
36	2020-12-08 20:00:00	Start	True
37	2020-12-08 21:00:00	Start	True
38	2020-12-08 22:00:00	Start	True
39	2020-12-08 23:00:00	End	True
40	2020-12-09 10:00:00	Start	False
41	2020-12-09 11:00:00	Start	False
42	2020-12-09 12:00:00	Start	False
43	2020-12-09 13:00:00	Start	False
44	2020-12-09 14:00:00	Start	False
45	2020-12-09 15:00:00	End	False
46	2020-12-11 19:00:00	Start	False
47	2020-12-11 20:00:00	Start	True
48	2020-12-11 21:00:00	Start	True
49	2020-12-11 22:00:00	Start	True

Answer 1

UPDATE: (2020-12-19)

I have simply filtered out the Start rows, as you were correct that an extra row was being calculated. Also, I passed dayfirst=True to pd.to_datetime() to convert the dates correctly. I have also cleaned up the output with some extra columns.

import numpy as np
import pandas as pd

higher_pay = 40
lower_pay = 30

df['Start'], df['End'] = pd.to_datetime(df['Start'], dayfirst=True), pd.to_datetime(df['End'], dayfirst=True)
start = df['Start']
df1 = df[['Start', 'End']].melt(value_name='Date').set_index('Date')
s = df1.groupby('variable').cumcount()
df1 = df1.groupby(s, group_keys=False).resample('1H').asfreq().join(s.rename('Shift').to_frame()).ffill().reset_index()
df1 = df1[~df1['Date'].isin(start)]
df1['Day'] = df1['Date'].dt.day_name()
df1['Week'] = df1['Date'].dt.isocalendar().week
m = (df1['Date'].dt.hour > 18) | (df1['Day'].isin(['Saturday', 'Sunday']))
df1['Higher Pay Hours'] = np.where(m, 1, 0)
df1['Lower Pay Hours'] = np.where(m, 0, 1)
df1['Pay'] = np.where(m, higher_pay, lower_pay)
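# Sum only the numeric columns; newer pandas versions raise an error when asked to sum the 'Date' column
df1 = df1.groupby(['Shift', 'Day', 'Week'])[['Higher Pay Hours', 'Lower Pay Hours', 'Pay']].sum().reset_index()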
df2 = df.merge(df1, how='left', left_index=True, right_on='Shift').drop('Shift', axis=1)
df2
Out[1]: 
                Start                 End  Title  Hours        Day  Week  \
0 2020-12-02 07:00:00 2020-12-02 16:00:00  Shift    9.0  Wednesday    49   
1 2020-12-04 18:00:00 2020-12-04 21:00:00  Shift    3.0     Friday    49   
2 2020-12-05 07:00:00 2020-12-05 12:00:00  Shift    5.0   Saturday    49   
3 2020-12-06 09:00:00 2020-12-06 18:00:00  Shift    9.0     Sunday    49   
4 2020-12-07 19:00:00 2020-12-07 23:00:00  Shift    4.0     Monday    50   
5 2020-12-08 19:00:00 2020-12-08 23:00:00  Shift    4.0    Tuesday    50   
6 2020-12-09 10:00:00 2020-12-09 15:00:00  Shift    5.0  Wednesday    50   

   Higher Pay Hours  Lower Pay Hours  Pay  
0                 0                9  270  
1                 3                0  120  
2                 5                0  200  
3                 9                0  360  
4                 4                0  160  
5                 4                0  160  
6                 0                5  150
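
On the dayfirst=True fix mentioned in the update: the CSV stores dates as day.month.year, and without a hint pandas reads an ambiguous string such as 02.12.2020 month-first. A minimal sketch of the difference, using one timestamp from the question (the variable names are only illustrative):

```python
import pandas as pd

raw = "02.12.2020 07:00"  # 2 December 2020 in the CSV's day.month.year layout

# Without a hint, the ambiguous string is read month-first (12 February).
month_first = pd.to_datetime(raw)

# dayfirst=True tells the parser that the first number is the day.
day_first = pd.to_datetime(raw, dayfirst=True)

# An explicit format removes the ambiguity entirely.
explicit = pd.to_datetime(raw, format="%d.%m.%Y %H:%M")

print(month_first.date(), day_first.date(), explicit.date())
```

If the layout of the export is known to be fixed, the explicit format is probably the safer of the two options, since it fails loudly on a malformed row instead of guessing.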

There are probably more concise ways to do this, but I thought resampling the dataframe and then counting the hours would be a clean approach. You can melt the dataframe to put Start and End in the same column, then fill in the gap hours with resample, making sure to group by the 'Start' and 'End' values that were initially on the same row. The easiest way to figure out which rows were initially together is to take the cumulative count (cumcount) of the values in the new dataframe grouped by 'Start' and 'End'. I'll show you how this works later in the answer.
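
The idea described above (one row per worked hour, then classify each row) can also be sketched without resample. This is not the code from this answer; it is a rough equivalent that builds the hourly rows with pd.date_range, and it assumes whole-hour shifts that do not cross midnight. The column names Shift, HourStart, Higher and Lower are made up for the illustration, and each row is labelled by the start of its hour, so the evening test is >= 18 rather than the > 18 used on hour ends.

```python
import pandas as pd

# Three shifts taken from the question's data (already parsed day-first).
df = pd.DataFrame({
    "Start": pd.to_datetime(["2020-12-02 07:00", "2020-12-04 18:00", "2020-12-05 07:00"]),
    "End": pd.to_datetime(["2020-12-02 16:00", "2020-12-04 21:00", "2020-12-05 12:00"]),
})

# One (shift number, hour start) pair per worked hour; the last
# date_range entry is the shift end itself, so it is dropped.
rows = [
    (shift, hour_start)
    for shift, (start, end) in enumerate(zip(df["Start"], df["End"]))
    for hour_start in pd.date_range(start, end, freq="h")[:-1]
]
hourly = pd.DataFrame(rows, columns=["Shift", "HourStart"])

# Higher rate: the hour starts at 18:00 or later, or falls on a weekend.
evening = hourly["HourStart"].dt.hour >= 18
weekend = hourly["HourStart"].dt.dayofweek >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
hourly["Higher"] = (evening | weekend).astype(int)
hourly["Lower"] = 1 - hourly["Higher"]

# Count the hours of each kind per shift.
summary = hourly.groupby("Shift")[["Higher", "Lower"]].sum()
print(summary)
```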

Full Code:

df['Start'], df['End'] = pd.to_datetime(df['Start']), pd.to_datetime(df['End'])
df = df[['Start', 'End']].melt().set_index('value')
df = df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26

Adding some more details...

Step 1: Melt the dataframe: You only need two columns 'Start' and 'End' to get your desired output

df = df[['Start', 'End']].melt().set_index('value')
df
Out[1]: 
                    variable
value                       
2020-02-12 07:00:00    Start
2020-04-12 18:00:00    Start
2020-05-12 07:00:00    Start
2020-06-12 09:00:00    Start
2020-07-12 19:00:00    Start
2020-08-12 19:00:00    Start
2020-09-12 10:00:00    Start
2020-02-12 16:00:00      End
2020-04-12 21:00:00      End
2020-05-12 12:00:00      End
2020-06-12 18:00:00      End
2020-07-12 23:00:00      End
2020-08-12 23:00:00      End
2020-09-12 15:00:00      End

Step 2: Create groups in preparation for the resample. As you can see, group numbers 0-6 line up with each other, pairing each 'Start' with the 'End' that was originally on the same row.

df.groupby('variable').cumcount()
Out[2]: 
value
2020-02-12 07:00:00    0
2020-04-12 18:00:00    1
2020-05-12 07:00:00    2
2020-06-12 09:00:00    3
2020-07-12 19:00:00    4
2020-08-12 19:00:00    5
2020-09-12 10:00:00    6
2020-02-12 16:00:00    0
2020-04-12 21:00:00    1
2020-05-12 12:00:00    2
2020-06-12 18:00:00    3
2020-07-12 23:00:00    4
2020-08-12 23:00:00    5
2020-09-12 15:00:00    6

Step 3: Resample the data per group by hour to fill in the gaps for each group:

df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
Out[3]: 
                 value variable
0  2020-02-12 07:00:00    Start
1  2020-02-12 08:00:00    Start
2  2020-02-12 09:00:00    Start
3  2020-02-12 10:00:00    Start
4  2020-02-12 11:00:00    Start
5  2020-02-12 12:00:00    Start
6  2020-02-12 13:00:00    Start
7  2020-02-12 14:00:00    Start
8  2020-02-12 15:00:00    Start
9  2020-02-12 16:00:00      End
10 2020-04-12 18:00:00    Start
11 2020-04-12 19:00:00    Start
12 2020-04-12 20:00:00    Start
13 2020-04-12 21:00:00      End
14 2020-05-12 07:00:00    Start
15 2020-05-12 08:00:00    Start
16 2020-05-12 09:00:00    Start
17 2020-05-12 10:00:00    Start
18 2020-05-12 11:00:00    Start
19 2020-05-12 12:00:00      End
20 2020-06-12 09:00:00    Start
21 2020-06-12 10:00:00    Start
22 2020-06-12 11:00:00    Start
23 2020-06-12 12:00:00    Start
24 2020-06-12 13:00:00    Start
25 2020-06-12 14:00:00    Start
26 2020-06-12 15:00:00    Start
27 2020-06-12 16:00:00    Start
28 2020-06-12 17:00:00    Start
29 2020-06-12 18:00:00      End
30 2020-07-12 19:00:00    Start
31 2020-07-12 20:00:00    Start
32 2020-07-12 21:00:00    Start
33 2020-07-12 22:00:00    Start
34 2020-07-12 23:00:00      End
35 2020-08-12 19:00:00    Start
36 2020-08-12 20:00:00    Start
37 2020-08-12 21:00:00    Start
38 2020-08-12 22:00:00    Start
39 2020-08-12 23:00:00      End
40 2020-09-12 10:00:00    Start
41 2020-09-12 11:00:00    Start
42 2020-09-12 12:00:00    Start
43 2020-09-12 13:00:00    Start
44 2020-09-12 14:00:00    Start
45 2020-09-12 15:00:00      End

Step 4: From there, you can calculate the boolean series I have called m. True values mark the rows that meet the conditions for the "Higher Rate".

m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
m
Out[4]: 
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11     True
12     True
13     True
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30     True
31     True
32     True
33     True
34     True
35     True
36     True
37     True
38     True
39     True
40     True
41     True
42     True
43     True
44     True
45     True

Step 5: Filter the dataframe by True or False to count total hours for the Normal Rate and Higher Rate and print values.

print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
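
The answer notes that there are probably more concise ways to do this. One possibility, sketched here as an alternative and not taken from the answer, skips the hourly rows and splits each shift at the cutoff with plain timestamp arithmetic. It assumes every shift starts and ends on the same calendar day; split_hours and its cutoff_hour parameter are hypothetical names.

```python
import pandas as pd

def split_hours(df, cutoff_hour=18):
    """Return (hours before cutoff, hours after cutoff) for each shift."""
    # The cutoff time on the day each shift starts, e.g. 18:00.
    cutoff = df["Start"].dt.normalize() + pd.Timedelta(hours=cutoff_hour)
    # Clamp the cutoff into [Start, End]: min(cutoff, End), then max(..., Start).
    split = cutoff.where(cutoff < df["End"], df["End"])
    split = split.where(split > df["Start"], df["Start"])
    before = (split - df["Start"]).dt.total_seconds() / 3600
    after = (df["End"] - split).dt.total_seconds() / 3600
    return before, after

# The question's example shift plus two rows from its data.
shifts = pd.DataFrame({
    "Start": ["04.12.2020 15:00", "02.12.2020 07:00", "07.12.2020 19:00"],
    "End": ["04.12.2020 21:00", "02.12.2020 16:00", "07.12.2020 23:00"],
})
for col in ("Start", "End"):
    shifts[col] = pd.to_datetime(shifts[col], format="%d.%m.%Y %H:%M")

shifts["Hours before 6pm"], shifts["Hours after 6pm"] = split_hours(shifts)
print(shifts)
```

Unlike the hourly-row approach, this also handles shifts that start or end on a half hour. The split only matters for Monday to Friday; for Saturday and Sunday shifts the day name from df['Start'].dt.day_name() would decide the rate for the whole shift.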
