I'm creating a Python program using pandas and datetime that calculates the pay from my casual job each week, so I can cross-reference my bank statement instead of digging through payslips. The data I'm analysing comes from the Google Calendar API, synced with my work schedule, which prints the events in that calendar to a CSV file in this format:
| | Start | End | Title | Hours |
|---|---|---|---|---|
0 | 02.12.2020 07:00 | 02.12.2020 16:00 | Shift | 9.0 |
1 | 04.12.2020 18:00 | 04.12.2020 21:00 | Shift | 3.0 |
2 | 05.12.2020 07:00 | 05.12.2020 12:00 | Shift | 5.0 |
3 | 06.12.2020 09:00 | 06.12.2020 18:00 | Shift | 9.0 |
4 | 07.12.2020 19:00 | 07.12.2020 23:00 | Shift | 4.0 |
5 | 08.12.2020 19:00 | 08.12.2020 23:00 | Shift | 4.0 |
6 | 09.12.2020 10:00 | 09.12.2020 15:00 | Shift | 5.0 |
As a casual at this job I have to take a few things into consideration, like penalty rates: the base rate, plus different rates after 6pm Monday to Friday, on Saturday, and on Sunday. I'm wondering if I can analyse this CSV using datetime and calculate how many hours fall before 6pm and how many after. Using this shift as an example, the output would look like:
| | Start | End | Title | Hours |
|---|---|---|---|---|
1 | 04.12.2020 15:00 | 04.12.2020 21:00 | Shift | 6.0 |
| | Start | End | Title | Total Hours | Hours before 6pm | Hours after 6pm |
|---|---|---|---|---|---|---|
1 | 04.12.2020 15:00 | 04.12.2020 21:00 | Shift | 6.0 | 3.0 | 3.0 |
I can use the following to get the day of the week, but I'm just not sure how to analyse particular windows of time for penalty rates:
df['day_of_week'] = df['Start'].dt.day_name()
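For a single shift, the before/after-6pm split can be sketched with plain datetime arithmetic. This is only a minimal illustration; the function name and the fixed 6pm cutoff are my own assumptions, not anything from the calendar data:

```python
from datetime import datetime

def split_at_cutoff(start, end, cutoff_hour=18):
    # Clamp a hypothetical 6 pm cutoff into the shift interval,
    # then measure the hours that fall on each side of it.
    cutoff = start.replace(hour=cutoff_hour, minute=0, second=0, microsecond=0)
    before = max((min(end, cutoff) - start).total_seconds(), 0) / 3600
    after = max((end - max(start, cutoff)).total_seconds(), 0) / 3600
    return before, after

# The 15:00-21:00 example shift from above: 3 hours before 6 pm, 3 after.
print(split_at_cutoff(datetime(2020, 12, 4, 15), datetime(2020, 12, 4, 21)))  # (3.0, 3.0)
```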
I appreciate any help in Python, or even other coding languages/techniques this can be applied to. :)
Edit: This is how my dataframe looks at the moment:
| | Start | End | Title | Hours | day_of_week | Pay | week_of_year |
|---|---|---|---|---|---|---|---|
0 | 2020-12-02 07:00:00 | 2020-12-02 16:00:00 | Shift | 9.0 | Wednesday | 337.30 | 49 |
EDIT: In response to David Erickson's comment.
| | value | variable | bool |
|---|---|---|---|
0 | 2020-12-02 07:00:00 | Start | False |
1 | 2020-12-02 08:00:00 | Start | False |
2 | 2020-12-02 09:00:00 | Start | False |
3 | 2020-12-02 10:00:00 | Start | False |
4 | 2020-12-02 11:00:00 | Start | False |
5 | 2020-12-02 12:00:00 | Start | False |
6 | 2020-12-02 13:00:00 | Start | False |
7 | 2020-12-02 14:00:00 | Start | False |
8 | 2020-12-02 15:00:00 | Start | False |
9 | 2020-12-02 16:00:00 | End | False |
10 | 2020-12-04 18:00:00 | Start | False |
11 | 2020-12-04 19:00:00 | Start | True |
12 | 2020-12-04 20:00:00 | Start | True |
13 | 2020-12-04 21:00:00 | End | True |
14 | 2020-12-05 07:00:00 | Start | False |
15 | 2020-12-05 08:00:00 | Start | False |
16 | 2020-12-05 09:00:00 | Start | False |
17 | 2020-12-05 10:00:00 | Start | False |
18 | 2020-12-05 11:00:00 | Start | False |
19 | 2020-12-05 12:00:00 | End | False |
20 | 2020-12-06 09:00:00 | Start | False |
21 | 2020-12-06 10:00:00 | Start | False |
22 | 2020-12-06 11:00:00 | Start | False |
23 | 2020-12-06 12:00:00 | Start | False |
24 | 2020-12-06 13:00:00 | Start | False |
25 | 2020-12-06 14:00:00 | Start | False |
26 | 2020-12-06 15:00:00 | Start | False |
27 | 2020-12-06 16:00:00 | Start | False |
28 | 2020-12-06 17:00:00 | Start | False |
29 | 2020-12-06 18:00:00 | End | False |
30 | 2020-12-07 19:00:00 | Start | False |
31 | 2020-12-07 20:00:00 | Start | True |
32 | 2020-12-07 21:00:00 | Start | True |
33 | 2020-12-07 22:00:00 | Start | True |
34 | 2020-12-07 23:00:00 | End | True |
35 | 2020-12-08 19:00:00 | Start | False |
36 | 2020-12-08 20:00:00 | Start | True |
37 | 2020-12-08 21:00:00 | Start | True |
38 | 2020-12-08 22:00:00 | Start | True |
39 | 2020-12-08 23:00:00 | End | True |
40 | 2020-12-09 10:00:00 | Start | False |
41 | 2020-12-09 11:00:00 | Start | False |
42 | 2020-12-09 12:00:00 | Start | False |
43 | 2020-12-09 13:00:00 | Start | False |
44 | 2020-12-09 14:00:00 | Start | False |
45 | 2020-12-09 15:00:00 | End | False |
46 | 2020-12-11 19:00:00 | Start | False |
47 | 2020-12-11 20:00:00 | Start | True |
48 | 2020-12-11 21:00:00 | Start | True |
49 | 2020-12-11 22:00:00 | Start | True |
UPDATE (2020-12-19):
I have simply filtered out the `Start` rows, as you were correct that an extra row was being calculated. I also passed `dayfirst=True` to `pd.to_datetime()` to convert the dates correctly, and cleaned up the output with some extra columns.
import pandas as pd
import numpy as np

higher_pay = 40
lower_pay = 30
df['Start'], df['End'] = pd.to_datetime(df['Start'], dayfirst=True), pd.to_datetime(df['End'], dayfirst=True)
start = df['Start']
df1 = df[['Start', 'End']].melt(value_name='Date').set_index('Date')
s = df1.groupby('variable').cumcount()
df1 = df1.groupby(s, group_keys=False).resample('1H').asfreq().join(s.rename('Shift').to_frame()).ffill().reset_index()
df1 = df1[~df1['Date'].isin(start)]
df1['Day'] = df1['Date'].dt.day_name()
df1['Week'] = df1['Date'].dt.isocalendar().week
m = (df1['Date'].dt.hour > 18) | (df1['Day'].isin(['Saturday', 'Sunday']))
df1['Higher Pay Hours'] = np.where(m, 1, 0)
df1['Lower Pay Hours'] = np.where(m, 0, 1)
df1['Pay'] = np.where(m, higher_pay, lower_pay)
df1 = df1.groupby(['Shift', 'Day', 'Week']).sum().reset_index()
df2 = df.merge(df1, how='left', left_index=True, right_on='Shift').drop('Shift', axis=1)
df2
Out[1]:
Start End Title Hours Day Week \
0 2020-12-02 07:00:00 2020-12-02 16:00:00 Shift 9.0 Wednesday 49
1 2020-12-04 18:00:00 2020-12-04 21:00:00 Shift 3.0 Friday 49
2 2020-12-05 07:00:00 2020-12-05 12:00:00 Shift 5.0 Saturday 49
3 2020-12-06 09:00:00 2020-12-06 18:00:00 Shift 9.0 Sunday 49
4 2020-12-07 19:00:00 2020-12-07 23:00:00 Shift 4.0 Monday 50
5 2020-12-08 19:00:00 2020-12-08 23:00:00 Shift 4.0 Tuesday 50
6 2020-12-09 10:00:00 2020-12-09 15:00:00 Shift 5.0 Wednesday 50
Higher Pay Hours Lower Pay Hours Pay
0 0 9 270
1 3 0 120
2 5 0 200
3 9 0 360
4 4 0 160
5 4 0 160
6 0 5 150
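The effect of `dayfirst=True` mentioned above can be sanity-checked in isolation; a small sketch (my own example, using one of the dates from the data):

```python
import pandas as pd

# Without dayfirst=True, "04.12.2020" is parsed month-first as April 12 (a Sunday);
# with dayfirst=True it is parsed as 4 December 2020 (a Friday).
wrong = pd.to_datetime("04.12.2020 18:00")
right = pd.to_datetime("04.12.2020 18:00", dayfirst=True)
print(wrong.day_name())  # Sunday
print(right.day_name())  # Friday
```

This matters here because the weekend check (`Saturday`/`Sunday`) would otherwise fire on the wrong shifts.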
There are probably more concise ways to do this, but I thought resampling the dataframe and then counting the hours would be a clean approach. You can `melt` the dataframe so that `Start` and `End` are in the same column, and fill in the gap hours with `resample`, making sure to `groupby` the 'Start' and 'End' values that were initially on the same row. The easiest way to figure out which rows were initially together is to get the cumulative count with `cumcount` of the values in the new dataframe grouped by 'Start' and 'End'. I'll show how this works later in the answer.
Full Code:
import pandas as pd

df['Start'], df['End'] = pd.to_datetime(df['Start']), pd.to_datetime(df['End'])
df = df[['Start', 'End']].melt().set_index('value')
df = df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
Adding some more details...
Step 1: Melt the dataframe: you only need the two columns 'Start' and 'End' to get the desired output
df = df[['Start', 'End']].melt().set_index('value')
df
Out[1]:
variable
value
2020-02-12 07:00:00 Start
2020-04-12 18:00:00 Start
2020-05-12 07:00:00 Start
2020-06-12 09:00:00 Start
2020-07-12 19:00:00 Start
2020-08-12 19:00:00 Start
2020-09-12 10:00:00 Start
2020-02-12 16:00:00 End
2020-04-12 21:00:00 End
2020-05-12 12:00:00 End
2020-06-12 18:00:00 End
2020-07-12 23:00:00 End
2020-08-12 23:00:00 End
2020-09-12 15:00:00 End
Step 2: Create groups in preparation for resampling: as you can see, groups 0-6 line up with each other, pairing the 'Start' and 'End' values that were together previously
df.groupby('variable').cumcount()
Out[2]:
value
2020-02-12 07:00:00 0
2020-04-12 18:00:00 1
2020-05-12 07:00:00 2
2020-06-12 09:00:00 3
2020-07-12 19:00:00 4
2020-08-12 19:00:00 5
2020-09-12 10:00:00 6
2020-02-12 16:00:00 0
2020-04-12 21:00:00 1
2020-05-12 12:00:00 2
2020-06-12 18:00:00 3
2020-07-12 23:00:00 4
2020-08-12 23:00:00 5
2020-09-12 15:00:00 6
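To see the pairing logic on its own, here is a toy frame of my own (not the question's data) run through the same `cumcount` call:

```python
import pandas as pd

# Two 'Start' rows and two 'End' rows: within each label, cumcount
# numbers the rows 0, 1, ... so equal counts mark the original pairs.
toy = pd.DataFrame({"variable": ["Start", "Start", "End", "End"]})
print(toy.groupby("variable").cumcount().tolist())  # [0, 1, 0, 1]
```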
Step 3: Resample the data per group by hour to fill in the gaps for each group:
df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
Out[3]:
value variable
0 2020-02-12 07:00:00 Start
1 2020-02-12 08:00:00 Start
2 2020-02-12 09:00:00 Start
3 2020-02-12 10:00:00 Start
4 2020-02-12 11:00:00 Start
5 2020-02-12 12:00:00 Start
6 2020-02-12 13:00:00 Start
7 2020-02-12 14:00:00 Start
8 2020-02-12 15:00:00 Start
9 2020-02-12 16:00:00 End
10 2020-04-12 18:00:00 Start
11 2020-04-12 19:00:00 Start
12 2020-04-12 20:00:00 Start
13 2020-04-12 21:00:00 End
14 2020-05-12 07:00:00 Start
15 2020-05-12 08:00:00 Start
16 2020-05-12 09:00:00 Start
17 2020-05-12 10:00:00 Start
18 2020-05-12 11:00:00 Start
19 2020-05-12 12:00:00 End
20 2020-06-12 09:00:00 Start
21 2020-06-12 10:00:00 Start
22 2020-06-12 11:00:00 Start
23 2020-06-12 12:00:00 Start
24 2020-06-12 13:00:00 Start
25 2020-06-12 14:00:00 Start
26 2020-06-12 15:00:00 Start
27 2020-06-12 16:00:00 Start
28 2020-06-12 17:00:00 Start
29 2020-06-12 18:00:00 End
30 2020-07-12 19:00:00 Start
31 2020-07-12 20:00:00 Start
32 2020-07-12 21:00:00 Start
33 2020-07-12 22:00:00 Start
34 2020-07-12 23:00:00 End
35 2020-08-12 19:00:00 Start
36 2020-08-12 20:00:00 Start
37 2020-08-12 21:00:00 Start
38 2020-08-12 22:00:00 Start
39 2020-08-12 23:00:00 End
40 2020-09-12 10:00:00 Start
41 2020-09-12 11:00:00 Start
42 2020-09-12 12:00:00 Start
43 2020-09-12 13:00:00 Start
44 2020-09-12 14:00:00 Start
45 2020-09-12 15:00:00 End
Step 4: From there, you can calculate the boolean series I have called `m`. True values represent the conditions met for the "Higher Rate".
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
m
Out[4]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 True
31 True
32 True
33 True
34 True
35 True
36 True
37 True
38 True
39 True
40 True
41 True
42 True
43 True
44 True
45 True
Step 5: Filter the dataframe by `True` or `False` to count the total hours at the Normal Rate and the Higher Rate, and print the values.
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
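As a possible alternative to resampling (my own sketch, not part of the original answer): each shift can be clipped against the 6 pm cutoff directly, one row per shift. The column names, the fixed 18:00 cutoff, and the all-weekend-is-higher-rate rule are assumptions:

```python
import pandas as pd

# Two shifts from the question's data, parsed day-first.
shifts = pd.DataFrame({
    "Start": pd.to_datetime(["02.12.2020 07:00", "04.12.2020 18:00"], dayfirst=True),
    "End":   pd.to_datetime(["02.12.2020 16:00", "04.12.2020 21:00"], dayfirst=True),
})
cutoff = shifts["Start"].dt.normalize() + pd.Timedelta(hours=18)  # 6 pm on the shift's day
weekend = shifts["Start"].dt.dayofweek >= 5                        # Saturday=5, Sunday=6
hours = (shifts["End"] - shifts["Start"]).dt.total_seconds() / 3600
# Clamp the interval at the cutoff, then measure each side (negative spans -> 0).
capped_end = shifts["End"].where(shifts["End"] < cutoff, cutoff)
floored_start = shifts["Start"].where(shifts["Start"] > cutoff, cutoff)
before = (capped_end - shifts["Start"]).dt.total_seconds().clip(lower=0) / 3600
after = (shifts["End"] - floored_start).dt.total_seconds().clip(lower=0) / 3600
shifts["Higher"] = after.where(~weekend, hours)  # weekend shifts count entirely as higher rate
shifts["Lower"] = before.where(~weekend, 0.0)
print(shifts[["Higher", "Lower"]])
```

This avoids generating one row per hour, at the cost of handling the weekend rule explicitly.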