Find whether an event (with a start & end time) goes beyond a certain time (e.g. 6pm) in dataframe using Python (pandas, datetime)

I'm creating a Python program using pandas and datetime libraries that will calculate the pay from my casual job each week, so I can cross reference my bank statement instead of looking through payslips. The data that I am analysing is from the Google Calendar API that is synced with my work schedule. It prints the events in that particular calendar to a csv file in this format:

              Start               End  Title  Hours
0  02.12.2020 07:00  02.12.2020 16:00  Shift    9.0
1  04.12.2020 18:00  04.12.2020 21:00  Shift    3.0
2  05.12.2020 07:00  05.12.2020 12:00  Shift    5.0
3  06.12.2020 09:00  06.12.2020 18:00  Shift    9.0
4  07.12.2020 19:00  07.12.2020 23:00  Shift    4.0
5  08.12.2020 19:00  08.12.2020 23:00  Shift    4.0
6  09.12.2020 10:00  09.12.2020 15:00  Shift    5.0

As I am a casual at this job, I have to take a few things into consideration, like penalty rates (the base rate, after 6pm on Monday to Friday, Saturday, and Sunday all have different rates). I'm wondering if I can analyse this CSV using datetime and calculate how many hours fall before 6pm and how many after 6pm. For example, given the shift below as input, the output would look like this:

              Start               End  Title  Hours
1  04.12.2020 15:00  04.12.2020 21:00  Shift    6.0

              Start               End  Title  Total Hours  Hours before 6pm  Hours after 6pm
1  04.12.2020 15:00  04.12.2020 21:00  Shift          6.0               3.0              3.0

I can use this to get the day of the week but I'm just not sure how to analyse certain bits of time for penalty rates:


df['day_of_week'] = df['Start'].dt.day_name()

I appreciate any help in Python or even other coding languages/techniques this can be applied to:)

Edit: This is how my dataframe is looking at the moment

                Start                 End  Title  Hours day_of_week     Pay  week_of_year
0 2020-12-02 07:00:00 2020-12-02 16:00:00  Shift    9.0   Wednesday  337.30            49

Edit: In response to David Erickson's comment:

value variable bool
0 2020-12-02 07:00:00 Start False
1 2020-12-02 08:00:00 Start False
2 2020-12-02 09:00:00 Start False
3 2020-12-02 10:00:00 Start False
4 2020-12-02 11:00:00 Start False
5 2020-12-02 12:00:00 Start False
6 2020-12-02 13:00:00 Start False
7 2020-12-02 14:00:00 Start False
8 2020-12-02 15:00:00 Start False
9 2020-12-02 16:00:00 End False
10 2020-12-04 18:00:00 Start False
11 2020-12-04 19:00:00 Start True
12 2020-12-04 20:00:00 Start True
13 2020-12-04 21:00:00 End True
14 2020-12-05 07:00:00 Start False
15 2020-12-05 08:00:00 Start False
16 2020-12-05 09:00:00 Start False
17 2020-12-05 10:00:00 Start False
18 2020-12-05 11:00:00 Start False
19 2020-12-05 12:00:00 End False
20 2020-12-06 09:00:00 Start False
21 2020-12-06 10:00:00 Start False
22 2020-12-06 11:00:00 Start False
23 2020-12-06 12:00:00 Start False
24 2020-12-06 13:00:00 Start False
25 2020-12-06 14:00:00 Start False
26 2020-12-06 15:00:00 Start False
27 2020-12-06 16:00:00 Start False
28 2020-12-06 17:00:00 Start False
29 2020-12-06 18:00:00 End False
30 2020-12-07 19:00:00 Start False
31 2020-12-07 20:00:00 Start True
32 2020-12-07 21:00:00 Start True
33 2020-12-07 22:00:00 Start True
34 2020-12-07 23:00:00 End True
35 2020-12-08 19:00:00 Start False
36 2020-12-08 20:00:00 Start True
37 2020-12-08 21:00:00 Start True
38 2020-12-08 22:00:00 Start True
39 2020-12-08 23:00:00 End True
40 2020-12-09 10:00:00 Start False
41 2020-12-09 11:00:00 Start False
42 2020-12-09 12:00:00 Start False
43 2020-12-09 13:00:00 Start False
44 2020-12-09 14:00:00 Start False
45 2020-12-09 15:00:00 End False
46 2020-12-11 19:00:00 Start False
47 2020-12-11 20:00:00 Start True
48 2020-12-11 21:00:00 Start True
49 2020-12-11 22:00:00 Start True

UPDATE: (2020-12-19)

I have simply filtered out the Start rows, as you were correct that an extra row was being calculated. I also passed dayfirst=True to pd.to_datetime() so the dates are converted correctly, and I cleaned up the output with some extra columns.

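import numpy as np
import pandas as pd

higher_pay = 40
lower_pay = 30

# Parse the day-first strings from the CSV (e.g. 02.12.2020 is 2 December).
df['Start'], df['End'] = pd.to_datetime(df['Start'], dayfirst=True), pd.to_datetime(df['End'], dayfirst=True)
start = df['Start']
# Stack Start and End into one 'Date' column and number each shift with cumcount.
df1 = df[['Start', 'End']].melt(value_name='Date').set_index('Date')
s = df1.groupby('variable').cumcount()
# Fill in one row per hour between each shift's Start and End.
df1 = df1.groupby(s, group_keys=False).resample('1H').asfreq().join(s.rename('Shift').to_frame()).ffill().reset_index()
# Drop the Start rows so each remaining row stands for one worked hour, stamped with the time that hour ends.
df1 = df1[~df1['Date'].isin(start)]
df1['Day'] = df1['Date'].dt.day_name()
df1['Week'] = df1['Date'].dt.isocalendar().week
# Higher rate: hours ending after 18:00, or any hour on a weekend.
m = (df1['Date'].dt.hour > 18) | (df1['Day'].isin(['Saturday', 'Sunday']))
df1['Higher Pay Hours'] = np.where(m, 1, 0)
df1['Lower Pay Hours'] = np.where(m, 0, 1)
df1['Pay'] = np.where(m, higher_pay, lower_pay)
# Total the hours and pay per shift, then attach them to the original rows.
df1 = df1.groupby(['Shift', 'Day', 'Week']).sum().reset_index()
df2 = df.merge(df1, how='left', left_index=True, right_on='Shift').drop('Shift', axis=1)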
df2
Out[1]: 
                Start                 End  Title  Hours        Day  Week  \
0 2020-12-02 07:00:00 2020-12-02 16:00:00  Shift    9.0  Wednesday    49   
1 2020-12-04 18:00:00 2020-12-04 21:00:00  Shift    3.0     Friday    49   
2 2020-12-05 07:00:00 2020-12-05 12:00:00  Shift    5.0   Saturday    49   
3 2020-12-06 09:00:00 2020-12-06 18:00:00  Shift    9.0     Sunday    49   
4 2020-12-07 19:00:00 2020-12-07 23:00:00  Shift    4.0     Monday    50   
5 2020-12-08 19:00:00 2020-12-08 23:00:00  Shift    4.0    Tuesday    50   
6 2020-12-09 10:00:00 2020-12-09 15:00:00  Shift    5.0  Wednesday    50   

   Higher Pay Hours  Lower Pay Hours  Pay  
0                 0                9  270  
1                 3                0  120  
2                 5                0  200  
3                 9                0  360  
4                 4                0  160  
5                 4                0  160  
6                 0                5  150  

There are probably more concise ways to do this, but I thought resampling the dataframe and then counting the hours would be a clean approach. You can melt the dataframe to have Start and End in the same column, then fill in the gap hours with resample, making sure to group together the 'Start' and 'End' values that were initially on the same row. The easiest way to figure out which rows were initially together is to take the cumulative count (cumcount) of the values in the new dataframe, grouped by the 'variable' column (whose values are 'Start' and 'End'). I'll show you how this works later in the answer.
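As one possible more concise route, here is a minimal sketch that skips resampling and splits each shift at 6pm with plain arithmetic. It is not the method used in this answer. It assumes every shift starts and ends on the same calendar day, and the names CUTOFF_HOUR, start_h and end_h are made up for illustration.

```python
import pandas as pd

# Three shifts in the CSV's day-first format (the second is the question's example).
df = pd.DataFrame({
    "Start": ["02.12.2020 07:00", "04.12.2020 15:00", "07.12.2020 19:00"],
    "End": ["02.12.2020 16:00", "04.12.2020 21:00", "07.12.2020 23:00"],
})
fmt = "%d.%m.%Y %H:%M"  # explicit day-first format
df["Start"] = pd.to_datetime(df["Start"], format=fmt)
df["End"] = pd.to_datetime(df["End"], format=fmt)

CUTOFF_HOUR = 18  # 6pm (hypothetical constant name)

# Express Start and End as hours since midnight of the shift's start day.
midnight = df["Start"].dt.normalize()
one_hour = pd.Timedelta(hours=1)
start_h = (df["Start"] - midnight) / one_hour
end_h = (df["End"] - midnight) / one_hour

# Hours before the cutoff = min(end, 18) - min(start, 18); the rest fall after it.
df["Total Hours"] = end_h - start_h
df["Hours before 6pm"] = end_h.clip(upper=CUTOFF_HOUR) - start_h.clip(upper=CUTOFF_HOUR)
df["Hours after 6pm"] = df["Total Hours"] - df["Hours before 6pm"]

print(df[["Start", "End", "Total Hours", "Hours before 6pm", "Hours after 6pm"]])
```

Unlike the hourly resample, this also handles shifts that do not start on the hour, but it would need extending for shifts that run past midnight or for the weekend rates.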

Full Code:

df['Start'], df['End'] = pd.to_datetime(df['Start']), pd.to_datetime(df['End'])
df = df[['Start', 'End']].melt().set_index('value')
df = df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26

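Adding some more details. The outputs below come from the original code above, which parsed the dates without dayfirst=True, so day and month appear swapped (2020-02-12 instead of 2020-12-02). Each shift also still includes its Start row, so it has one more row than its number of hours.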

Step 1: Melt the dataframe. You only need the two columns 'Start' and 'End' to get your desired output.

df = df[['Start', 'End']].melt().set_index('value')
df
Out[1]: 
                    variable
value                       
2020-02-12 07:00:00    Start
2020-04-12 18:00:00    Start
2020-05-12 07:00:00    Start
2020-06-12 09:00:00    Start
2020-07-12 19:00:00    Start
2020-08-12 19:00:00    Start
2020-09-12 10:00:00    Start
2020-02-12 16:00:00      End
2020-04-12 21:00:00      End
2020-05-12 12:00:00      End
2020-06-12 18:00:00      End
2020-07-12 23:00:00      End
2020-08-12 23:00:00      End
2020-09-12 15:00:00      End
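The dates in the output above read 2020-02-12 rather than 2020-12-02 because the day-first strings were parsed month-first. The sketch below shows the difference on one string from the question, then repeats the melt on a two-shift frame with correctly parsed dates. The variable names (raw, month_first, day_first, long) are illustrative only.

```python
import pandas as pd

raw = "02.12.2020 07:00"

# Default parsing reads this string month-first: 12 February 2020.
month_first = pd.to_datetime(raw)
# dayfirst=True reads it as 2 December 2020, which matches the CSV.
day_first = pd.to_datetime(raw, dayfirst=True)
print(month_first, day_first)

df = pd.DataFrame({
    "Start": pd.to_datetime(["02.12.2020 07:00", "04.12.2020 18:00"], dayfirst=True),
    "End": pd.to_datetime(["02.12.2020 16:00", "04.12.2020 21:00"], dayfirst=True),
})

# melt stacks both columns: 'variable' holds the old column name, 'Date' the timestamp.
long = df.melt(value_name="Date")
print(long)
```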

Step 2: Create a group in preparation for resample. As you can see, groups 0-6 line up with each other, pairing each 'Start' with the 'End' that was originally on the same row.

df.groupby('variable').cumcount()
Out[2]: 
value
2020-02-12 07:00:00    0
2020-04-12 18:00:00    1
2020-05-12 07:00:00    2
2020-06-12 09:00:00    3
2020-07-12 19:00:00    4
2020-08-12 19:00:00    5
2020-09-12 10:00:00    6
2020-02-12 16:00:00    0
2020-04-12 21:00:00    1
2020-05-12 12:00:00    2
2020-06-12 18:00:00    3
2020-07-12 23:00:00    4
2020-08-12 23:00:00    5
2020-09-12 15:00:00    6
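To check that cumcount really pairs each Start with its own End, here is a small sketch on three shifts. It pivots the melted frame back to one row per shift using the cumcount as the key. The names long, shift_id and pairs are made up for this example.

```python
import pandas as pd

# A melted frame like the one in Step 1, with three shifts (correctly parsed dates).
long = pd.DataFrame({
    "variable": ["Start", "Start", "Start", "End", "End", "End"],
    "value": pd.to_datetime([
        "2020-12-02 07:00", "2020-12-04 18:00", "2020-12-05 07:00",
        "2020-12-02 16:00", "2020-12-04 21:00", "2020-12-05 12:00",
    ]),
})

# Number the rows within each 'variable' group: the n-th Start and n-th End share an id.
shift_id = long.groupby("variable").cumcount()
print(shift_id.tolist())

# Pivot back to one row per shift to confirm the pairing.
pairs = long.assign(Shift=shift_id).pivot(index="Shift", columns="variable", values="value")
print(pairs)
```

This only works because melt keeps the rows of each column in their original order, so the n-th Start and the n-th End come from the same row.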

Step 3: Resample the data per group by hour to fill in the gaps for each group:

df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
Out[3]: 
                 value variable
0  2020-02-12 07:00:00    Start
1  2020-02-12 08:00:00    Start
2  2020-02-12 09:00:00    Start
3  2020-02-12 10:00:00    Start
4  2020-02-12 11:00:00    Start
5  2020-02-12 12:00:00    Start
6  2020-02-12 13:00:00    Start
7  2020-02-12 14:00:00    Start
8  2020-02-12 15:00:00    Start
9  2020-02-12 16:00:00      End
10 2020-04-12 18:00:00    Start
11 2020-04-12 19:00:00    Start
12 2020-04-12 20:00:00    Start
13 2020-04-12 21:00:00      End
14 2020-05-12 07:00:00    Start
15 2020-05-12 08:00:00    Start
16 2020-05-12 09:00:00    Start
17 2020-05-12 10:00:00    Start
18 2020-05-12 11:00:00    Start
19 2020-05-12 12:00:00      End
20 2020-06-12 09:00:00    Start
21 2020-06-12 10:00:00    Start
22 2020-06-12 11:00:00    Start
23 2020-06-12 12:00:00    Start
24 2020-06-12 13:00:00    Start
25 2020-06-12 14:00:00    Start
26 2020-06-12 15:00:00    Start
27 2020-06-12 16:00:00    Start
28 2020-06-12 17:00:00    Start
29 2020-06-12 18:00:00      End
30 2020-07-12 19:00:00    Start
31 2020-07-12 20:00:00    Start
32 2020-07-12 21:00:00    Start
33 2020-07-12 22:00:00    Start
34 2020-07-12 23:00:00      End
35 2020-08-12 19:00:00    Start
36 2020-08-12 20:00:00    Start
37 2020-08-12 21:00:00    Start
38 2020-08-12 22:00:00    Start
39 2020-08-12 23:00:00      End
40 2020-09-12 10:00:00    Start
41 2020-09-12 11:00:00    Start
42 2020-09-12 12:00:00    Start
43 2020-09-12 13:00:00    Start
44 2020-09-12 14:00:00    Start
45 2020-09-12 15:00:00      End
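The same hourly expansion can be sketched without groupby and resample, by building one pd.date_range per shift. This is an alternative to the code above and not what the answer uses. It also makes the fencepost visible: a shift of n hours yields n + 1 timestamps, because both Start and End are included. The names shifts, hourly and worked are illustrative.

```python
import pandas as pd

shifts = pd.DataFrame({
    "Start": pd.to_datetime(["2020-12-04 18:00", "2020-12-05 07:00"]),
    "End": pd.to_datetime(["2020-12-04 21:00", "2020-12-05 12:00"]),
})

# One row per hourly timestamp from Start to End (inclusive) for every shift.
hourly = pd.concat(
    [
        pd.DataFrame({
            "Shift": row.Index,
            "Date": pd.date_range(row.Start, row.End, freq=pd.Timedelta(hours=1)),
        })
        for row in shifts.itertuples()
    ],
    ignore_index=True,
)
print(hourly.groupby("Shift").size().tolist())  # one more row than hours worked

# Drop the first timestamp of each shift so every remaining row is one worked hour.
worked = hourly[hourly.groupby("Shift").cumcount() > 0]
print(worked.groupby("Shift").size().tolist())
```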

Step 4: From there, you can calculate the boolean series I have called m. True values represent conditions met for the "Higher Rate".

m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
m
Out[4]: 
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11     True
12     True
13     True
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30     True
31     True
32     True
33     True
34     True
35     True
36     True
37     True
38     True
39     True
40     True
41     True
42     True
43     True
44     True
45     True
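To see how the mask treats the 18:00 boundary and weekends, here is a small sketch on four hand-picked timestamps with correctly parsed dates. The series name stamps is hypothetical. The comparison is strict, so a row stamped exactly 18:00 on a weekday is not flagged.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2020-12-02 16:00",  # Wednesday afternoon
    "2020-12-04 18:00",  # Friday, exactly 18:00
    "2020-12-04 19:00",  # Friday evening
    "2020-12-05 08:00",  # Saturday morning
]))

# Same condition as in Step 4: hour strictly greater than 18, or a weekend day.
m = (stamps.dt.hour > 18) | stamps.dt.day_name().isin(["Saturday", "Sunday"])
print(m.tolist())

# An equivalent weekend test that does not depend on day names (Monday=0 ... Sunday=6).
m_alt = (stamps.dt.hour > 18) | (stamps.dt.dayofweek >= 5)
print(m.equals(m_alt))
```

Whether hour > 18 is the right test depends on what each row stands for. If a row marks the end of a worked hour, 19:00 stands for 18:00 to 19:00 and the strict comparison fits. If it marks the start of the hour, hour >= 18 would be needed instead.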

Step 5: Filter the dataframe by True or False to count total hours for the Normal Rate and Higher Rate and print values.

print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
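Since m is a boolean series, the two counts can also be taken with sum(), and the pay follows directly from them. Below is a minimal sketch on a made-up five-row mask, using the 40 and 30 rates from the update at the top of this answer.

```python
import numpy as np
import pandas as pd

higher_pay, lower_pay = 40, 30  # example rates from the update above

# Hypothetical mask for five worked hours: True means the higher rate applies.
m = pd.Series([False, False, True, True, True])

higher_hours = int(m.sum())     # True counts as 1
normal_hours = int((~m).sum())  # invert the mask to count the rest

# Pay per hour, chosen by the mask, then summed.
pay = int(np.where(m, higher_pay, lower_pay).sum())
print(higher_hours, normal_hours, pay)
```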
