Find whether an event (with a start & end time) goes beyond a certain time (e.g. 6pm) in dataframe using Python (pandas, datetime)
I'm creating a Python program using the pandas and datetime libraries that will calculate the pay from my casual job each week, so I can cross-reference my bank statement instead of looking through payslips. The data that I am analysing comes from the Google Calendar API, which is synced with my work schedule. It prints the events in that particular calendar to a csv file in this format:

| | Start | End | Title | Hours |
|---|---|---|---|---|
| 0 | 02.12.2020 07:00 | 02.12.2020 16:00 | Shift | 9.0 |
| 1 | 04.12.2020 18:00 | 04.12.2020 21:00 | Shift | 3.0 |
| 2 | 05.12.2020 07:00 | 05.12.2020 12:00 | Shift | 5.0 |
| 3 | 06.12.2020 09:00 | 06.12.2020 18:00 | Shift | 9.0 |
| 4 | 07.12.2020 19:00 | 07.12.2020 23:00 | Shift | 4.0 |
| 5 | 08.12.2020 19:00 | 08.12.2020 23:00 | Shift | 4.0 |
| 6 | 09.12.2020 10:00 | 09.12.2020 15:00 | Shift | 5.0 |

As I am a casual at this job, I have to take a few things into consideration, like penalty rates (the base rate, after 6pm on Monday - Friday, Saturday, and Sunday all have different rates). I'm wondering if I can analyse this csv using datetime and calculate how many hours are before 6pm and how many are after 6pm. So, using this as an example, the output would be like:

| | Start | End | Title | Hours |
|---|---|---|---|---|
| 1 | 04.12.2020 15:00 | 04.12.2020 21:00 | Shift | 6.0 |

| | Start | End | Title | Total Hours | Hours before 6pm | Hours after 6pm |
|---|---|---|---|---|---|---|
| 1 | 04.12.2020 15:00 | 04.12.2020 21:00 | Shift | 6.0 | 3.0 | 3.0 |

I can use this to get the day of the week, but I'm just not sure how to analyse certain bits of time for penalty rates:

    df['day_of_week'] = df['Start'].dt.day_name()

I appreciate any help in Python, or even in other coding languages/techniques this can be applied to :)
Edit: This is how my dataframe is looking at the moment:

| | Start | End | Title | Hours | day_of_week | Pay | week_of_year |
|---|---|---|---|---|---|---|---|
| 0 | 2020-12-02 07:00:00 | 2020-12-02 16:00:00 | Shift | 9.0 | Wednesday | 337.30 | 49 |

EDIT: In response to David Erickson's comment.

| | value | variable | bool |
|---|---|---|---|
| 0 | 2020-12-02 07:00:00 | Start | False |
| 1 | 2020-12-02 08:00:00 | Start | False |
| 2 | 2020-12-02 09:00:00 | Start | False |
| 3 | 2020-12-02 10:00:00 | Start | False |
| 4 | 2020-12-02 11:00:00 | Start | False |
| 5 | 2020-12-02 12:00:00 | Start | False |
| 6 | 2020-12-02 13:00:00 | Start | False |
| 7 | 2020-12-02 14:00:00 | Start | False |
| 8 | 2020-12-02 15:00:00 | Start | False |
| 9 | 2020-12-02 16:00:00 | End | False |
| 10 | 2020-12-04 18:00:00 | Start | False |
| 11 | 2020-12-04 19:00:00 | Start | True |
| 12 | 2020-12-04 20:00:00 | Start | True |
| 13 | 2020-12-04 21:00:00 | End | True |
| 14 | 2020-12-05 07:00:00 | Start | False |
| 15 | 2020-12-05 08:00:00 | Start | False |
| 16 | 2020-12-05 09:00:00 | Start | False |
| 17 | 2020-12-05 10:00:00 | Start | False |
| 18 | 2020-12-05 11:00:00 | Start | False |
| 19 | 2020-12-05 12:00:00 | End | False |
| 20 | 2020-12-06 09:00:00 | Start | False |
| 21 | 2020-12-06 10:00:00 | Start | False |
| 22 | 2020-12-06 11:00:00 | Start | False |
| 23 | 2020-12-06 12:00:00 | Start | False |
| 24 | 2020-12-06 13:00:00 | Start | False |
| 25 | 2020-12-06 14:00:00 | Start | False |
| 26 | 2020-12-06 15:00:00 | Start | False |
| 27 | 2020-12-06 16:00:00 | Start | False |
| 28 | 2020-12-06 17:00:00 | Start | False |
| 29 | 2020-12-06 18:00:00 | End | False |
| 30 | 2020-12-07 19:00:00 | Start | False |
| 31 | 2020-12-07 20:00:00 | Start | True |
| 32 | 2020-12-07 21:00:00 | Start | True |
| 33 | 2020-12-07 22:00:00 | Start | True |
| 34 | 2020-12-07 23:00:00 | End | True |
| 35 | 2020-12-08 19:00:00 | Start | False |
| 36 | 2020-12-08 20:00:00 | Start | True |
| 37 | 2020-12-08 21:00:00 | Start | True |
| 38 | 2020-12-08 22:00:00 | Start | True |
| 39 | 2020-12-08 23:00:00 | End | True |
| 40 | 2020-12-09 10:00:00 | Start | False |
| 41 | 2020-12-09 11:00:00 | Start | False |
| 42 | 2020-12-09 12:00:00 | Start | False |
| 43 | 2020-12-09 13:00:00 | Start | False |
| 44 | 2020-12-09 14:00:00 | Start | False |
| 45 | 2020-12-09 15:00:00 | End | False |
| 46 | 2020-12-11 19:00:00 | Start | False |
| 47 | 2020-12-11 20:00:00 | Start | True |
| 48 | 2020-12-11 21:00:00 | Start | True |
| 49 | 2020-12-11 22:00:00 | Start | True |

UPDATE: (2020-12-19)

I have simply filtered out the `Start` rows, as you were correct that an extra row was being calculated. Also, I passed `dayfirst=True` to `pd.to_datetime()` to convert the dates correctly. I have also made the output cleaner with some extra columns.

    import numpy as np
    import pandas as pd

    higher_pay = 40
    lower_pay = 30
    # Parse the dd.mm.yyyy strings day-first
    df['Start'], df['End'] = pd.to_datetime(df['Start'], dayfirst=True), pd.to_datetime(df['End'], dayfirst=True)
    start = df['Start']
    # Stack Start and End into one 'Date' column and number each shift
    df1 = df[['Start', 'End']].melt(value_name='Date').set_index('Date')
    s = df1.groupby('variable').cumcount()
    # Fill in one row per hour between each shift's Start and End
    df1 = df1.groupby(s, group_keys=False).resample('1H').asfreq().join(s.rename('Shift').to_frame()).ffill().reset_index()
    # Drop the Start rows so that each remaining row stands for one hour worked
    df1 = df1[~df1['Date'].isin(start)]
    df1['Day'] = df1['Date'].dt.day_name()
    df1['Week'] = df1['Date'].dt.isocalendar().week
    # Higher rate applies after 6pm or on a weekend
    m = (df1['Date'].dt.hour > 18) | (df1['Day'].isin(['Saturday', 'Sunday']))
    df1['Higher Pay Hours'] = np.where(m, 1, 0)
    df1['Lower Pay Hours'] = np.where(m, 0, 1)
    df1['Pay'] = np.where(m, higher_pay, lower_pay)
    # Total hours and pay per shift (numeric_only skips the Date and variable columns)
    df1 = df1.groupby(['Shift', 'Day', 'Week']).sum(numeric_only=True).reset_index()
    # Attach the totals to the original rows
    df2 = df.merge(df1, how='left', left_index=True, right_on='Shift').drop('Shift', axis=1)
    df2
    Out[1]:
                    Start                 End  Title  Hours        Day  Week  \
    0 2020-12-02 07:00:00 2020-12-02 16:00:00  Shift    9.0  Wednesday    49
    1 2020-12-04 18:00:00 2020-12-04 21:00:00  Shift    3.0     Friday    49
    2 2020-12-05 07:00:00 2020-12-05 12:00:00  Shift    5.0   Saturday    49
    3 2020-12-06 09:00:00 2020-12-06 18:00:00  Shift    9.0     Sunday    49
    4 2020-12-07 19:00:00 2020-12-07 23:00:00  Shift    4.0     Monday    50
    5 2020-12-08 19:00:00 2020-12-08 23:00:00  Shift    4.0    Tuesday    50
    6 2020-12-09 10:00:00 2020-12-09 15:00:00  Shift    5.0  Wednesday    50

       Higher Pay Hours  Lower Pay Hours  Pay
    0                 0                9  270
    1                 3                0  120
    2                 5                0  200
    3                 9                0  360
    4                 4                0  160
    5                 4                0  160
    6                 0                5  150

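
As a small illustration of why `dayfirst=True` matters for dates written like `02.12.2020`, the sketch below parses two of the sample strings both ways. The `raw` series and the explicit `format` string are illustrative choices of mine, not part of the code above. The first printed list should contain February and April dates, the second December dates.

```python
import pandas as pd

# Two of the sample timestamps, written day.month.year
raw = pd.Series(["02.12.2020 07:00", "04.12.2020 18:00"])

# Default parsing reads an ambiguous 02.12 as month 02, day 12
month_first = pd.to_datetime(raw)

# dayfirst=True reads it as day 02, month 12
day_first = pd.to_datetime(raw, dayfirst=True)

# An explicit format removes the ambiguity altogether
explicit = pd.to_datetime(raw, format="%d.%m.%Y %H:%M")

print(month_first.dt.strftime("%Y-%m-%d").tolist())
print(day_first.dt.strftime("%Y-%m-%d").tolist())
print((explicit == day_first).all())
```

If the csv always uses the same layout, passing `format` explicitly is probably the safer option, since it does not rely on pandas guessing the order.
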
There are probably more concise ways to do this, but I thought resampling the dataframe and then counting the hours would be a clean approach. You can `melt` the dataframe to have `Start` and `End` in the same column and fill in the gap hours with `resample`, making sure to `groupby` the 'Start' and 'End' values that were initially on the same row. The easiest way to figure out which rows were initially together is to get the cumulative count, with `cumcount`, of the values in the new dataframe grouped by 'Start' and 'End'. I'll show you how this works later in the answer.
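
One possibly more concise alternative, sketched here under some assumptions, skips the resampling and works out the overlap with the cutoff arithmetically. This is not the method used in this answer. It assumes every shift ends on the day it starts, and the `shifts` frame, `CUTOFF_HOUR`, and the column names are hypothetical. Weekend rates would still need a separate day-of-week check.

```python
import pandas as pd

# Hypothetical sample shifts in the same dd.mm.yyyy format as the question
shifts = pd.DataFrame({
    "Start": ["02.12.2020 07:00", "04.12.2020 15:00", "07.12.2020 18:30"],
    "End": ["02.12.2020 16:00", "04.12.2020 21:00", "07.12.2020 21:00"],
})
fmt = "%d.%m.%Y %H:%M"
shifts["Start"] = pd.to_datetime(shifts["Start"], format=fmt)
shifts["End"] = pd.to_datetime(shifts["End"], format=fmt)

CUTOFF_HOUR = 18  # assumption: the evening rate starts at 6pm

# Express both ends of each shift as hours since midnight of the start day
midnight = shifts["Start"].dt.normalize()
start_h = (shifts["Start"] - midnight).dt.total_seconds() / 3600
end_h = (shifts["End"] - midnight).dt.total_seconds() / 3600

# Time before the cutoff is the part of [start_h, end_h] that lies below it
before = end_h.clip(upper=CUTOFF_HOUR) - start_h.clip(upper=CUTOFF_HOUR)
shifts["Hours before 6pm"] = before
shifts["Hours after 6pm"] = (end_h - start_h) - before

print(shifts)
```

Because it works on fractional hours, this version also copes with shifts that start or end on the half hour, which hourly row counting does not.
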
Full Code:

    import pandas as pd

    df['Start'], df['End'] = pd.to_datetime(df['Start']), pd.to_datetime(df['End'])
    df = df[['Start', 'End']].melt().set_index('value')
    df = df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
    # True marks an hour at the higher rate: after 6pm, or on a weekend
    m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
    print('Higher Rate No. of Hours', df[m].shape[0])
    print('Normal Rate No. of Hours', df[~m].shape[0])

    Higher Rate No. of Hours 20
    Normal Rate No. of Hours 26

Adding some more details...
Step 1: Melt the dataframe. You only need the two columns 'Start' and 'End' to get your desired output.

    df = df[['Start', 'End']].melt().set_index('value')
    df
    Out[1]:
                        variable
    value
    2020-02-12 07:00:00    Start
    2020-04-12 18:00:00    Start
    2020-05-12 07:00:00    Start
    2020-06-12 09:00:00    Start
    2020-07-12 19:00:00    Start
    2020-08-12 19:00:00    Start
    2020-09-12 10:00:00    Start
    2020-02-12 16:00:00      End
    2020-04-12 21:00:00      End
    2020-05-12 12:00:00      End
    2020-06-12 18:00:00      End
    2020-07-12 23:00:00      End
    2020-08-12 23:00:00      End
    2020-09-12 15:00:00      End

Step 2: Create groups in preparation for the resample. As you can see, groups 0-6 line up with each other, pairing each 'Start' with the 'End' that was on the same row previously.

    df.groupby('variable').cumcount()
    Out[2]:
    value
    2020-02-12 07:00:00    0
    2020-04-12 18:00:00    1
    2020-05-12 07:00:00    2
    2020-06-12 09:00:00    3
    2020-07-12 19:00:00    4
    2020-08-12 19:00:00    5
    2020-09-12 10:00:00    6
    2020-02-12 16:00:00    0
    2020-04-12 21:00:00    1
    2020-05-12 12:00:00    2
    2020-06-12 18:00:00    3
    2020-07-12 23:00:00    4
    2020-08-12 23:00:00    5
    2020-09-12 15:00:00    6

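
To see the melt and `cumcount` pairing in isolation, here is a minimal sketch on a two-shift frame. The `shifts` and `long` names and the `Shift` column are mine, chosen only for illustration.

```python
import pandas as pd

shifts = pd.DataFrame({
    "Start": pd.to_datetime(["2020-12-02 07:00", "2020-12-04 18:00"]),
    "End": pd.to_datetime(["2020-12-02 16:00", "2020-12-04 21:00"]),
})

# melt stacks the Start values first, then the End values, keeping row order
long = shifts.melt(value_name="Date")

# The n-th 'Start' and the n-th 'End' get the same number n
long["Shift"] = long.groupby("variable").cumcount()

print(long)
```

The pairing relies on melt preserving the original row order within each variable, so it should be applied before any sorting or filtering of the melted frame.
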
Step 3: Resample the data per group by hour to fill in the gaps for each group:

    df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
    Out[3]:
                     value variable
    0  2020-02-12 07:00:00    Start
    1  2020-02-12 08:00:00    Start
    2  2020-02-12 09:00:00    Start
    3  2020-02-12 10:00:00    Start
    4  2020-02-12 11:00:00    Start
    5  2020-02-12 12:00:00    Start
    6  2020-02-12 13:00:00    Start
    7  2020-02-12 14:00:00    Start
    8  2020-02-12 15:00:00    Start
    9  2020-02-12 16:00:00      End
    10 2020-04-12 18:00:00    Start
    11 2020-04-12 19:00:00    Start
    12 2020-04-12 20:00:00    Start
    13 2020-04-12 21:00:00      End
    14 2020-05-12 07:00:00    Start
    15 2020-05-12 08:00:00    Start
    16 2020-05-12 09:00:00    Start
    17 2020-05-12 10:00:00    Start
    18 2020-05-12 11:00:00    Start
    19 2020-05-12 12:00:00      End
    20 2020-06-12 09:00:00    Start
    21 2020-06-12 10:00:00    Start
    22 2020-06-12 11:00:00    Start
    23 2020-06-12 12:00:00    Start
    24 2020-06-12 13:00:00    Start
    25 2020-06-12 14:00:00    Start
    26 2020-06-12 15:00:00    Start
    27 2020-06-12 16:00:00    Start
    28 2020-06-12 17:00:00    Start
    29 2020-06-12 18:00:00      End
    30 2020-07-12 19:00:00    Start
    31 2020-07-12 20:00:00    Start
    32 2020-07-12 21:00:00    Start
    33 2020-07-12 22:00:00    Start
    34 2020-07-12 23:00:00      End
    35 2020-08-12 19:00:00    Start
    36 2020-08-12 20:00:00    Start
    37 2020-08-12 21:00:00    Start
    38 2020-08-12 22:00:00    Start
    39 2020-08-12 23:00:00      End
    40 2020-09-12 10:00:00    Start
    41 2020-09-12 11:00:00    Start
    42 2020-09-12 12:00:00    Start
    43 2020-09-12 13:00:00    Start
    44 2020-09-12 14:00:00    Start
    45 2020-09-12 15:00:00      End

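
The effect of the hourly resample is easier to see on a single shift. The sketch below uses a hypothetical `one_shift` frame and passes the rule as `pd.Timedelta(hours=1)`, which I am assuming behaves the same as the hourly alias used above.

```python
import pandas as pd

# One 3-hour shift (18:00 to 21:00), indexed by timestamp like the melted frame
one_shift = pd.DataFrame(
    {"variable": ["Start", "End"]},
    index=pd.to_datetime(["2020-12-04 18:00", "2020-12-04 21:00"]),
)

# The hourly resample inserts the missing timestamps; ffill labels them 'Start'
hourly = one_shift.resample(pd.Timedelta(hours=1)).asfreq().ffill()

print(hourly)
print(len(hourly))  # number of timestamps, including both endpoints
```

A 3-hour shift produces four timestamps because both endpoints are included, which is the extra row that the UPDATE at the top of this answer removes by filtering out the `Start` rows.
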
Step 4: From there, you can calculate the boolean series I have called `m`. True values represent conditions met for the "Higher Rate".

    m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
    m
    Out[4]:
    0     False
    1     False
    2     False
    3     False
    4     False
    5     False
    6     False
    7     False
    8     False
    9     False
    10     True
    11     True
    12     True
    13     True
    14    False
    15    False
    16    False
    17    False
    18    False
    19    False
    20    False
    21    False
    22    False
    23    False
    24    False
    25    False
    26    False
    27    False
    28    False
    29    False
    30     True
    31     True
    32     True
    33     True
    34     True
    35     True
    36     True
    37     True
    38     True
    39     True
    40     True
    41     True
    42     True
    43     True
    44     True
    45     True

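
The comparison `.dt.hour > 18` is sensitive to what each hourly row stands for. As a hedged illustration (the `stamps` series and both mask names are hypothetical), the sketch below compares two conventions on a few timestamps. It uses `dt.dayofweek` rather than `day_name()` only to avoid depending on English day names.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2020-12-04 17:00",  # Friday
    "2020-12-04 18:00",  # Friday
    "2020-12-04 19:00",  # Friday
    "2020-12-05 09:00",  # Saturday
]))

is_weekend = stamps.dt.dayofweek >= 5  # Monday is 0, so 5 and 6 are Sat/Sun

# If a row marks the END of an hour worked, 18:00 covers 17:00-18:00 (normal rate)
end_of_hour = (stamps.dt.hour > 18) | is_weekend

# If a row marks the START of an hour worked, 18:00 covers 18:00-19:00 (higher rate)
start_of_hour = (stamps.dt.hour >= 18) | is_weekend

print(end_of_hour.tolist())
print(start_of_hour.tolist())
```

With the `Start` rows dropped, as in the UPDATE, each remaining row marks the end of an hour, so `> 18` is the matching comparison. If the `End` rows were dropped instead, `>= 18` would be the one to use.
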
Step 5: Filter the dataframe by `True` or `False` to count the total hours for the Normal Rate and the Higher Rate, and print the values.

    print('Higher Rate No. of Hours', df[m].shape[0])
    print('Normal Rate No. of Hours', df[~m].shape[0])

    Higher Rate No. of Hours 20
    Normal Rate No. of Hours 26

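
Since `m` is a boolean series, the same counts can also be taken from the mask directly and turned into a pay figure. This is only a sketch on a toy mask; the rates of 40 and 30 are the example values from the UPDATE at the top of this answer.

```python
import pandas as pd

# Toy mask: True marks an hour at the higher rate
m = pd.Series([False, True, True, False, True])

higher_hours = int(m.sum())     # True counts as 1
normal_hours = int((~m).sum())  # invert the mask to count the rest

# Example rates taken from the UPDATE section (higher_pay and lower_pay)
pay = higher_hours * 40 + normal_hours * 30

print(higher_hours, normal_hours, pay)
```
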