Using Python (pandas, datetime) to find whether an event (with a start and end time) goes past a specific time (e.g. 6 pm) in a dataframe

Question

I'm creating a Python program using the pandas and datetime libraries that will calculate the pay from my casual job each week, so I can cross-reference my bank statement instead of looking through payslips. The data that I am analysing comes from the Google Calendar API, which is synced with my work schedule. It prints the events in that particular calendar to a csv file in this format:

	Start	End	Title	Hours
0	02.12.2020 07:00	02.12.2020 16:00	Shift	9.0
1	04.12.2020 18:00	04.12.2020 21:00	Shift	3.0
2	05.12.2020 07:00	05.12.2020 12:00	Shift	5.0
3	06.12.2020 09:00	06.12.2020 18:00	Shift	9.0
4	07.12.2020 19:00	07.12.2020 23:00	Shift	4.0
5	08.12.2020 19:00	08.12.2020 23:00	Shift	4.0
6	09.12.2020 10:00	09.12.2020 15:00	Shift	5.0

As I am a casual at this job I have to take a few things into consideration, like penalty rates (the base rate, after 6pm on Monday - Friday, Saturday, and Sunday all have different rates). I'm wondering if I can analyse this csv using datetime and calculate how many hours are before 6pm, and how many are after 6pm. So using this as an example the output would be like:

	Start	End	Title	Hours
1	04.12.2020 15:00	04.12.2020 21:00	Shift	6.0

	Start	End	Title	Total Hours	Hours before 6pm	Hours after 6pm
1	04.12.2020 15:00	04.12.2020 21:00	Shift	6.0	3.0	3.0

I can use this to get the day of the week, but I'm just not sure how to analyse certain bits of time for penalty rates:


df['day_of_week'] = df['Start'].dt.day_name()

I appreciate any help in Python, or even other coding languages/techniques this can be applied to :)

Edit: This is how my dataframe is looking at the moment

	Start	End	Title	Hours	day_of_week	Pay	week_of_year
0	2020-12-02 07:00:00	2020-12-02 16:00:00	Shift	9.0	Wednesday	337.30	49

EDIT: In response to David Erickson's comment.

	value	variable	bool
0	2020-12-02 07:00:00	Start	False
1	2020-12-02 08:00:00	Start	False
2	2020-12-02 09:00:00	Start	False
3	2020-12-02 10:00:00	Start	False
4	2020-12-02 11:00:00	Start	False
5	2020-12-02 12:00:00	Start	False
6	2020-12-02 13:00:00	Start	False
7	2020-12-02 14:00:00	Start	False
8	2020-12-02 15:00:00	Start	False
9	2020-12-02 16:00:00	End	False
10	2020-12-04 18:00:00	Start	False
11	2020-12-04 19:00:00	Start	True
12	2020-12-04 20:00:00	Start	True
13	2020-12-04 21:00:00	End	True
14	2020-12-05 07:00:00	Start	False
15	2020-12-05 08:00:00	Start	False
16	2020-12-05 09:00:00	Start	False
17	2020-12-05 10:00:00	Start	False
18	2020-12-05 11:00:00	Start	False
19	2020-12-05 12:00:00	End	False
20	2020-12-06 09:00:00	Start	False
21	2020-12-06 10:00:00	Start	False
22	2020-12-06 11:00:00	Start	False
23	2020-12-06 12:00:00	Start	False
24	2020-12-06 13:00:00	Start	False
25	2020-12-06 14:00:00	Start	False
26	2020-12-06 15:00:00	Start	False
27	2020-12-06 16:00:00	Start	False
28	2020-12-06 17:00:00	Start	False
29	2020-12-06 18:00:00	End	False
30	2020-12-07 19:00:00	Start	False
31	2020-12-07 20:00:00	Start	True
32	2020-12-07 21:00:00	Start	True
33	2020-12-07 22:00:00	Start	True
34	2020-12-07 23:00:00	End	True
35	2020-12-08 19:00:00	Start	False
36	2020-12-08 20:00:00	Start	True
37	2020-12-08 21:00:00	Start	True
38	2020-12-08 22:00:00	Start	True
39	2020-12-08 23:00:00	End	True
40	2020-12-09 10:00:00	Start	False
41	2020-12-09 11:00:00	Start	False
42	2020-12-09 12:00:00	Start	False
43	2020-12-09 13:00:00	Start	False
44	2020-12-09 14:00:00	Start	False
45	2020-12-09 15:00:00	End	False
46	2020-12-11 19:00:00	Start	False
47	2020-12-11 20:00:00	Start	True
48	2020-12-11 21:00:00	Start	True
49	2020-12-11 22:00:00	Start	True

Answer 1

UPDATE: (2020-12-19)

I have simply filtered out the Start rows, as you were correct that an extra row was being calculated. Also, I passed dayfirst=True to pd.to_datetime() to convert the dates correctly. I have also cleaned up the output with some extra columns.

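import numpy as np
import pandas as pd

higher_pay = 40
lower_pay = 30

df['Start'], df['End'] = pd.to_datetime(df['Start'], dayfirst=True), pd.to_datetime(df['End'], dayfirst=True)
start = df['Start']
df1 = df[['Start', 'End']].melt(value_name='Date').set_index('Date')
# s numbers the Start rows and the End rows separately, so equal numbers mark one shift
s = df1.groupby('variable').cumcount()
# fill in one row per hour within each shift ('1H' is spelled '1h' from pandas 2.2 on)
df1 = df1.groupby(s, group_keys=False).resample('1H').asfreq().join(s.rename('Shift').to_frame()).ffill().reset_index()
# drop the Start rows so that each remaining row stands for one completed hour
df1 = df1[~df1['Date'].isin(start)]
df1['Day'] = df1['Date'].dt.day_name()
df1['Week'] = df1['Date'].dt.isocalendar().week
# m is True for higher-rate hours: after 6 pm, or on a Saturday/Sunday
m = (df1['Date'].dt.hour > 18) | (df1['Day'].isin(['Saturday', 'Sunday']))
df1['Higher Pay Hours'] = np.where(m, 1, 0)
df1['Lower Pay Hours'] = np.where(m, 0, 1)
df1['Pay'] = np.where(m, higher_pay, lower_pay)
# sum only the numeric columns per shift (newer pandas raises on summing the datetime column)
df1 = df1.groupby(['Shift', 'Day', 'Week'])[['Higher Pay Hours', 'Lower Pay Hours', 'Pay']].sum().reset_index()
df2 = df.merge(df1, how='left', left_index=True, right_on='Shift').drop('Shift', axis=1)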
df2
Out[1]: 
                Start                 End  Title  Hours        Day  Week  \
0 2020-12-02 07:00:00 2020-12-02 16:00:00  Shift    9.0  Wednesday    49   
1 2020-12-04 18:00:00 2020-12-04 21:00:00  Shift    3.0     Friday    49   
2 2020-12-05 07:00:00 2020-12-05 12:00:00  Shift    5.0   Saturday    49   
3 2020-12-06 09:00:00 2020-12-06 18:00:00  Shift    9.0     Sunday    49   
4 2020-12-07 19:00:00 2020-12-07 23:00:00  Shift    4.0     Monday    50   
5 2020-12-08 19:00:00 2020-12-08 23:00:00  Shift    4.0    Tuesday    50   
6 2020-12-09 10:00:00 2020-12-09 15:00:00  Shift    5.0  Wednesday    50   

   Higher Pay Hours  Lower Pay Hours  Pay  
0                 0                9  270  
1                 3                0  120  
2                 5                0  200  
3                 9                0  360  
4                 4                0  160  
5                 4                0  160  
6                 0                5  150
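
Since the goal in the question is a weekly figure to check against a bank statement, the per-shift rows can then be summed by week. This is only a small sketch: the frame below re-types the Week and Pay columns from the output above so that it runs on its own, and the name weekly is my own.

```python
import pandas as pd

# Week and Pay columns copied from the df2 output above.
df2 = pd.DataFrame({
    "Week": [49, 49, 49, 49, 50, 50, 50],
    "Pay": [270, 120, 200, 360, 160, 160, 150],
})

# Total pay per ISO week number.
weekly = df2.groupby("Week")["Pay"].sum()
print(weekly)
```

With the example rates of 40 and 30, this gives 950 for week 49 and 470 for week 50.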

There are probably more concise ways to do this, but I thought resampling the dataframe and then counting the hours would be a clean approach. You can melt the dataframe to have Start and End in the same column and fill in the gap hours with resample, making sure to groupby by the 'Start' and 'End' values that were initially on the same row. The easiest way to figure out which rows were initially together is to get the cumulative count with cumcount of the values in the new dataframe grouped by 'Start' and 'End'. I'll show you how this works later in the answer.
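
The same hour-counting idea can also be sketched without resample. The following is a minimal, self-contained illustration and not the code used in this answer: it expands each shift into one row per worked hour with pd.date_range and DataFrame.explode, then counts the hours per shift. The sample rows, the column names HourStart, Higher and Lower, and the assumption that shifts start and end exactly on the hour are all mine.

```python
import pandas as pd

# Hypothetical sample shifts in the question's dd.mm.yyyy format.
df = pd.DataFrame({
    "Start": ["02.12.2020 07:00", "04.12.2020 15:00", "05.12.2020 07:00"],
    "End": ["02.12.2020 16:00", "04.12.2020 21:00", "05.12.2020 12:00"],
})
for col in ["Start", "End"]:
    df[col] = pd.to_datetime(df[col], format="%d.%m.%Y %H:%M")

one_hour = pd.Timedelta(hours=1)

# For every shift, list the start time of each worked hour
# (assumes shifts begin and end exactly on the hour).
hour_starts = pd.Series(
    [list(pd.date_range(s, e - one_hour, freq=one_hour))
     for s, e in zip(df["Start"], df["End"])],
    index=df.index,
)

# One row per worked hour; the index still identifies the original shift.
hourly = df.assign(HourStart=hour_starts).explode("HourStart")
hourly["HourStart"] = pd.to_datetime(hourly["HourStart"])

# Higher rate: the hour starts at 18:00 or later, or falls on a weekend.
higher = (hourly["HourStart"].dt.hour >= 18) | (hourly["HourStart"].dt.dayofweek >= 5)
hourly["Higher"] = higher.astype(int)
hourly["Lower"] = 1 - hourly["Higher"]

# Sum the hourly flags back to one row per shift.
result = df.join(hourly.groupby(level=0)[["Higher", "Lower"]].sum())
print(result)
```

Because each row is labelled by the time the hour begins, the test is hour >= 18 here, whereas the code above labels each hour by its end and tests hour > 18.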

Full Code (original version, from before the update above, which parses the dates without dayfirst=True):

import pandas as pd

df['Start'], df['End'] = pd.to_datetime(df['Start']), pd.to_datetime(df['End'])
df = df[['Start', 'End']].melt().set_index('value')
df = df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
# m is True for the higher rate (after 6 pm, or on a Saturday/Sunday)
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26

Adding some more details...

Step 1: Melt the dataframe: You only need two columns, 'Start' and 'End', to get your desired output

df = df[['Start', 'End']].melt().set_index('value')
df
Out[1]: 
                    variable
value                       
2020-02-12 07:00:00    Start
2020-04-12 18:00:00    Start
2020-05-12 07:00:00    Start
2020-06-12 09:00:00    Start
2020-07-12 19:00:00    Start
2020-08-12 19:00:00    Start
2020-09-12 10:00:00    Start
2020-02-12 16:00:00      End
2020-04-12 21:00:00      End
2020-05-12 12:00:00      End
2020-06-12 18:00:00      End
2020-07-12 23:00:00      End
2020-08-12 23:00:00      End
2020-09-12 15:00:00      End

Step 2: Create Group in preparation for resample: *As you can see, groups 0-6 line up with each other, representing 'Start' and 'End' as they were together previously

df.groupby('variable').cumcount()
Out[2]: 
value
2020-02-12 07:00:00    0
2020-04-12 18:00:00    1
2020-05-12 07:00:00    2
2020-06-12 09:00:00    3
2020-07-12 19:00:00    4
2020-08-12 19:00:00    5
2020-09-12 10:00:00    6
2020-02-12 16:00:00    0
2020-04-12 21:00:00    1
2020-05-12 12:00:00    2
2020-06-12 18:00:00    3
2020-07-12 23:00:00    4
2020-08-12 23:00:00    5
2020-09-12 15:00:00    6
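
To see why cumcount recovers the original pairing, here is a small sketch on a hypothetical two-shift frame. The names long_df, pair and restored are my own: numbering the rows within each 'variable' group gives the n-th Start and the n-th End the same number, and pivoting on that number puts them back on one row.

```python
import pandas as pd

# Two hypothetical shifts.
df = pd.DataFrame({
    "Start": pd.to_datetime(["2020-12-02 07:00", "2020-12-04 18:00"]),
    "End": pd.to_datetime(["2020-12-02 16:00", "2020-12-04 21:00"]),
})

# Long format: all Start values first, then all End values.
long_df = df[["Start", "End"]].melt()

# Number the rows within each 'variable' group: 0, 1 for Start and 0, 1 for End.
long_df["pair"] = long_df.groupby("variable").cumcount()
print(long_df)

# Pivoting on that number puts each Start back beside its End.
restored = long_df.pivot(index="pair", columns="variable", values="value")
print(restored)
```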

Step 3: Resample the data per group by hour to fill in the gaps for each group:

df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
Out[3]: 
                 value variable
0  2020-02-12 07:00:00    Start
1  2020-02-12 08:00:00    Start
2  2020-02-12 09:00:00    Start
3  2020-02-12 10:00:00    Start
4  2020-02-12 11:00:00    Start
5  2020-02-12 12:00:00    Start
6  2020-02-12 13:00:00    Start
7  2020-02-12 14:00:00    Start
8  2020-02-12 15:00:00    Start
9  2020-02-12 16:00:00      End
10 2020-04-12 18:00:00    Start
11 2020-04-12 19:00:00    Start
12 2020-04-12 20:00:00    Start
13 2020-04-12 21:00:00      End
14 2020-05-12 07:00:00    Start
15 2020-05-12 08:00:00    Start
16 2020-05-12 09:00:00    Start
17 2020-05-12 10:00:00    Start
18 2020-05-12 11:00:00    Start
19 2020-05-12 12:00:00      End
20 2020-06-12 09:00:00    Start
21 2020-06-12 10:00:00    Start
22 2020-06-12 11:00:00    Start
23 2020-06-12 12:00:00    Start
24 2020-06-12 13:00:00    Start
25 2020-06-12 14:00:00    Start
26 2020-06-12 15:00:00    Start
27 2020-06-12 16:00:00    Start
28 2020-06-12 17:00:00    Start
29 2020-06-12 18:00:00      End
30 2020-07-12 19:00:00    Start
31 2020-07-12 20:00:00    Start
32 2020-07-12 21:00:00    Start
33 2020-07-12 22:00:00    Start
34 2020-07-12 23:00:00      End
35 2020-08-12 19:00:00    Start
36 2020-08-12 20:00:00    Start
37 2020-08-12 21:00:00    Start
38 2020-08-12 22:00:00    Start
39 2020-08-12 23:00:00      End
40 2020-09-12 10:00:00    Start
41 2020-09-12 11:00:00    Start
42 2020-09-12 12:00:00    Start
43 2020-09-12 13:00:00    Start
44 2020-09-12 14:00:00    Start
45 2020-09-12 15:00:00      End

Step 4: From there, you can calculate the boolean series I have called m: *True values represent conditions met for "Higher Rate".

m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
m
Out[4]: 
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11     True
12     True
13     True
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30     True
31     True
32     True
33     True
34     True
35     True
36     True
37     True
38     True
39     True
40     True
41     True
42     True
43     True
44     True
45     True

Step 5: Filter the dataframe by True or False to count total hours for the Normal Rate and Higher Rate and print values.

print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
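
If shifts do not always start and end on the hour, counting hourly rows will not be exact. As an alternative sketch, and not part of the approach above, the before/after split can be computed directly from the timestamps. The sample rows and the two output column names are assumptions, it assumes no shift runs past midnight, and it only splits at 6 pm without handling the separate weekend rates.

```python
import pandas as pd

# Hypothetical shifts, including one that starts on the half hour.
df = pd.DataFrame({
    "Start": pd.to_datetime(["2020-12-02 07:00", "2020-12-04 15:00", "2020-12-07 19:30"]),
    "End": pd.to_datetime(["2020-12-02 16:00", "2020-12-04 21:00", "2020-12-07 23:00"]),
})

# 6 pm on the day each shift starts (assumes no shift runs past midnight).
cutoff = df["Start"].dt.normalize() + pd.Timedelta(hours=18)

# The earlier of End and the cutoff marks where the "before 6 pm" part stops.
before_end = df["End"].where(df["End"] < cutoff, cutoff)

total = (df["End"] - df["Start"]).dt.total_seconds() / 3600
# Shifts that start after 6 pm would give a negative value, so floor at 0.
before = ((before_end - df["Start"]).dt.total_seconds() / 3600).clip(lower=0)

df["Hours before 6pm"] = before
df["Hours after 6pm"] = total - before
print(df)
```

For the 15:00 to 21:00 shift from the question this gives 3.0 hours before and 3.0 hours after 6 pm.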
