Find whether an event (with a start & end time) goes beyond a certain time (e.g. 6pm) in dataframe using Python (pandas, datetime)

I'm creating a Python program using the pandas and datetime libraries that will calculate the pay from my casual job each week, so I can cross-reference my bank statement instead of looking through payslips. The data that I am analysing is from the Google Calendar API, which is synced with my work schedule. It prints the events in that particular calendar to a CSV file in this format:

              Start               End  Title  Hours
0  02.12.2020 07:00  02.12.2020 16:00  Shift    9.0
1  04.12.2020 18:00  04.12.2020 21:00  Shift    3.0
2  05.12.2020 07:00  05.12.2020 12:00  Shift    5.0
3  06.12.2020 09:00  06.12.2020 18:00  Shift    9.0
4  07.12.2020 19:00  07.12.2020 23:00  Shift    4.0
5  08.12.2020 19:00  08.12.2020 23:00  Shift    4.0
6  09.12.2020 10:00  09.12.2020 15:00  Shift    5.0

As I am a casual at this job, I have to take a few things into consideration, like penalty rates (the base rate, after 6pm on Monday to Friday, Saturday, and Sunday all have different rates). I'm wondering if I can analyse this CSV using datetime and calculate how many hours are before 6pm and how many are after 6pm. So, taking this shift as an example, the output would look like:

              Start               End  Title  Hours
1  04.12.2020 15:00  04.12.2020 21:00  Shift    6.0

              Start               End  Title  Total Hours  Hours before 6pm  Hours after 6pm
1  04.12.2020 15:00  04.12.2020 21:00  Shift          6.0               3.0              3.0

I can use this to get the day of the week, but I'm just not sure how to analyse certain bits of time for penalty rates:


df['day_of_week'] = df['Start'].dt.day_name()
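
A minimal sketch of how that line behaves on the dd.mm.yyyy hh:mm strings shown in the CSV. The two-row sample frame and the explicit format string are assumptions for illustration; the point is that .dt.day_name() only works once the column holds datetimes rather than strings.

```python
import pandas as pd

# Two sample rows in the same dd.mm.yyyy hh:mm layout as the CSV (illustrative data)
df = pd.DataFrame({
    "Start": ["02.12.2020 07:00", "04.12.2020 18:00"],
    "End": ["02.12.2020 16:00", "04.12.2020 21:00"],
})

# An explicit format avoids any day/month ambiguity when parsing
fmt = "%d.%m.%Y %H:%M"
df["Start"] = pd.to_datetime(df["Start"], format=fmt)
df["End"] = pd.to_datetime(df["End"], format=fmt)

# day_name() is available through the .dt accessor on datetime columns
df["day_of_week"] = df["Start"].dt.day_name()
print(df)
```

Spelling out the format is one way to make sure 02.12.2020 is read as 2 December rather than 12 February.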

I appreciate any help in Python, or even other coding languages/techniques this can be applied to :)

Edit: This is how my dataframe is looking at the moment:

                Start                 End  Title  Hours day_of_week     Pay  week_of_year
0 2020-12-02 07:00:00 2020-12-02 16:00:00  Shift    9.0   Wednesday  337.30            49

EDIT: In response to David Erickson's comment.

                  value variable   bool
0   2020-12-02 07:00:00    Start  False
1   2020-12-02 08:00:00    Start  False
2   2020-12-02 09:00:00    Start  False
3   2020-12-02 10:00:00    Start  False
4   2020-12-02 11:00:00    Start  False
5   2020-12-02 12:00:00    Start  False
6   2020-12-02 13:00:00    Start  False
7   2020-12-02 14:00:00    Start  False
8   2020-12-02 15:00:00    Start  False
9   2020-12-02 16:00:00      End  False
10  2020-12-04 18:00:00    Start  False
11  2020-12-04 19:00:00    Start   True
12  2020-12-04 20:00:00    Start   True
13  2020-12-04 21:00:00      End   True
14  2020-12-05 07:00:00    Start  False
15  2020-12-05 08:00:00    Start  False
16  2020-12-05 09:00:00    Start  False
17  2020-12-05 10:00:00    Start  False
18  2020-12-05 11:00:00    Start  False
19  2020-12-05 12:00:00      End  False
20  2020-12-06 09:00:00    Start  False
21  2020-12-06 10:00:00    Start  False
22  2020-12-06 11:00:00    Start  False
23  2020-12-06 12:00:00    Start  False
24  2020-12-06 13:00:00    Start  False
25  2020-12-06 14:00:00    Start  False
26  2020-12-06 15:00:00    Start  False
27  2020-12-06 16:00:00    Start  False
28  2020-12-06 17:00:00    Start  False
29  2020-12-06 18:00:00      End  False
30  2020-12-07 19:00:00    Start  False
31  2020-12-07 20:00:00    Start   True
32  2020-12-07 21:00:00    Start   True
33  2020-12-07 22:00:00    Start   True
34  2020-12-07 23:00:00      End   True
35  2020-12-08 19:00:00    Start  False
36  2020-12-08 20:00:00    Start   True
37  2020-12-08 21:00:00    Start   True
38  2020-12-08 22:00:00    Start   True
39  2020-12-08 23:00:00      End   True
40  2020-12-09 10:00:00    Start  False
41  2020-12-09 11:00:00    Start  False
42  2020-12-09 12:00:00    Start  False
43  2020-12-09 13:00:00    Start  False
44  2020-12-09 14:00:00    Start  False
45  2020-12-09 15:00:00      End  False
46  2020-12-11 19:00:00    Start  False
47  2020-12-11 20:00:00    Start   True
48  2020-12-11 21:00:00    Start   True
49  2020-12-11 22:00:00    Start   True

UPDATE: (2020-12-19)

I have simply filtered out the Start rows, as you were correct that an extra row was being calculated. Also, I passed dayfirst=True to pd.to_datetime() to convert the dates correctly. I have also cleaned up the output with some extra columns.

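import numpy as np
import pandas as pd

higher_pay = 40
lower_pay = 30

# Parse the dd.mm.yyyy strings day-first so 02.12.2020 is read as 2 December
df['Start'], df['End'] = pd.to_datetime(df['Start'], dayfirst=True), pd.to_datetime(df['End'], dayfirst=True)
start = df['Start']
# Stack Start and End into a single 'Date' column and number each shift
df1 = df[['Start', 'End']].melt(value_name='Date').set_index('Date')
s = df1.groupby('variable').cumcount()
# Fill in one row per hour between each shift's Start and End
df1 = df1.groupby(s, group_keys=False).resample('1H').asfreq().join(s.rename('Shift').to_frame()).ffill().reset_index()
# Drop the Start rows so each remaining row stands for one hour worked
df1 = df1[~df1['Date'].isin(start)]
df1['Day'] = df1['Date'].dt.day_name()
df1['Week'] = df1['Date'].dt.isocalendar().week
# True for hours ending after 6pm, or any hour on a weekend
m = (df1['Date'].dt.hour > 18) | (df1['Day'].isin(['Saturday', 'Sunday']))
df1['Higher Pay Hours'] = np.where(m, 1, 0)
df1['Lower Pay Hours'] = np.where(m, 0, 1)
df1['Pay'] = np.where(m, higher_pay, lower_pay)
# Sum only the numeric columns for each shift
df1 = df1.groupby(['Shift', 'Day', 'Week'])[['Higher Pay Hours', 'Lower Pay Hours', 'Pay']].sum().reset_index()
df2 = df.merge(df1, how='left', left_index=True, right_on='Shift').drop('Shift', axis=1)
df2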
Out[1]: 
                Start                 End  Title  Hours        Day  Week  \
0 2020-12-02 07:00:00 2020-12-02 16:00:00  Shift    9.0  Wednesday    49   
1 2020-12-04 18:00:00 2020-12-04 21:00:00  Shift    3.0     Friday    49   
2 2020-12-05 07:00:00 2020-12-05 12:00:00  Shift    5.0   Saturday    49   
3 2020-12-06 09:00:00 2020-12-06 18:00:00  Shift    9.0     Sunday    49   
4 2020-12-07 19:00:00 2020-12-07 23:00:00  Shift    4.0     Monday    50   
5 2020-12-08 19:00:00 2020-12-08 23:00:00  Shift    4.0    Tuesday    50   
6 2020-12-09 10:00:00 2020-12-09 15:00:00  Shift    5.0  Wednesday    50   

   Higher Pay Hours  Lower Pay Hours  Pay  
0                 0                9  270  
1                 3                0  120  
2                 5                0  200  
3                 9                0  360  
4                 4                0  160  
5                 4                0  160  
6                 0                5  150  
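
Since the goal in the question is a pay figure per week, one possible next step is to total the per-shift results by the Week column. This is only a sketch: the frame below is rebuilt by hand from the output above so the snippet runs on its own, and the name weekly is made up for illustration.

```python
import pandas as pd

# Per-shift results copied from the output above
df2 = pd.DataFrame({
    "Week": [49, 49, 49, 49, 50, 50, 50],
    "Higher Pay Hours": [0, 3, 5, 9, 4, 4, 0],
    "Lower Pay Hours": [9, 0, 0, 0, 0, 0, 5],
    "Pay": [270, 120, 200, 360, 160, 160, 150],
})

# One row per ISO week with total hours at each rate and total pay
weekly = df2.groupby("Week")[["Higher Pay Hours", "Lower Pay Hours", "Pay"]].sum()
print(weekly)
```

With the rates used above (40 and 30), week 49 comes to 950 and week 50 to 470, which can then be compared against the bank statement.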

There are probably more concise ways to do this, but I thought resampling the dataframe and then counting the hours would be a clean approach. You can melt the dataframe so that Start and End share one column, then fill in the gap hours with resample, making sure to group by the 'Start' and 'End' values that were initially on the same row. The easiest way to figure out which rows were initially together is to take the cumulative count, with cumcount, of the values in the new dataframe grouped by 'Start' and 'End'. I'll show you how this works later in the answer.

Full Code:

# Without dayfirst=True, 02.12.2020 is read as 12 February (see the outputs below); the update above passes dayfirst=True
df['Start'], df['End'] = pd.to_datetime(df['Start']), pd.to_datetime(df['End'])
df = df[['Start', 'End']].melt().set_index('value')
df = df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
# m is True where the higher rate applies (after 6pm or on a weekend)
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26

Adding some more details...

Step 1: Melt the dataframe. You only need the two columns 'Start' and 'End' to get your desired output.

df = df[['Start', 'End']].melt().set_index('value')
df
Out[1]: 
                    variable
value                       
2020-02-12 07:00:00    Start
2020-04-12 18:00:00    Start
2020-05-12 07:00:00    Start
2020-06-12 09:00:00    Start
2020-07-12 19:00:00    Start
2020-08-12 19:00:00    Start
2020-09-12 10:00:00    Start
2020-02-12 16:00:00      End
2020-04-12 21:00:00      End
2020-05-12 12:00:00      End
2020-06-12 18:00:00      End
2020-07-12 23:00:00      End
2020-08-12 23:00:00      End
2020-09-12 15:00:00      End

Step 2: Create a group in preparation for resample. As you can see, groups 0-6 line up with each other, pairing each 'Start' with the 'End' it originally shared a row with.

df.groupby('variable').cumcount()
Out[2]: 
value
2020-02-12 07:00:00    0
2020-04-12 18:00:00    1
2020-05-12 07:00:00    2
2020-06-12 09:00:00    3
2020-07-12 19:00:00    4
2020-08-12 19:00:00    5
2020-09-12 10:00:00    6
2020-02-12 16:00:00    0
2020-04-12 21:00:00    1
2020-05-12 12:00:00    2
2020-06-12 18:00:00    3
2020-07-12 23:00:00    4
2020-08-12 23:00:00    5
2020-09-12 15:00:00    6
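
To make the pairing easier to see, here is a small self-contained sketch on two made-up shifts. It also shows, as an alternative I am assuming is available (pandas 1.1 or later), that melt(ignore_index=False) keeps the original row index, which gives the same shift numbers without cumcount.

```python
import pandas as pd

# Two illustrative shifts, already parsed as datetimes
shifts = pd.DataFrame({
    "Start": pd.to_datetime(["2020-12-02 07:00", "2020-12-04 18:00"]),
    "End": pd.to_datetime(["2020-12-02 16:00", "2020-12-04 21:00"]),
})

# Long format: all Start rows first, then all End rows
long = shifts.melt(value_name="Date")

# Counting within 'Start' and within 'End' recovers the original row number
long["Shift"] = long.groupby("variable").cumcount()
print(long)

# Alternative: keep the original index while melting
long_keep = shifts.melt(value_name="Date", ignore_index=False)
print(long_keep.index.tolist())
```

Both routes label the first Start and the first End as shift 0, the second pair as shift 1, and so on.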

Step 3: Resample the data per group by hour to fill in the gaps for each group:

df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
Out[3]: 
                 value variable
0  2020-02-12 07:00:00    Start
1  2020-02-12 08:00:00    Start
2  2020-02-12 09:00:00    Start
3  2020-02-12 10:00:00    Start
4  2020-02-12 11:00:00    Start
5  2020-02-12 12:00:00    Start
6  2020-02-12 13:00:00    Start
7  2020-02-12 14:00:00    Start
8  2020-02-12 15:00:00    Start
9  2020-02-12 16:00:00      End
10 2020-04-12 18:00:00    Start
11 2020-04-12 19:00:00    Start
12 2020-04-12 20:00:00    Start
13 2020-04-12 21:00:00      End
14 2020-05-12 07:00:00    Start
15 2020-05-12 08:00:00    Start
16 2020-05-12 09:00:00    Start
17 2020-05-12 10:00:00    Start
18 2020-05-12 11:00:00    Start
19 2020-05-12 12:00:00      End
20 2020-06-12 09:00:00    Start
21 2020-06-12 10:00:00    Start
22 2020-06-12 11:00:00    Start
23 2020-06-12 12:00:00    Start
24 2020-06-12 13:00:00    Start
25 2020-06-12 14:00:00    Start
26 2020-06-12 15:00:00    Start
27 2020-06-12 16:00:00    Start
28 2020-06-12 17:00:00    Start
29 2020-06-12 18:00:00      End
30 2020-07-12 19:00:00    Start
31 2020-07-12 20:00:00    Start
32 2020-07-12 21:00:00    Start
33 2020-07-12 22:00:00    Start
34 2020-07-12 23:00:00      End
35 2020-08-12 19:00:00    Start
36 2020-08-12 20:00:00    Start
37 2020-08-12 21:00:00    Start
38 2020-08-12 22:00:00    Start
39 2020-08-12 23:00:00      End
40 2020-09-12 10:00:00    Start
41 2020-09-12 11:00:00    Start
42 2020-09-12 12:00:00    Start
43 2020-09-12 13:00:00    Start
44 2020-09-12 14:00:00    Start
45 2020-09-12 15:00:00      End
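
Note that the resampled frame has one more row per shift than hours worked (ten rows for the nine-hour shift), which is why the update filters out the Start rows. As a different technique from the resample above, and only as a sketch, each shift can also be expanded with pd.date_range so that every row marks the start of one worked hour. The names records, hourly and hours_per_shift are made up for illustration, and the sample shifts are assumed to be whole hours long.

```python
import pandas as pd

# Two illustrative shifts, already parsed as datetimes
shifts = pd.DataFrame({
    "Start": pd.to_datetime(["2020-12-02 07:00", "2020-12-04 18:00"]),
    "End": pd.to_datetime(["2020-12-02 16:00", "2020-12-04 21:00"]),
})

one_hour = pd.Timedelta(hours=1)

# One record per worked hour, keyed by the time the hour starts.
# Stopping at End - 1 hour leaves out the End timestamp itself.
records = [
    {"Shift": shift_id, "HourStart": ts}
    for shift_id, (start, end) in enumerate(zip(shifts["Start"], shifts["End"]))
    for ts in pd.date_range(start, end - one_hour, freq=one_hour)
]
hourly = pd.DataFrame(records)

# Number of hourly rows per shift equals the shift length in hours
hours_per_shift = hourly.groupby("Shift").size()
print(hours_per_shift)
```

Because these rows are hour starts rather than hour ends, a 6pm test on them would use hour >= 18 instead of hour > 18.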

Step 4: From there, you can calculate the boolean series I have called m. True values represent conditions met for the "Higher Rate".

m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
m
Out[4]: 
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11     True
12     True
13     True
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30     True
31     True
32     True
33     True
34     True
35     True
36     True
37     True
38     True
39     True
40     True
41     True
42     True
43     True
44     True
45     True
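
A small sketch of how the two conditions in m behave at the boundaries, using four hand-picked timestamps (2020-12-04 is a Friday and 2020-12-05 a Saturday). The variable names after_six and weekend are only for illustration.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2020-12-04 17:00",  # Friday afternoon
    "2020-12-04 18:00",  # Friday, exactly 6pm
    "2020-12-04 19:00",  # Friday evening
    "2020-12-05 10:00",  # Saturday morning
]))

# Strictly greater than 18, so a row stamped exactly 18:00 is not flagged
after_six = stamps.dt.hour > 18
# Any time on Saturday or Sunday is flagged regardless of the hour
weekend = stamps.dt.day_name().isin(["Saturday", "Sunday"])

m = after_six | weekend
print(m.tolist())
```

Whether the 18:00 row should count depends on whether a timestamp marks the start or the end of the hour it represents, so it is worth checking that against how the rows were built.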

Step 5: Filter the dataframe by True or False to count the total hours for the Normal Rate and the Higher Rate, and print the values.

# m is True for the higher rate, so df[m] holds the higher-rate hours
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
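
As one of the possibly more concise routes mentioned earlier, the hours before and after a cutoff can also be computed arithmetically, without resampling. This is a sketch rather than a drop-in replacement: the function name split_at_cutoff and the output column names are made up, it assumes each shift starts and ends on the same calendar day, and it does not handle the weekend rates.

```python
import pandas as pd

def split_at_cutoff(df, cutoff_hour=18):
    """Add hours worked before and after cutoff_hour for same-day shifts."""
    # The cutoff moment on each shift's own date
    cutoff = df["Start"].dt.normalize() + pd.Timedelta(hours=cutoff_hour)
    # Clamp the cutoff into [Start, End] so shifts entirely before or
    # after the cutoff get zero hours on the other side
    split = cutoff.where(cutoff > df["Start"], df["Start"])
    split = split.where(split < df["End"], df["End"])
    out = df.copy()
    out["Hours before"] = (split - df["Start"]).dt.total_seconds() / 3600
    out["Hours after"] = (df["End"] - split).dt.total_seconds() / 3600
    return out

# Illustrative shifts: one before 6pm, one straddling it, one after it
shifts = pd.DataFrame({
    "Start": pd.to_datetime(["2020-12-02 07:00", "2020-12-04 15:00", "2020-12-07 19:00"]),
    "End": pd.to_datetime(["2020-12-02 16:00", "2020-12-04 21:00", "2020-12-07 23:00"]),
})
result = split_at_cutoff(shifts)
print(result)
```

Unlike counting hourly rows, this version also copes with shifts that start or end part-way through an hour.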
