Merge pandas DataFrames by partial match in datetime column
Hello, good people of Stack Overflow. I can't quite grasp the solution here, so please help me out. Keep in mind that I'm quite a beginner at Python, so please keep it as simple as you can.
My company provides employees with transportation to and from work. There is a system in place that tracks when an employee got on the bus and which bus that person boarded. We also receive data from the transportation company showing where and when employees were supposed to go according to the plan (every employee books a spot in advance). Sometimes people don't book a spot, and sometimes they get on the wrong bus (not the route they booked) or travel at the wrong time. My goal is to find such people and produce a report.
Here is a sample of the data we receive from the transportation company:
IDs    DepartureTime               Destination
13519  2019-12-15 16:15:00.000000  100 DefaultCity
10977  2019-12-15 16:15:00.000000  200 DefaultCity_2
13329  2019-12-15 16:15:00.000000  300 DefaultCity_3
14597  2019-12-16 16:15:00.000000  200 DefaultCity_2
16899  2019-12-16 16:15:00.000000  400 DefaultCity_4
14616  2019-12-16 16:15:00.000000  300 DefaultCity_3
12519  2019-12-17 16:15:00.000000  800 DefaultCity_8
11347  2019-12-17 16:15:00.000000  200 DefaultCity_2
Here is a sample of the factual data we receive from the tracking system:
EmployeeID  DepartureTime               Destination
3027199     2019-12-15 16:12:53.000000  800 DefaultCity_8
3022569     2019-12-15 19:11:24.000000  200 DefaultCity_2
3672468     2019-12-15 16:22:46.000000  300 DefaultCity_3
3027419     2019-12-16 16:12:53.000000  800 DefaultCity_8
3045129     2019-12-16 16:11:24.000000  400 DefaultCity_4
3869438     2019-12-16 16:22:46.000000  300 DefaultCity_3
3487645     2019-12-17 16:12:53.000000  800 DefaultCity_8
3345935     2019-12-17 19:11:24.000000  200 DefaultCity_2
3235128     2019-12-17 16:22:46.000000  300 DefaultCity_3
I also have an SQL table that maps IDs to EmployeeID:
EmployeeID  name          IDs
3027199     Alice Doe     13519
3022569     Bob Doe       10977
3672468     Karl Doe      13329
3027419     Mark Doe      14597
3045129     Jenna Doe     16899
3869438     Victoria Doe  14616
3487645     Vladimir Doe  12519
3345935     Kenny Doe     11347
3235128     Heather Doe   14403
It is worth mentioning that "planned" data is present for every working date, but "factual" data is not, since the company only performs spontaneous spot checks.
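Because factual data exists only for spot-check dates, one possible first step is to keep only the planned rows whose date also appears in the factual data. The following is a minimal sketch, not a definitive implementation. It uses a trimmed, hypothetical subset of the sample rows in which no spot check happened on 2019-12-16, and the frame names `booking`, `actual` and `booking_checked` are made up for illustration.

```python
import pandas as pd

# Planned trips (subset of the sample; one row per day).
booking = pd.DataFrame({
    "IDs": [13519, 14597, 12519],
    "DepartureTime": pd.to_datetime(
        ["2019-12-15 16:15:00", "2019-12-16 16:15:00", "2019-12-17 16:15:00"]
    ),
})

# Factual trips (hypothetical: spot checks on 12-15 and 12-17 only).
actual = pd.DataFrame({
    "EmployeeID": [3027199, 3487645],
    "DepartureTime": pd.to_datetime(
        ["2019-12-15 16:12:53", "2019-12-17 16:12:53"]
    ),
})

# dt.normalize() sets the time part to midnight, leaving only the date.
checked_dates = actual["DepartureTime"].dt.normalize().unique()

# Keep only planned rows that fall on a date with a spot check.
on_checked_day = booking["DepartureTime"].dt.normalize().isin(checked_dates)
booking_checked = booking[on_checked_day]

print(booking_checked)
```

With this subset, the planned row for 2019-12-16 is dropped, so later comparisons only involve days for which factual data exists.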
What I managed to do:
What I'm struggling to do:
I'll be glad to provide any additional info that may help. Thank you in advance.
You can do as follows. In the code below, the first DataFrame is named df_booking, the second is named df_actual, and the SQL table is named df_info.
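To try the code below without access to the real sources, the three frames could be built by hand from the sample rows in the question. This is only a sketch: in practice the data would presumably come from files or from the SQL table, and parsing the timestamps up front with `pd.to_datetime` is a choice made here, not something the question specifies.

```python
import pandas as pd

# Planned trips from the transportation company (sample rows from the question).
df_booking = pd.DataFrame(
    [
        (13519, "2019-12-15 16:15:00", "100 DefaultCity"),
        (10977, "2019-12-15 16:15:00", "200 DefaultCity_2"),
        (13329, "2019-12-15 16:15:00", "300 DefaultCity_3"),
        (14597, "2019-12-16 16:15:00", "200 DefaultCity_2"),
        (16899, "2019-12-16 16:15:00", "400 DefaultCity_4"),
        (14616, "2019-12-16 16:15:00", "300 DefaultCity_3"),
        (12519, "2019-12-17 16:15:00", "800 DefaultCity_8"),
        (11347, "2019-12-17 16:15:00", "200 DefaultCity_2"),
    ],
    columns=["IDs", "DepartureTime", "Destination"],
)

# Factual trips from the tracking system.
df_actual = pd.DataFrame(
    [
        (3027199, "2019-12-15 16:12:53", "800 DefaultCity_8"),
        (3022569, "2019-12-15 19:11:24", "200 DefaultCity_2"),
        (3672468, "2019-12-15 16:22:46", "300 DefaultCity_3"),
        (3027419, "2019-12-16 16:12:53", "800 DefaultCity_8"),
        (3045129, "2019-12-16 16:11:24", "400 DefaultCity_4"),
        (3869438, "2019-12-16 16:22:46", "300 DefaultCity_3"),
        (3487645, "2019-12-17 16:12:53", "800 DefaultCity_8"),
        (3345935, "2019-12-17 19:11:24", "200 DefaultCity_2"),
        (3235128, "2019-12-17 16:22:46", "300 DefaultCity_3"),
    ],
    columns=["EmployeeID", "DepartureTime", "Destination"],
)

# Lookup table that binds IDs to EmployeeID.
df_info = pd.DataFrame(
    [
        (3027199, "Alice Doe", 13519),
        (3022569, "Bob Doe", 10977),
        (3672468, "Karl Doe", 13329),
        (3027419, "Mark Doe", 14597),
        (3045129, "Jenna Doe", 16899),
        (3869438, "Victoria Doe", 14616),
        (3487645, "Vladimir Doe", 12519),
        (3345935, "Kenny Doe", 11347),
        (3235128, "Heather Doe", 14403),
    ],
    columns=["EmployeeID", "name", "IDs"],
)

# Parse the timestamp strings into real datetimes.
df_booking["DepartureTime"] = pd.to_datetime(df_booking["DepartureTime"])
df_actual["DepartureTime"] = pd.to_datetime(df_actual["DepartureTime"])
```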
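import numpy as np
import pandas as pd

# Give the time and destination columns source-specific names, then attach the employee details.
df_booking.rename(columns={'DepartureTime': 'DepartureTime_booking', 'Destination': 'Destination_booking'}, inplace=True)
df_booking = df_booking.merge(df_info, on='IDs')
df_actual.rename(columns={'DepartureTime': 'DepartureTime_actual', 'Destination': 'Destination_actual'}, inplace=True)
df_actual = df_actual.merge(df_info, on='EmployeeID')
# Pair each actual trip with that employee's booking; duplicated columns get the '_y' suffix.
df_anomoly = df_actual.merge(df_booking, on='EmployeeID', how='inner', suffixes=('', '_y'))
# Compare the numeric destination codes (expand=False makes str.extract return a Series).
df_anomoly['diff_dest'] = np.where(df_anomoly['Destination_actual'].str.extract(r'(\d+)', expand=False) != df_anomoly['Destination_booking'].str.extract(r'(\d+)', expand=False), 'Yes', 'No')
# Compare the departure times after rounding both down to the half hour.
df_anomoly['diff_time'] = np.where(pd.to_datetime(df_anomoly['DepartureTime_actual']).dt.floor('30min') != pd.to_datetime(df_anomoly['DepartureTime_booking']).dt.floor('30min'), 'Yes', 'No')
# Drop the duplicated '_y' columns left over from the merge.
df_anomoly.drop(columns=list(df_anomoly.filter(regex='_y$')), inplace=True)
print(df_anomoly)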
Output
   EmployeeID  DepartureTime_actual  Destination_actual  name          IDs    DepartureTime_booking  Destination_booking  diff_dest  diff_time
0  3027199     12/15/2019 16:12      800 DefaultCity_8   Alice Doe     13519  12/15/2019 16:15       100 DefaultCity      Yes        No
1  3022569     12/15/2019 19:11      200 DefaultCity_2   Bob Doe       10977  12/15/2019 16:15       200 DefaultCity_2    No         Yes
2  3672468     12/15/2019 16:22      300 DefaultCity_3   Karl Doe      13329  12/15/2019 16:15       300 DefaultCity_3    No         No
3  3027419     12/16/2019 16:12      800 DefaultCity_8   Mark Doe      14597  12/16/2019 16:15       200 DefaultCity_2    Yes        No
4  3045129     12/16/2019 16:11      400 DefaultCity_4   Jenna Doe     16899  12/16/2019 16:15       400 DefaultCity_4    No         No
5  3869438     12/16/2019 16:22      300 DefaultCity_3   Victoria Doe  14616  12/16/2019 16:15       300 DefaultCity_3    No         No
6  3487645     12/17/2019 16:12      800 DefaultCity_8   Vladimir Doe  12519  12/17/2019 16:15       800 DefaultCity_8    No         No
7  3345935     12/17/2019 19:11      200 DefaultCity_2   Kenny Doe     11347  12/17/2019 16:15       200 DefaultCity_2    No         Yes
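Note that the inner merge above silently drops Heather Doe (EmployeeID 3235128), who rode on 2019-12-17 without any booking, and that merging on EmployeeID alone would pair every actual trip with every booking of the same employee once the data spans several days. A possible variant, sketched here under assumptions, merges on both EmployeeID and the calendar date, uses a left merge with `indicator=True` to flag riders without a booking, and compares times by absolute difference instead of `floor`. The 30-minute tolerance, the `date` helper column and the `report` and `no_booking` names are illustrative choices, not part of the original answer, and destinations are compared as whole strings rather than by numeric code.

```python
import pandas as pd

# Small subset of the sample data, with columns already renamed.
df_info = pd.DataFrame({
    "EmployeeID": [3027199, 3022569, 3235128],
    "name": ["Alice Doe", "Bob Doe", "Heather Doe"],
    "IDs": [13519, 10977, 14403],
})
df_booking = pd.DataFrame({
    "IDs": [13519, 10977],
    "DepartureTime_booking": pd.to_datetime(
        ["2019-12-15 16:15:00", "2019-12-15 16:15:00"]
    ),
    "Destination_booking": ["100 DefaultCity", "200 DefaultCity_2"],
})
df_actual = pd.DataFrame({
    "EmployeeID": [3027199, 3022569, 3235128],
    "DepartureTime_actual": pd.to_datetime(
        ["2019-12-15 16:12:53", "2019-12-15 19:11:24", "2019-12-17 16:22:46"]
    ),
    "Destination_actual": [
        "800 DefaultCity_8", "200 DefaultCity_2", "300 DefaultCity_3"
    ],
})

# Translate booking IDs into EmployeeID and add a date-only key to both frames.
booked = df_booking.merge(df_info[["EmployeeID", "IDs"]], on="IDs")
booked["date"] = booked["DepartureTime_booking"].dt.normalize()
df_actual["date"] = df_actual["DepartureTime_actual"].dt.normalize()

# Left merge keeps every actual trip; _merge records whether a booking matched.
report = df_actual.merge(
    booked, on=["EmployeeID", "date"], how="left", indicator=True
)
report["no_booking"] = report["_merge"].eq("left_only")

# Flag trips that departed more than the (assumed) tolerance away from the booking.
tolerance = pd.Timedelta(minutes=30)
delta = (report["DepartureTime_actual"] - report["DepartureTime_booking"]).abs()
report["diff_time"] = delta > tolerance  # missing bookings compare as False

# Flag destination mismatches only where a booking exists.
report["diff_dest"] = (
    report["Destination_actual"].ne(report["Destination_booking"])
    & ~report["no_booking"]
)

print(report[["EmployeeID", "no_booking", "diff_dest", "diff_time"]])
```

An absolute tolerance avoids the boundary effect of `floor("30min")`, where 16:29 and 16:31 land in different half-hour bins even though they are only two minutes apart.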