Merging Pandas DataFrames by partial match in a datetime column

Question

Hello good people of Stack Overflow. I can't quite grasp the solution here, so please help me out. Please keep in mind that I'm quite a beginner at Python, so keep it as simple as you can.

My company provides employees with transportation to and from work. There is a system in place that tracks when an employee got on the bus and which bus the person got onto. We also receive data from the transportation company with information on where and when employees were supposed to go as per planning (every employee books a spot in advance). Sometimes people don't book a place, and sometimes they get onto the wrong bus (not the route they booked) or travel at the wrong time. My goal is to find such people and provide a report.

Here is a sample of the data we receive from the transportation company:

IDs     DepartureTime               Destination
13519   2019-12-15 16:15:00.000000  100 DefaultCity
10977   2019-12-15 16:15:00.000000  200 DefaultCity_2
13329   2019-12-15 16:15:00.000000  300 DefaultCity_3
14597   2019-12-16 16:15:00.000000  200 DefaultCity_2
16899   2019-12-16 16:15:00.000000  400 DefaultCity_4
14616   2019-12-16 16:15:00.000000  300 DefaultCity_3
12519   2019-12-17 16:15:00.000000  800 DefaultCity_8
11347   2019-12-17 16:15:00.000000  200 DefaultCity_2

Here is a sample of the factual data we receive from the tracking system:

EmployeeID     DepartureTime                Destination
3027199        2019-12-15 16:12:53.000000   800 DefaultCity_8
3022569        2019-12-15 19:11:24.000000   200 DefaultCity_2
3672468        2019-12-15 16:22:46.000000   300 DefaultCity_3
3027419        2019-12-16 16:12:53.000000   800 DefaultCity_8
3045129        2019-12-16 16:11:24.000000   400 DefaultCity_4
3869438        2019-12-16 16:22:46.000000   300 DefaultCity_3
3487645        2019-12-17 16:12:53.000000   800 DefaultCity_8
3345935        2019-12-17 19:11:24.000000   200 DefaultCity_2
3235128        2019-12-17 16:22:46.000000   300 DefaultCity_3

I also have an SQL table that helps me bind IDs to EmployeeID:

EmployeeID     name                  IDs
3027199        Alice Doe             13519  
3022569        Bob Doe               10977  
3672468        Karl Doe              13329  
3027419        Mark Doe              14597  
3045129        Jenna Doe             16899  
3869438        Victoria Doe          14616 
3487645        Vladimir Doe          12519  
3345935        Kenny Doe             11347  
3235128        Heather Doe           14403

It is worth mentioning that "planned" data is present for every working date, but "factual" data is not, since the company only performs spontaneous spot checks.
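One way to deal with the missing spot-check dates, as a minimal sketch: collect the calendar dates that occur in "factual" and keep only the "planned" rows that fall on those dates. The frame names `planned` and `factual` and the miniature data are assumptions made up for illustration; only the column names come from the samples above.

```python
import pandas as pd

# Hypothetical miniature versions of the two feeds (column names as in the samples).
planned = pd.DataFrame({
    "IDs": [13519, 14597, 12519],
    "DepartureTime": pd.to_datetime(
        ["2019-12-15 16:15:00", "2019-12-16 16:15:00", "2019-12-17 16:15:00"]
    ),
})
factual = pd.DataFrame({
    "EmployeeID": [3027199, 3487645],
    "DepartureTime": pd.to_datetime(
        ["2019-12-15 16:12:53", "2019-12-17 16:12:53"]
    ),
})

# Calendar dates (timestamps truncated to midnight) on which a spot check happened.
checked_dates = factual["DepartureTime"].dt.normalize().unique()

# Keep only the planned rows whose date was actually checked.
on_checked_date = planned["DepartureTime"].dt.normalize().isin(checked_dates)
planned_checked = planned[on_checked_date]
print(planned_checked)
```

In this made-up data no check took place on 2019-12-16, so the booking for that date is dropped before any comparison.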

What I have managed so far:

Filter the "planned" and "factual" data within a certain date range using pyjanitor + pandas and the df.filter_date function
Merge names, IDs and EmployeeIDs

What I'm struggling to do:

Merge "planned" with "factual" without including dates that are present in "planned" but absent in "factual"
Actually find the people with a mismatch in time or destination between the "planned" and "factual" data. Please note that when comparing the two I want to treat a time frame, say 16:01 - 16:29, as 16:15, and show only people who got onto the bus at a different hour.
Find people who didn't book at all. There will be no data about them in "planned", but there will be in "factual"
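The time-frame rule in the second point can be written as a tolerance test, sketched below under assumptions: a boarding counts as on time when it is less than 15 minutes away from the planned departure, which approximates the 16:01 - 16:29 window around 16:15. The helper `same_slot` and the constant `TOLERANCE` are hypothetical names, not part of pandas.

```python
import pandas as pd

# Assumption: "16:01 - 16:29 counts as 16:15" means less than 15 minutes either way.
TOLERANCE = pd.Timedelta(minutes=15)

def same_slot(actual, planned, tolerance=TOLERANCE):
    """Return a boolean Series: True where actual is within tolerance of planned."""
    return (actual - planned).abs() < tolerance

planned = pd.Series(pd.to_datetime(["2019-12-15 16:15:00"] * 3))
actual = pd.Series(pd.to_datetime([
    "2019-12-15 16:12:53",  # about 2 minutes early
    "2019-12-15 16:22:46",  # about 8 minutes late
    "2019-12-15 19:11:24",  # almost 3 hours late
]))

on_time = same_slot(actual, planned)
print(on_time.tolist())  # [True, True, False]
```

Unlike rounding to fixed clock slots, this compares each boarding with its own planned time, so it also works for departures that are not scheduled at a quarter past the hour.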

Expected output

I'll be glad to provide any additional info that may help. Thank you in advance.

Answer 1

You can do it as follows.

In the code below, the first df is named df_booking, the second df is named df_actual, and the SQL table is named df_info.

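import numpy as np
import pandas as pd

# Label each source's columns, then attach the ID lookup table
df_booking = df_booking.rename(columns={'DepartureTime': 'DepartureTime_booking', 'Destination': 'Destination_booking'})
df_booking = df_booking.merge(df_info, on='IDs')
df_actual = df_actual.rename(columns={'DepartureTime': 'DepartureTime_actual', 'Destination': 'Destination_actual'})
df_actual = df_actual.merge(df_info, on='EmployeeID')

# Pair each actual trip with that employee's booking; duplicate lookup columns get '_y'
df_anomoly = df_actual.merge(df_booking, on='EmployeeID', how='inner', suffixes=('', '_y'))

# Compare leading route numbers (expand=False makes str.extract return a Series)
dest_actual = df_anomoly['Destination_actual'].str.extract(r'(\d+)', expand=False)
dest_booking = df_anomoly['Destination_booking'].str.extract(r'(\d+)', expand=False)
df_anomoly['diff_dest'] = np.where(dest_actual != dest_booking, 'Yes', 'No')

# Compare departure times after flooring both to 30-minute slots
slot_actual = pd.to_datetime(df_anomoly['DepartureTime_actual']).dt.floor('30min')
slot_booking = pd.to_datetime(df_anomoly['DepartureTime_booking']).dt.floor('30min')
df_anomoly['diff_time'] = np.where(slot_actual != slot_booking, 'Yes', 'No')

df_anomoly = df_anomoly.drop(columns=list(df_anomoly.filter(regex='_y$')))
print(df_anomoly)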

Output

   EmployeeID  DepartureTime_actual  Destination_actual  name          IDs    DepartureTime_booking  Destination_booking  diff_dest  diff_time
0  3027199     12/15/2019 16:12      800 DefaultCity_8   Alice Doe     13519  12/15/2019 16:15       100 DefaultCity      Yes        No
1  3022569     12/15/2019 19:11      200 DefaultCity_2   Bob Doe       10977  12/15/2019 16:15       200 DefaultCity_2    No         Yes
2  3672468     12/15/2019 16:22      300 DefaultCity_3   Karl Doe      13329  12/15/2019 16:15       300 DefaultCity_3    No         No
3  3027419     12/16/2019 16:12      800 DefaultCity_8   Mark Doe      14597  12/16/2019 16:15       200 DefaultCity_2    Yes        No
4  3045129     12/16/2019 16:11      400 DefaultCity_4   Jenna Doe     16899  12/16/2019 16:15       400 DefaultCity_4    No         No
5  3869438     12/16/2019 16:22      300 DefaultCity_3   Victoria Doe  14616  12/16/2019 16:15       300 DefaultCity_3    No         No
6  3487645     12/17/2019 16:12      800 DefaultCity_8   Vladimir Doe  12519  12/17/2019 16:15       800 DefaultCity_8    No         No
7  3345935     12/17/2019 19:11      200 DefaultCity_2   Kenny Doe     11347  12/17/2019 16:15       200 DefaultCity_2    No         Yes
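Note that Heather Doe (EmployeeID 3235128) does not appear in this output: she has no booking, and an inner merge drops her row. A possible extension, sketched here and not part of the answer above, is to merge on the calendar date as well as the employee, and to use a left merge with indicator=True so that tracked trips without a booking stay visible. The three small frames are a cut-down copy of the sample data, and the helper column `date` is an assumption introduced for illustration.

```python
import pandas as pd

# Cut-down copies of the sample tables, with columns already renamed as in the answer.
df_info = pd.DataFrame({
    "EmployeeID": [3027199, 3022569, 3235128],
    "name": ["Alice Doe", "Bob Doe", "Heather Doe"],
    "IDs": [13519, 10977, 14403],
})
df_booking = pd.DataFrame({
    "IDs": [13519, 10977],
    "DepartureTime_booking": pd.to_datetime(
        ["2019-12-15 16:15:00", "2019-12-15 16:15:00"]
    ),
    "Destination_booking": ["100 DefaultCity", "200 DefaultCity_2"],
})
df_actual = pd.DataFrame({
    "EmployeeID": [3027199, 3022569, 3235128],
    "DepartureTime_actual": pd.to_datetime(
        ["2019-12-15 16:12:53", "2019-12-15 19:11:24", "2019-12-17 16:22:46"]
    ),
    "Destination_actual": ["800 DefaultCity_8", "200 DefaultCity_2", "300 DefaultCity_3"],
})

# Give bookings an EmployeeID and a calendar-date key (hypothetical helper column).
booking = df_booking.merge(df_info[["EmployeeID", "IDs"]], on="IDs")
booking["date"] = booking["DepartureTime_booking"].dt.normalize()
actual = df_actual.assign(date=df_actual["DepartureTime_actual"].dt.normalize())

# Left merge keeps every tracked trip; the _merge column records whether a booking matched.
merged = actual.merge(booking, on=["EmployeeID", "date"], how="left", indicator=True)

# Trips with no booking for that employee on that date.
no_booking = merged.loc[
    merged["_merge"] == "left_only",
    ["EmployeeID", "DepartureTime_actual", "Destination_actual"],
].merge(df_info[["EmployeeID", "name"]], on="EmployeeID", how="left")
print(no_booking)
```

Matching on the date as well keeps an employee's booking on one day from being paired with a trip on another day once the data covers more than one booking per person.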

Answer 1: accepted, 2020-02-09 11:56:53
