[英]How to merge two dataframes based on a date condition from two different date columns in each dataframe?
I have two Dataframes in the form:我有两个数据框的形式:
Dataframe(df1):数据框(df1):
P_CLIENT_ID ![]() |
P_DATE_ENCOUNTER ![]() |
---|---|
25835 ![]() |
2016-12-21 ![]() |
25835 ![]() |
2017-02-21 ![]() |
25835 ![]() |
2017-04-25 ![]() |
25835 ![]() |
2017-06-21 ![]() |
25835 ![]() |
2017-09-04 ![]() |
25835 ![]() |
2018-01-08 ![]() |
25835 ![]() |
2018-04-03 ![]() |
Dataframe(df2):数据框(df2):
R_CLIENT_ID ![]() |
R_DATE_TESTED ![]() |
R_RESULT ![]() |
---|---|---|
25835 ![]() |
2017-03-07 ![]() |
20.0 ![]() |
25835 ![]() |
2017-08-03 ![]() |
20.0 ![]() |
25835 ![]() |
2018-03-23 ![]() |
20.0 ![]() |
25835 ![]() |
2019-06-28 ![]() |
20.0 ![]() |
25835 ![]() |
2019-08-19 ![]() |
42.0 ![]() |
25835 ![]() |
2020-04-20 ![]() |
40.0 ![]() |
25835 ![]() |
2021-06-03 ![]() |
20.0 ![]() |
I want to merge df2 onto df1 (main table) with the join key being P_CLIENT_ID
and R_CLIENT_ID
appending the most recent R_DATE_TESTED
and R_RESULT
我想将 df2 合并到 df1(主表)上,连接键为
P_CLIENT_ID
和R_CLIENT_ID
,附加最新的R_DATE_TESTED
和R_RESULT
First Condition: If R_DATE_TESTED > P_DATE_ENCOUNTER
then nullify the R_DATE_TESTED, R_RESULT
fields.第一个条件:如果
R_DATE_TESTED > P_DATE_ENCOUNTER
则使R_DATE_TESTED, R_RESULT
字段无效。
Second Condition: If R_DATE_TESTED < P_DATE_ENCOUNTER
then appned the most recent R_DATE_TESTED, R_RESULT
fields to the dataframe with the final results as:第二个条件:如果
R_DATE_TESTED < P_DATE_ENCOUNTER
然后将最近的R_DATE_TESTED, R_RESULT
字段附加到 dataframe,最终结果为:
The result from logic should be as below:逻辑结果应如下所示:
P_CLIENT_ID ![]() |
R_CLIENT_ID ![]() |
P_DATE_ENCOUNTER ![]() |
R_DATE_TESTED ![]() |
R_RESULT ![]() |
---|---|---|---|---|
25835 ![]() |
25835.0 ![]() |
2016-12-21 ![]() |
NaN![]() |
NaN![]() |
25835 ![]() |
25835.0 ![]() |
2017-02-21 ![]() |
NaN![]() |
NaN![]() |
25835 ![]() |
25835.0 ![]() |
2017-04-25 ![]() |
2017-03-07 ![]() |
20.0 ![]() |
25835 ![]() |
25835.0 ![]() |
2017-06-21 ![]() |
2017-03-07 ![]() |
20.0 ![]() |
25835 ![]() |
25835.0 ![]() |
2017-09-04 ![]() |
2017-08-03 ![]() |
20.0 ![]() |
25835 ![]() |
25835.0 ![]() |
2018-01-08 ![]() |
2017-08-03 ![]() |
20.0 ![]() |
25835 ![]() |
25835.0 ![]() |
2018-04-03 ![]() |
2018-03-23 ![]() |
20.0 ![]() |
Note: The actual dataset is quite large: df1 ~ 700000 rows and df2 ~ 125000 rows注意:实际数据集相当大:df1 ~ 700000 行和 df2 ~ 125000 行
Code attempt代码尝试
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'P_CLIENT_D': ['25835','25835','25835','25835','25835','25835','25835'],
'P_DATE_ENCOUNTER': ['2016-12-21','2017-02-21','2017-04-25','2017-06-21','2017-09-04','2018-01-08','2018-04-03']})
df2 = pd.DataFrame({'R_CLIENT_ID': ['25835','25835','25835','25835','25835','25835','25835'],
'R_DATE_TESTED': ['2017-03-07','2017-08-03','2018-03-23','2019-06-28','2019-08-19','2020-04-20','2021-06-03'],
'R_RESULT':[20,20,20,20,42,40,20]})
df_merged = pd.merge(df1, df2, left_on=['P_CLIENT_D'], right_on = ['R_CLIENT_ID'], how='left')
df_merged = df_merged.drop_duplicates(subset=['P_CLIENT_D', 'P_DATE_ENCOUNTER'], keep='last')
df_merged['FLAG_LAB_AFTER_VISIT'] = 0
df_merged.loc[df_merged.R_DATE_TESTED >= df_merged.P_DATE_ENCOUNTER,'FLAG_LAB_AFTER_VISIT']=1
print(df_merged['FLAG_LAB_AFTER_VISIT'].sum(), 'future labs set to null')
#now the rows with flags - set all lab fields to null
df_merged.loc[df_merged['FLAG_LAB_AFTER_VISIT']==1, df2.columns] = np.nan
Try with pandas.merge_asof
:尝试使用
pandas.merge_asof
:
>>> pd.merge_asof(df1,
df2,
left_on="P_DATE_ENCOUNTER",
right_on="R_DATE_TESTED",
left_by="P_CLIENT_ID",
right_by="R_CLIENT_ID")
P_CLIENT_ID P_DATE_ENCOUNTER R_CLIENT_ID R_DATE_TESTED R_RESULT
0 25835 2016-12-21 NaN NaT NaN
1 25835 2017-02-21 NaN NaT NaN
2 25835 2017-04-25 25835.0 2017-03-07 20.0
3 25835 2017-06-21 25835.0 2017-03-07 20.0
4 25835 2017-09-04 25835.0 2017-08-03 20.0
5 25835 2018-01-08 25835.0 2017-08-03 20.0
6 25835 2018-04-03 25835.0 2018-03-23 20.0
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.