How to do a lookup across two pandas DataFrames and update a column in the first from a column in the second?
I have 2 dataframes, df and df1.
df:
|system_time|status|id|date|
|2022-03-04T07:52:26Z|Pending|772|2022-03-04 07:52:26+00:00|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42+00:00|
|2022-08-13T01:34:44Z|Pending|1052|2022-08-13 01:34:44+00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000+00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000+00:00|
|2022-03-04T17:51:15.025Z|Pending|772|2022-03-04 17:51:15.025000+00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000+00:00|
df1:
|id|task_status|
|1052|Complete|
|889|Pending|
|772|Complete|
|963|Pending|
Column types in df:
system_time - object
status - object
id - int64
date - object
I want to apply a lookup from df into df1. If an id matches in df and df1, status in df should take the value of task_status from df1. As there are duplicate records in df, I need to keep only the latest record per id and update its status from df1; for unmatched ids, the status should stay the same as in df.
In df, I have converted system_time into the date column using:
df['date']=pd.to_datetime(df['system_time'])
Expected output:
|system_time|status|id|date|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42+00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000+00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000+00:00|
|2022-03-04T17:51:15.025Z|Complete|772|2022-03-04 17:51:15.025000+00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000+00:00|
Here is one way to do it using map:
# map the df1 status to df when the id is found (ids missing from df1 become NaN)
df['status']=df['id'].map(df1.set_index(['id'])['task_status'])
df
system_time status id date
0 2022-03-04T07:52:26Z Complete 772 2022-03-04 07:52:26+00:00
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42+00:00
2 2022-08-13T01:34:44Z Complete 1052 2022-08-13 01:34:44+00:00
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000+00:00
4 2022-08-14T06:04:54.736Z NaN 1053 2022-08-14 06:04:54.736000+00:00
5 2022-03-04T17:51:15.025Z Complete 772 2022-03-04 17:51:15.025000+00:00
6 2022-08-24T06:24:54.736Z NaN 999 2022-08-24 06:24:54.736000+00:00
Alternatively, if you would like to update only when the status is found in df1:
df['status']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).notna()),
(df['id'].map(df1.set_index(['id'])['task_status'])) )
df
system_time status id date
0 2022-03-04T07:52:26Z Complete 772 2022-03-04 07:52:26+00:00
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42+00:00
2 2022-08-13T01:34:44Z Complete 1052 2022-08-13 01:34:44+00:00
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000+00:00
4 2022-08-14T06:04:54.736Z Pending 1053 2022-08-14 06:04:54.736000+00:00
5 2022-03-04T17:51:15.025Z Complete 772 2022-03-04 17:51:15.025000+00:00
6 2022-08-24T06:24:54.736Z Inprogress 999 2022-08-24 06:24:54.736000+00:00
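The same "update only when found" step can probably be written more compactly by building the lookup once and falling back to the old status with fillna. This is a minimal sketch rather than the answer's own code; it rebuilds only the id and status columns from the question and assumes ids in df1 are unique. The name `lookup` is just an illustrative variable.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Build the id -> task_status lookup once (assumes ids in df1 are unique).
lookup = df1.set_index("id")["task_status"]

# map() gives NaN for ids missing from df1; fillna() restores the old status.
df["status"] = df["id"].map(lookup).fillna(df["status"])
print(df)
```

Computing the lookup a single time also avoids repeating the set_index call for the condition and for the replacement values.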
For unmatched ids:
# Set the 'unmatched' column to 'unmatched' when the id is NOT found in df1.
# When it is found, keep the status as-is.
# You may want to keep the previous step and this one together.
df['unmatched']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).isna()),
'unmatched' )
df
system_time status id date unmatched
0 2022-03-04T07:52:26Z Complete 772 2022-03-04 07:52:26+00:00 Complete
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42+00:00 Pending
2 2022-08-13T01:34:44Z Complete 1052 2022-08-13 01:34:44+00:00 Complete
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000+00:00 Complete
4 2022-08-14T06:04:54.736Z Pending 1053 2022-08-14 06:04:54.736000+00:00 unmatched
5 2022-03-04T17:51:15.025Z Complete 772 2022-03-04 17:51:15.025000+00:00 Complete
6 2022-08-24T06:24:54.736Z Inprogress 999 2022-08-24 06:24:54.736000+00:00 unmatched
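If all you need is a flag for whether an id exists in df1, a boolean column built with isin may be simpler than masking the status. A small sketch, using the ids from the question; the column name `in_df1` is made up for illustration and is not part of the original answer.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Hypothetical column name: True when the row's id also appears in df1.
df["in_df1"] = df["id"].isin(df1["id"])

# Rows whose id has no counterpart in df1.
print(df[~df["in_df1"]])
```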
To keep the last row per id based on system_time:
df.sort_values('system_time').drop_duplicates(subset=['id'], keep='last')
system_time status id date unmatched
5 2022-03-04T17:51:15.025Z Pending 772 2022-03-04 17:51:15.025000+00:00 Pending
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42+00:00 Pending
4 2022-08-14T06:04:54.736Z Pending 1053 2022-08-14 06:04:54.736000+00:00 unmatched
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000+00:00 Complete
6 2022-08-24T06:24:54.736Z Inprogress 999 2022-08-24 06:24:54.736000+00:00 unmatched
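Putting the pieces together, one possible end-to-end sketch is: parse the timestamps, keep the latest row per id, then apply the lookup. It assumes ids in df1 are unique, and it sorts on the parsed date rather than on the raw string so the order does not depend on string comparison. Each value is parsed with pd.Timestamp individually here, which is my own choice for handling the mixed precision in system_time; `latest` and `lookup` are illustrative names.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "system_time": [
        "2022-03-04T07:52:26Z", "2022-06-22T17:52:42Z",
        "2022-08-13T01:34:44Z", "2022-08-24T01:46:31.115Z",
        "2022-08-14T06:04:54.736Z", "2022-03-04T17:51:15.025Z",
        "2022-08-24T06:24:54.736Z",
    ],
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Parse each timestamp on its own, so values with and without
# fractional seconds are both handled.
df["date"] = df["system_time"].map(pd.Timestamp)

# Keep only the most recent row per id.
latest = df.sort_values("date").drop_duplicates(subset="id", keep="last")

# Overwrite status where df1 has the id; otherwise keep the existing status.
lookup = df1.set_index("id")["task_status"]
latest = latest.assign(status=latest["id"].map(lookup).fillna(latest["status"]))

# Restore the original row order for display.
print(latest.sort_index())
```

On the sample data this keeps one row each for ids 963, 1052, 1053, 772 and 999, in the same order as the expected output in the question.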
You need to use merge:
df = df.merge(right=df1, on='id',how='left')
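Note that the left merge on its own only adds a task_status column next to status. To actually update status, something like the following sketch could follow the merge; it assumes ids in df1 are unique (otherwise the merge would duplicate rows), and `merged` is an illustrative name.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Left merge keeps every row of df; task_status is NaN for unmatched ids.
merged = df.merge(df1, on="id", how="left")

# Prefer task_status where present, otherwise keep the original status.
merged["status"] = merged["task_status"].fillna(merged["status"])

# Drop the helper column so the frame has the original shape again.
merged = merged.drop(columns="task_status")
print(merged)
```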