How to look up values across two pandas dataframes and update a column in the first dataframe from a column in the other?

Question

I have 2 dataframes, df and df1:

df:

|system_time|status|id|date|
|2022-03-04T07:52:26Z|Pending|772|2022-03-04 07:52:26+00:00|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42+00:00|
|2022-08-13T01:34:44Z|Pending|1052|2022-08-13 01:34:44+00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000+00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000+00:00|
|2022-03-04T17:51:15.025Z|Pending|772|2022-03-04 17:51:15.025000+00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000+00:00|

df1:

|id|task_status|
|1052|Complete|
|889|Pending|
|772|Complete|
|963|Pending|

Column types in df:

system_time  - object
status - object
id    - int64
date   - object

I want to apply a lookup from df into df1. If an id matches in df and df1, the status in df should be the task_status from df1. Because df contains duplicate records per id, I need to keep only the latest record and update its status from df1; for ids with no match in df1, the status should stay as it is in df. In df, I converted system_time into the date column using:

df['date']=pd.to_datetime(df['system_time'])

Expected output:

|system_time|status|id|date|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42+00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000+00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000+00:00|
|2022-03-04T17:51:15.025Z|Complete|772|2022-03-04 17:51:15.025000+00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000+00:00|

Answer 1

Here is one way to do it using map:

# map the df1 status onto df by id; ids missing from df1 become NaN
df['status']=df['id'].map(df1.set_index('id')['task_status'])
df

    system_time     status  id  date
0   2022-03-04T07:52:26Z    Complete    772     2022-03-04 07:52:26+00:00
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42+00:00
2   2022-08-13T01:34:44Z    Complete    1052    2022-08-13 01:34:44+00:00
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000+00:00
4   2022-08-14T06:04:54.736Z    NaN     1053    2022-08-14 06:04:54.736000+00:00
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000+00:00
6   2022-08-24T06:24:54.736Z    NaN     999     2022-08-24 06:24:54.736000+00:00

Alternatively, if you want to update the status only when the id is found in df1:

df['status']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).notna()), 
                           (df['id'].map(df1.set_index(['id'])['task_status'])) )
df

    system_time     status  id  date
0   2022-03-04T07:52:26Z    Complete    772     2022-03-04 07:52:26+00:00
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42+00:00
2   2022-08-13T01:34:44Z    Complete    1052    2022-08-13 01:34:44+00:00
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000+00:00
4   2022-08-14T06:04:54.736Z    Pending     1053    2022-08-14 06:04:54.736000+00:00
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000+00:00
6   2022-08-24T06:24:54.736Z    Inprogress  999     2022-08-24 06:24:54.736000+00:00
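
The mask call above computes the same lookup twice. A minimal sketch of an equivalent form, assuming the sample data from the question (rebuilt here with only the status and id columns; the name lookup is introduced purely for illustration): map once, then fall back to the existing status wherever the lookup returned NaN.

```python
import pandas as pd

# Sample data from the question (only the columns needed for the lookup).
df = pd.DataFrame({
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Build the id -> task_status lookup once ("lookup" is a hypothetical name).
lookup = df1.set_index("id")["task_status"]

# Ids missing from df1 map to NaN; fillna restores the original status there.
df["status"] = df["id"].map(lookup).fillna(df["status"])
print(df)
```

This relies on the ids in df1 being unique, as they are in the sample; with duplicate ids in df1 the lookup would be ambiguous and would need deduplicating first.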

For unmatched ids:

# Set the 'unmatched' column to 'unmatched' when the id is NOT found in df1.
# When it is found, keep the status as-is.
# You may want to keep the previous step and this one together.

df['unmatched']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).isna()), 
                           'unmatched' )
df

system_time     status  id  date    unmatched
0   2022-03-04T07:52:26Z    Complete    772     2022-03-04 07:52:26+00:00   Complete
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42+00:00   Pending
2   2022-08-13T01:34:44Z    Complete    1052    2022-08-13 01:34:44+00:00   Complete
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000+00:00    Complete
4   2022-08-14T06:04:54.736Z    Pending     1053    2022-08-14 06:04:54.736000+00:00    unmatched
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000+00:00    Complete
6   2022-08-24T06:24:54.736Z    Inprogress  999     2022-08-24 06:24:54.736000+00:00    unmatched
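
If only a yes/no flag is needed rather than a text label, a membership test may be simpler than mapping and checking for NaN. A small sketch, assuming the ids from the question; the column name matched is hypothetical and not part of the original answer.

```python
import pandas as pd

# Ids from the question's two frames.
df = pd.DataFrame({"id": [772, 963, 1052, 1052, 1053, 772, 999]})
df1 = pd.DataFrame({"id": [1052, 889, 772, 963]})

# True where the id has a counterpart in df1, False otherwise.
df["matched"] = df["id"].isin(df1["id"])
print(df)
```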

To keep only the last row per id, based on system_time:

df.sort_values('system_time').drop_duplicates(subset=['id'], keep='last')

system_time     status  id  date    unmatched
5   2022-03-04T17:51:15.025Z    Pending     772     2022-03-04 17:51:15.025000+00:00    Pending
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42+00:00   Pending
4   2022-08-14T06:04:54.736Z    Pending     1053    2022-08-14 06:04:54.736000+00:00    unmatched
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000+00:00    Complete
6   2022-08-24T06:24:54.736Z    Inprogress  999     2022-08-24 06:24:54.736000+00:00    unmatched
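
Putting the steps together, the following is a sketch of one possible end-to-end pipeline under the question's sample data, not the answerer's exact code. Two choices here are assumptions: the status is updated before deduplicating, and rows are sorted by parsed timestamps rather than by the raw strings, since string order can differ when some values carry fractional seconds and others do not. Each string is parsed individually with pd.Timestamp so the result does not depend on how a given pandas version infers a single format for the whole column.

```python
import pandas as pd

df = pd.DataFrame({
    "system_time": [
        "2022-03-04T07:52:26Z", "2022-06-22T17:52:42Z",
        "2022-08-13T01:34:44Z", "2022-08-24T01:46:31.115Z",
        "2022-08-14T06:04:54.736Z", "2022-03-04T17:51:15.025Z",
        "2022-08-24T06:24:54.736Z",
    ],
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Parse every timestamp on its own (mixed precision in the sample data).
df["date"] = [pd.Timestamp(s) for s in df["system_time"]]

# Update status from df1 where the id matches, otherwise keep it.
lookup = df1.set_index("id")["task_status"]
df["status"] = df["id"].map(lookup).fillna(df["status"])

# Keep only the most recent row per id.
result = df.sort_values("date").drop_duplicates(subset="id", keep="last")
print(result)
```

The rows in result carry the same ids and statuses as the expected output in the question, though ordered by date rather than in the order shown there.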

Answer 2

You need to use merge:

df = df.merge(right=df1, on='id',how='left')
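
On its own, the left merge only adds a task_status column next to status; it does not overwrite status. A hedged sketch of how the update might be completed, assuming the question's sample data and unique ids in df1 (duplicate ids in df1 would multiply rows in the merge); the name merged is introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Left merge keeps every row of df; unmatched ids get NaN in task_status.
merged = df.merge(df1, on="id", how="left")

# Prefer task_status where present, otherwise keep the existing status.
merged["status"] = merged["task_status"].fillna(merged["status"])

# Drop the helper column once the status has been updated.
merged = merged.drop(columns="task_status")
print(merged)
```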

Answer 1
1, accepted, 2022-10-06 15:34:57

Answer 2
0, 2022-10-06 15:21:52
