How to do a lookup across two pandas DataFrames and update a column in the first DataFrame from a column in the second?

I have two DataFrames, df and df1.

df:

|system_time|status|id|date|
|2022-03-04T07:52:26Z|Pending|772|2022-03-04 07:52:26+00:00|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42+00:00|
|2022-08-13T01:34:44Z|Pending|1052|2022-08-13 01:34:44+00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000+00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000+00:00|
|2022-03-04T17:51:15.025Z|Pending|772|2022-03-04 17:51:15.025000+00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000+00:00|

df1:

|id|task_status|
|1052|Complete|
|889|Pending|
|772|Complete|
|963|Pending|

Column types in df:

system_time  - object
status - object
id    - int64
date   - object

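I want to look up df against df1. Where an id in df matches an id in df1, the status in df should be replaced with the task_status from df1. Because df contains duplicate ids, only the latest record for each id should be kept and then updated with the status from df1; for ids with no match, the status from df should stay as it is. In df, I created the date column from system_time with: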

df['date']=pd.to_datetime(df['system_time'])

Expected output:

|system_time|status|id|date|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42+00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000+00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000+00:00|
|2022-03-04T17:51:15.025Z|Complete|772|2022-03-04 17:51:15.025000+00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000+00:00|

Here is one way to do it, using map. Note that ids with no match in df1 end up as NaN:

# map the df1 task_status onto df wherever the id is found
df['status']=df['id'].map(df1.set_index(['id'])['task_status'])
df
    system_time     status  id  date
0   2022-03-04T07:52:26Z    Complete    772     2022-03-04 07:52:26+00:00
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42+00:00
2   2022-08-13T01:34:44Z    Complete    1052    2022-08-13 01:34:44+00:00
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000+00:00
4   2022-08-14T06:04:54.736Z    NaN     1053    2022-08-14 06:04:54.736000+00:00
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000+00:00
6   2022-08-24T06:24:54.736Z    NaN     999     2022-08-24 06:24:54.736000+00:00

Alternatively, if you only want to update the status when the id is found in df1:

df['status']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).notna()), 
                           (df['id'].map(df1.set_index(['id'])['task_status'])) )
df
    system_time     status  id  date
0   2022-03-04T07:52:26Z    Complete    772     2022-03-04 07:52:26+00:00
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42+00:00
2   2022-08-13T01:34:44Z    Complete    1052    2022-08-13 01:34:44+00:00
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000+00:00
4   2022-08-14T06:04:54.736Z    Pending     1053    2022-08-14 06:04:54.736000+00:00
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000+00:00
6   2022-08-24T06:24:54.736Z    Inprogress  999     2022-08-24 06:24:54.736000+00:00
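
The same "update only when found" step can also be sketched by building the lookup once and falling back to the existing status with fillna. This is a minimal sketch rather than the answer's own code: the name lookup is illustrative, the sample data is copied from the question with the date column left out, and it assumes the ids in df1 are unique.

```python
import pandas as pd

# Sample data copied from the question (date column omitted for brevity)
df = pd.DataFrame({
    "system_time": [
        "2022-03-04T07:52:26Z", "2022-06-22T17:52:42Z", "2022-08-13T01:34:44Z",
        "2022-08-24T01:46:31.115Z", "2022-08-14T06:04:54.736Z",
        "2022-03-04T17:51:15.025Z", "2022-08-24T06:24:54.736Z",
    ],
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Build the id -> task_status lookup once (assumes ids in df1 are unique)
lookup = df1.set_index("id")["task_status"]

# Ids missing from df1 map to NaN; fillna restores the original status for them
df["status"] = df["id"].map(lookup).fillna(df["status"])
print(df)
```

Building the lookup once avoids repeating the set_index and map calls inside mask.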

For the unmatched ids:

# Set the 'unmatched' column to 'unmatched' when the id is NOT found in df1.
# When it is found, keep the status as-is.
# You may want to use this together with the previous step.

df['unmatched']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).isna()), 
                           'unmatched' )
df
system_time     status  id  date    unmatched
0   2022-03-04T07:52:26Z    Complete    772     2022-03-04 07:52:26+00:00   Complete
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42+00:00   Pending
2   2022-08-13T01:34:44Z    Complete    1052    2022-08-13 01:34:44+00:00   Complete
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000+00:00    Complete
4   2022-08-14T06:04:54.736Z    Pending     1053    2022-08-14 06:04:54.736000+00:00    unmatched
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000+00:00    Complete
6   2022-08-24T06:24:54.736Z    Inprogress  999     2022-08-24 06:24:54.736000+00:00    unmatched

Keep the last row for each id, based on system time:

df.sort_values('system_time').drop_duplicates(subset=['id'], keep='last')
system_time     status  id  date    unmatched
5   2022-03-04T17:51:15.025Z    Complete    772     2022-03-04 17:51:15.025000+00:00    Complete
1   2022-06-22T17:52:42Z    Pending     963     2022-06-22 17:52:42+00:00   Pending
4   2022-08-14T06:04:54.736Z    Pending     1053    2022-08-14 06:04:54.736000+00:00    unmatched
3   2022-08-24T01:46:31.115Z    Complete    1052    2022-08-24 01:46:31.115000+00:00    Complete
6   2022-08-24T06:24:54.736Z    Inprogress  999     2022-08-24 06:24:54.736000+00:00    unmatched
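
Putting the steps together, the following is a minimal end-to-end sketch under a few assumptions: the names latest and lookup are illustrative, the sample data is copied from the question, and ids in df1 are assumed to be unique. The timestamps are parsed one by one with pd.Timestamp so that strings with and without fractional seconds are handled the same way, then the rows are sorted on the parsed values instead of the raw strings.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "system_time": [
        "2022-03-04T07:52:26Z", "2022-06-22T17:52:42Z", "2022-08-13T01:34:44Z",
        "2022-08-24T01:46:31.115Z", "2022-08-14T06:04:54.736Z",
        "2022-03-04T17:51:15.025Z", "2022-08-24T06:24:54.736Z",
    ],
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Parse each ISO 8601 string individually (mixed precision is fine this way)
df["date"] = [pd.Timestamp(s) for s in df["system_time"]]

# Keep only the most recent row per id
latest = df.sort_values("date").drop_duplicates(subset="id", keep="last").copy()

# Overwrite status from df1 where the id matches, otherwise keep df's status
lookup = df1.set_index("id")["task_status"]
latest["status"] = latest["id"].map(lookup).fillna(latest["status"])

print(latest.sort_index())
```

Deduplicating before the lookup means the map runs only on the rows that survive; doing the lookup first gives the same statuses here because the update depends only on id.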

You need to use merge:

df = df.merge(right=df1, on='id',how='left')
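
On its own, the left merge only adds a task_status column next to status. A possible way to finish the update is sketched below; the fillna and drop steps are an assumption about how the merge would be used, not part of the answer, and the name merged is illustrative. The sample data is copied from the question with the date column left out.

```python
import pandas as pd

# Sample data copied from the question (date column omitted for brevity)
df = pd.DataFrame({
    "system_time": [
        "2022-03-04T07:52:26Z", "2022-06-22T17:52:42Z", "2022-08-13T01:34:44Z",
        "2022-08-24T01:46:31.115Z", "2022-08-14T06:04:54.736Z",
        "2022-03-04T17:51:15.025Z", "2022-08-24T06:24:54.736Z",
    ],
    "status": ["Pending", "Pending", "Pending", "Complete",
               "Pending", "Pending", "Inprogress"],
    "id": [772, 963, 1052, 1052, 1053, 772, 999],
})
df1 = pd.DataFrame({
    "id": [1052, 889, 772, 963],
    "task_status": ["Complete", "Pending", "Complete", "Pending"],
})

# Left merge keeps every row of df; unmatched ids get NaN in task_status
merged = df.merge(df1, on="id", how="left")

# Prefer task_status where present, otherwise keep the original status
merged["status"] = merged["task_status"].fillna(merged["status"])

# The helper column is no longer needed once status is updated
merged = merged.drop(columns="task_status")
print(merged)
```

If df1 contained duplicate ids, the merge would multiply the matching rows of df, so it may be worth deduplicating df1 first.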
