
Replace values from a dataframe with values from another with Pandas

I have two dataframes with the same columns, but with different values and a different number of rows.

import pandas as pd

data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [500, 400, 0,450,750,0,0,890,500,470,0,415]}

data2 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017],
         'Price': [200, 100, 30,750,350,120,400,370]}

df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

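df is the full dataset but contains some old values, while df2 only holds the updated values. I want to replace the values in df with the values from df2, while keeping the values in df that are not present in df2.

For example, in df, for Country = Japan, Product = DEF and Year = 2016, Price should be updated from 500 to 400. The same applies to 2017, while 2018 and 2019 stay unchanged.

So far I have the following code, which does not seem to work: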

common_index = ['Region','Country','Product','Year']
df = df.set_index(common_index)
df2 = df2.set_index(common_index)
df.update(df2, overwrite = True)

But this only updates df with the values from df2 and deletes everything else.

The expected output should look like this:
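A minimal sketch on a shortened subset of the data (Japan / DEF rows only), suggesting that an index-based update keeps unmatched rows as long as df2 really holds the update rows and the key columns identify each row uniquely. Prices are written as floats here only to sidestep dtype changes during the update:

```python
import pandas as pd

# Shortened subset of the question's data (Japan / DEF rows only).
df = pd.DataFrame({
    'Region': ['Asia'] * 4,
    'Country': ['Japan'] * 4,
    'Product': ['DEF'] * 4,
    'Year': [2016, 2017, 2018, 2019],
    'Price': [500.0, 470.0, 0.0, 415.0],
})
df2 = pd.DataFrame({
    'Region': ['Asia'] * 2,
    'Country': ['Japan'] * 2,
    'Product': ['DEF'] * 2,
    'Year': [2016, 2017],
    'Price': [400.0, 370.0],
})

common_index = ['Region', 'Country', 'Product', 'Year']

# Align both frames on the key columns, then update df in place.
df = df.set_index(common_index)
df.update(df2.set_index(common_index))

# Turn the key columns back into ordinary columns.
df = df.reset_index()
print(df)
```

The 2018 and 2019 rows have no counterpart in df2, so update leaves them as they were.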

data3 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [200, 100, 30,750,350,120,0,890,400,370,0,415]}

df3 = pd.DataFrame(data3)

Any suggestions on how to do this?

You can use merge and update:

df.update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'],
                   how='left', suffixes=('_old', None)))

NB. update modifies df in place and returns None.

Output:

    Region       Country Product  Year  Price
0   Africa  South Africa     ABC  2016  200.0
1   Africa  South Africa     ABC  2017  100.0
2   Africa  South Africa     ABC  2018   30.0
3   Africa  South Africa     ABC  2019  750.0
4   Africa  South Africa     XYZ  2016  350.0
5   Africa  South Africa     XYZ  2017  120.0
6   Africa  South Africa     XYZ  2018    0.0
7   Africa  South Africa     XYZ  2019  890.0
8     Asia         Japan     DEF  2016  400.0
9     Asia         Japan     DEF  2017  370.0
10    Asia         Japan     DEF  2018    0.0
11    Asia         Japan     DEF  2019  415.0
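To see why this works, it can help to look at the intermediate merge result. The sketch below uses a few illustrative rows and assumes df carries its default 0..n-1 index, since update aligns on the index and a left merge returns a fresh 0..n-1 index in the row order of df:

```python
import pandas as pd

keys = ['Region', 'Country', 'Product', 'Year']

# Illustrative rows only (a subset of the Japan / DEF data).
df = pd.DataFrame({
    'Region': ['Asia'] * 3,
    'Country': ['Japan'] * 3,
    'Product': ['DEF'] * 3,
    'Year': [2016, 2017, 2018],
    'Price': [500.0, 470.0, 0.0],
})
df2 = pd.DataFrame({
    'Region': ['Asia'] * 2,
    'Country': ['Japan'] * 2,
    'Product': ['DEF'] * 2,
    'Year': [2016, 2017],
    'Price': [400.0, 370.0],
})

# suffixes=('_old', None): df's Price becomes Price_old,
# df2's Price keeps the name Price (NaN where df2 has no matching row).
merged = df.merge(df2, on=keys, how='left', suffixes=('_old', None))
print(merged)

# update skips NaN values, so unmatched rows keep their old Price.
df.update(merged)
print(df)
```

If df has a non-default index, resetting it first (or assigning merged.index = df.index) would be needed for the alignment to hold.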

You can use:

df['Price'].update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'], how='left')['Price_y'])
print(df)

    Region       Country Product  Year  Price
0   Africa  South Africa     ABC  2016    200
1   Africa  South Africa     ABC  2017    100
2   Africa  South Africa     ABC  2018     30
3   Africa  South Africa     ABC  2019    750
4   Africa  South Africa     XYZ  2016    350
5   Africa  South Africa     XYZ  2017    120
6   Africa  South Africa     XYZ  2018      0
7   Africa  South Africa     XYZ  2019    890
8     Asia         Japan     DEF  2016    400
9     Asia         Japan     DEF  2017    370
10    Asia         Japan     DEF  2018      0
11    Asia         Japan     DEF  2019    415
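One caveat: df['Price'].update(...) updates through an intermediate Series, which may not propagate back to df when pandas' Copy-on-Write mode is active. A hedged alternative that assigns the column explicitly, sketched on illustrative rows and assuming df has its default 0..n-1 index:

```python
import pandas as pd

keys = ['Region', 'Country', 'Product', 'Year']

# Illustrative rows only (a subset of the Japan / DEF data).
df = pd.DataFrame({
    'Region': ['Asia'] * 3,
    'Country': ['Japan'] * 3,
    'Product': ['DEF'] * 3,
    'Year': [2016, 2017, 2018],
    'Price': [500, 470, 0],
})
df2 = pd.DataFrame({
    'Region': ['Asia'] * 2,
    'Country': ['Japan'] * 2,
    'Product': ['DEF'] * 2,
    'Year': [2016, 2017],
    'Price': [400, 370],
})

# Default suffixes: df's Price becomes Price_x, df2's Price becomes Price_y.
merged = df.merge(df2, on=keys, how='left')

# Take the new price where one exists, otherwise fall back to the old one,
# then cast back to the original integer dtype.
original_dtype = df['Price'].dtype
df['Price'] = merged['Price_y'].fillna(merged['Price_x']).astype(original_dtype)
print(df)
```

Assigning the column directly avoids the chained in-place call and keeps Price as integers.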

I don't know whether this applies to your case, but what if df2 contains rows that are not listed in df1? Here I added one row to df2: Asia, Japan, DEF, 2020, 400.

import pandas as pd
import numpy as np

data1 = {
    'Region': ['Africa','Africa','Africa','Africa',
               'Africa','Africa','Africa','Africa',
               'Asia','Asia','Asia','Asia'],
    'Country': ['South Africa','South Africa',
                'South Africa','South Africa','South Africa',
                'South Africa','South Africa','South Africa',
                'Japan','Japan','Japan','Japan'],
    'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ',
                'XYZ','DEF','DEF','DEF','DEF'],
    'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018,
             2019,2016, 2017, 2018, 2019],
    'Price': [500, 400, 0,450,750,0,0,890,500,
              470,0,415]}

data2 = {
    'Region': ['Africa','Africa','Africa','Africa','Africa',
               'Africa','Asia','Asia', 'Asia'],
    'Country': ['South Africa','South Africa','South Africa',
                'South Africa','South Africa',
                'South Africa','Japan','Japan', 'Japan'],
    'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF',
                'DEF', 'DEF'],
    'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017, 2020],
    'Price': [200, 100, 30,750,350,120,400,370, 400]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

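Here I call the first dataframe df1 instead of df. I then add a few steps so that we know exactly what is happening.

First, I rename Price to Price_new in df2, then I do an outer join between the two dataframes.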

df2 = df2.rename(columns={"Price": "Price_new"})
cols_merge = ['Region', 'Country', 'Product', 'Year']
df = pd.merge(df1, df2, how="outer", on=cols_merge)

This gives:

    Region       Country Product  Year  Price  Price_new
0   Africa  South Africa     ABC  2016  500.0      200.0
1   Africa  South Africa     ABC  2017  400.0      100.0
2   Africa  South Africa     ABC  2018    0.0       30.0
3   Africa  South Africa     ABC  2019  450.0      750.0
4   Africa  South Africa     XYZ  2016  750.0      350.0
5   Africa  South Africa     XYZ  2017    0.0      120.0
6   Africa  South Africa     XYZ  2018    0.0        NaN
7   Africa  South Africa     XYZ  2019  890.0        NaN
8     Asia         Japan     DEF  2016  500.0      400.0
9     Asia         Japan     DEF  2017  470.0      370.0
10    Asia         Japan     DEF  2018    0.0        NaN
11    Asia         Japan     DEF  2019  415.0        NaN
12    Asia         Japan     DEF  2020    NaN      400.0
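One optional way to check which rows came from which frame before overwriting anything is the indicator parameter of merge. A minimal sketch on made-up rows (the frames below are illustrative, not the full data above):

```python
import pandas as pd

keys = ['Region', 'Country', 'Product', 'Year']

# Illustrative rows: 2016 is in both frames, 2018 only in df1, 2020 only in df2.
df1 = pd.DataFrame({
    'Region': ['Asia'] * 2,
    'Country': ['Japan'] * 2,
    'Product': ['DEF'] * 2,
    'Year': [2016, 2018],
    'Price': [500, 0],
})
df2 = pd.DataFrame({
    'Region': ['Asia'] * 2,
    'Country': ['Japan'] * 2,
    'Product': ['DEF'] * 2,
    'Year': [2016, 2020],
    'Price_new': [400, 400],
})

# indicator=True adds a _merge column with the values
# 'both', 'left_only' or 'right_only' for each row.
merged = pd.merge(df1, df2, how='outer', on=keys, indicator=True)
print(merged[['Year', 'Price', 'Price_new', '_merge']])
```

Rows marked right_only are the ones that exist only in df2, such as the 2020 row added here.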

Now we update Price wherever Price_new is not null:

df["Price"] = np.where(
    df["Price_new"].notnull(),
    df["Price_new"],
    df["Price"])

The output is:

    Region       Country Product  Year  Price  Price_new
0   Africa  South Africa     ABC  2016  200.0      200.0
1   Africa  South Africa     ABC  2017  100.0      100.0
2   Africa  South Africa     ABC  2018   30.0       30.0
3   Africa  South Africa     ABC  2019  750.0      750.0
4   Africa  South Africa     XYZ  2016  350.0      350.0
5   Africa  South Africa     XYZ  2017  120.0      120.0
6   Africa  South Africa     XYZ  2018    0.0        NaN
7   Africa  South Africa     XYZ  2019  890.0        NaN
8     Asia         Japan     DEF  2016  400.0      400.0
9     Asia         Japan     DEF  2017  370.0      370.0
10    Asia         Japan     DEF  2018    0.0        NaN
11    Asia         Japan     DEF  2019  415.0        NaN
12    Asia         Japan     DEF  2020  400.0      400.0

You can always drop the extra column afterwards:

df = df.drop(columns=["Price_new"])
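The steps above can be collected into one small helper. This is only a sketch: the function name upsert_column and its parameters are hypothetical, it uses fillna in place of np.where, and it assumes no column named after the value column plus "_new" already exists in either frame:

```python
import pandas as pd


def upsert_column(old, new, keys, value):
    """Overwrite `value` in `old` with the values from `new` where the keys
    match, and append rows that exist only in `new` (hypothetical helper)."""
    tmp = f"{value}_new"  # temporary column name, assumed to be unused
    merged = pd.merge(old, new.rename(columns={value: tmp}),
                      how="outer", on=keys)
    # Prefer the new value; fall back to the old one where there is none.
    merged[value] = merged[tmp].fillna(merged[value])
    return merged.drop(columns=[tmp])


keys = ['Region', 'Country', 'Product', 'Year']

# Illustrative rows: 2016 gets updated, 2018 is kept, 2020 is added.
df1 = pd.DataFrame({
    'Region': ['Asia'] * 2,
    'Country': ['Japan'] * 2,
    'Product': ['DEF'] * 2,
    'Year': [2016, 2018],
    'Price': [500, 0],
})
df2 = pd.DataFrame({
    'Region': ['Asia'] * 2,
    'Country': ['Japan'] * 2,
    'Product': ['DEF'] * 2,
    'Year': [2016, 2020],
    'Price': [400, 400],
})

result = upsert_column(df1, df2, keys, 'Price')
print(result)
```

Because the outer merge introduces NaN, Price comes back as float; cast it with astype(int) if integers are needed.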

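Note

The other solutions are great and I upvoted them. I added this one to show that it is sometimes better to use less specialized code in order to get more control and better maintainability.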
