[英]Replace values from a dataframe with values from another with Pandas
我有兩個具有相同列的數據框,但值不同且行數不同。
import pandas as pd
data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,500,470,0,415]}
data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017],
'Price': [200, 100, 30,750,350,120,400,370]}
df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df
是完整的數據集,但有一些舊值,而df2
只有更新的值。 我想用df2
中的值替換df
中的所有值,同時保留df
中不在df2
中的值。
例如,在df
中, Country
= Japan 的值, Product
= DEF 的值, Year
= 2016, Price
應該從 470 更新到 400。2017 年相同,而 2018 年和 2019 年保持不變。
到目前為止,我有以下似乎不起作用的代碼:
common_index = ['Region','Country','Product','Year']
df = df.set_index(common_index)
df2 = df2.set_index(common_index)
df.update(df2, overwrite = True)
但這只會使用df2
中的值更新df
並刪除其他所有內容。
預期輸出應如下所示:
data3 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [200, 100, 30,750,350,120,0,890,400,370,0,415]}
df3 = pd.DataFrame(data3)
關於如何做到這一點的任何建議?
df.update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'],
how='left', suffixes=('_old', None)))
注意。 update
到位。
輸出:
Region Country Product Year Price
0 Africa South Africa ABC 2016 200.0
1 Africa South Africa ABC 2017 100.0
2 Africa South Africa ABC 2018 30.0
3 Africa South Africa ABC 2019 750.0
4 Africa South Africa XYZ 2016 350.0
5 Africa South Africa XYZ 2017 120.0
6 Africa South Africa XYZ 2018 0.0
7 Africa South Africa XYZ 2019 890.0
8 Asia Japan DEF 2016 400.0
9 Asia Japan DEF 2017 370.0
10 Asia Japan DEF 2018 0.0
11 Asia Japan DEF 2019 415.0
您可以使用
df['Price'].update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'], how='left')['Price_y'])
print(df)
Region Country Product Year Price
0 Africa South Africa ABC 2016 200
1 Africa South Africa ABC 2017 100
2 Africa South Africa ABC 2018 30
3 Africa South Africa ABC 2019 750
4 Africa South Africa XYZ 2016 350
5 Africa South Africa XYZ 2017 120
6 Africa South Africa XYZ 2018 0
7 Africa South Africa XYZ 2019 890
8 Asia Japan DEF 2016 400
9 Asia Japan DEF 2017 370
10 Asia Japan DEF 2018 0
11 Asia Japan DEF 2019 415
我不知道是不是這種情況,但是如果df2
帶有df1
中未列出的東西怎么辦? 在這里,我在df2
中添加了一行數據 Asia, Japan, DEF, 2020, 400。
import pandas as pd
import numpy as np
data1 = {
'Region': ['Africa','Africa','Africa','Africa',
'Africa','Africa','Africa','Africa',
'Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa',
'South Africa','South Africa','South Africa',
'South Africa','South Africa','South Africa',
'Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ',
'XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018,
2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,500,
470,0,415]}
data2 = {
'Region': ['Africa','Africa','Africa','Africa','Africa',
'Africa','Asia','Asia', 'Asia'],
'Country': ['South Africa','South Africa','South Africa',
'South Africa','South Africa',
'South Africa','Japan','Japan', 'Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF',
'DEF', 'DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017, 2020],
'Price': [200, 100, 30,750,350,120,400,370, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
在這里,我將df1
稱為第一個數據幀而不是df
。 然后我添加了幾個步驟,以便我們確切地知道發生了什么。
首先,我在df2
中將Price
重命名為Price_new
,然后我將在 2 個數據幀之間進行外部連接。
df2 = df2.rename(columns={"Price": "Price_new"})
cols_merge = ['Region', 'Country', 'Product', 'Year']
df = pd.merge(df1, df2, how="outer", on=cols_merge)
這使
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 500.0 200.0
1 Africa South Africa ABC 2017 400.0 100.0
2 Africa South Africa ABC 2018 0.0 30.0
3 Africa South Africa ABC 2019 450.0 750.0
4 Africa South Africa XYZ 2016 750.0 350.0
5 Africa South Africa XYZ 2017 0.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 500.0 400.0
9 Asia Japan DEF 2017 470.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 NaN 400.0
現在,只要Price_new
不為空,我們就會更新Price
列
df["Price"] = np.where(
df["Price_new"].notnull(),
df["Price_new"],
df["Price"])
輸出是
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 200.0 200.0
1 Africa South Africa ABC 2017 100.0 100.0
2 Africa South Africa ABC 2018 30.0 30.0
3 Africa South Africa ABC 2019 750.0 750.0
4 Africa South Africa XYZ 2016 350.0 350.0
5 Africa South Africa XYZ 2017 120.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 400.0 400.0
9 Asia Japan DEF 2017 370.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 400.0 400.0
你可以永遠刪除額外的列
df = df.drop(columns=["Price_new"])
其他解決方案很棒,我贊成。 我添加這個是為了向您展示,有時最好使用不太具體的代碼,以便在您的代碼中獲得更好的控制和可維護性。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.