[英]Replace values from a dataframe with values from another with Pandas
我有两个具有相同列的数据框,但值不同且行数不同。
import pandas as pd
data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,500,470,0,415]}
data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017],
'Price': [200, 100, 30,750,350,120,400,370]}
df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df
是完整的数据集,但有一些旧值,而df2
只有更新的值。 我想用df2
中的值替换df
中的所有值,同时保留df
中不在df2
中的值。
例如,在df
中, Country
= Japan 的值, Product
= DEF 的值, Year
= 2016, Price
应该从 470 更新到 400。2017 年相同,而 2018 年和 2019 年保持不变。
到目前为止,我有以下似乎不起作用的代码:
common_index = ['Region','Country','Product','Year']
df = df.set_index(common_index)
df2 = df2.set_index(common_index)
df.update(df2, overwrite = True)
但这只会使用df2
中的值更新df
并删除其他所有内容。
预期输出应如下所示:
data3 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [200, 100, 30,750,350,120,0,890,400,370,0,415]}
df3 = pd.DataFrame(data3)
关于如何做到这一点的任何建议?
df.update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'],
how='left', suffixes=('_old', None)))
注意。 update
到位。
输出:
Region Country Product Year Price
0 Africa South Africa ABC 2016 200.0
1 Africa South Africa ABC 2017 100.0
2 Africa South Africa ABC 2018 30.0
3 Africa South Africa ABC 2019 750.0
4 Africa South Africa XYZ 2016 350.0
5 Africa South Africa XYZ 2017 120.0
6 Africa South Africa XYZ 2018 0.0
7 Africa South Africa XYZ 2019 890.0
8 Asia Japan DEF 2016 400.0
9 Asia Japan DEF 2017 370.0
10 Asia Japan DEF 2018 0.0
11 Asia Japan DEF 2019 415.0
您可以使用
df['Price'].update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'], how='left')['Price_y'])
print(df)
Region Country Product Year Price
0 Africa South Africa ABC 2016 200
1 Africa South Africa ABC 2017 100
2 Africa South Africa ABC 2018 30
3 Africa South Africa ABC 2019 750
4 Africa South Africa XYZ 2016 350
5 Africa South Africa XYZ 2017 120
6 Africa South Africa XYZ 2018 0
7 Africa South Africa XYZ 2019 890
8 Asia Japan DEF 2016 400
9 Asia Japan DEF 2017 370
10 Asia Japan DEF 2018 0
11 Asia Japan DEF 2019 415
我不知道是不是这种情况,但是如果df2
带有df1
中未列出的东西怎么办? 在这里,我在df2
中添加了一行数据 Asia, Japan, DEF, 2020, 400。
import pandas as pd
import numpy as np
data1 = {
'Region': ['Africa','Africa','Africa','Africa',
'Africa','Africa','Africa','Africa',
'Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa',
'South Africa','South Africa','South Africa',
'South Africa','South Africa','South Africa',
'Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ',
'XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018,
2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,500,
470,0,415]}
data2 = {
'Region': ['Africa','Africa','Africa','Africa','Africa',
'Africa','Asia','Asia', 'Asia'],
'Country': ['South Africa','South Africa','South Africa',
'South Africa','South Africa',
'South Africa','Japan','Japan', 'Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF',
'DEF', 'DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017, 2020],
'Price': [200, 100, 30,750,350,120,400,370, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
在这里,我将df1
称为第一个数据帧而不是df
。 然后我添加了几个步骤,以便我们确切地知道发生了什么。
首先,我在df2
中将Price
重命名为Price_new
,然后我将在 2 个数据帧之间进行外部连接。
df2 = df2.rename(columns={"Price": "Price_new"})
cols_merge = ['Region', 'Country', 'Product', 'Year']
df = pd.merge(df1, df2, how="outer", on=cols_merge)
这使
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 500.0 200.0
1 Africa South Africa ABC 2017 400.0 100.0
2 Africa South Africa ABC 2018 0.0 30.0
3 Africa South Africa ABC 2019 450.0 750.0
4 Africa South Africa XYZ 2016 750.0 350.0
5 Africa South Africa XYZ 2017 0.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 500.0 400.0
9 Asia Japan DEF 2017 470.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 NaN 400.0
现在,只要Price_new
不为空,我们就会更新Price
列
df["Price"] = np.where(
df["Price_new"].notnull(),
df["Price_new"],
df["Price"])
输出是
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 200.0 200.0
1 Africa South Africa ABC 2017 100.0 100.0
2 Africa South Africa ABC 2018 30.0 30.0
3 Africa South Africa ABC 2019 750.0 750.0
4 Africa South Africa XYZ 2016 350.0 350.0
5 Africa South Africa XYZ 2017 120.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 400.0 400.0
9 Asia Japan DEF 2017 370.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 400.0 400.0
你可以永远删除额外的列
df = df.drop(columns=["Price_new"])
其他解决方案很棒,我赞成。 我添加这个是为了向您展示,有时最好使用不太具体的代码,以便在您的代码中获得更好的控制和可维护性。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.