Replace values from a dataframe with values from another with Pandas
I have two dataframes with identical columns, but different values and a different number of rows.
import pandas as pd
data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,500,470,0,415]}
data2 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017],
'Price': [200, 100, 30,750,350,120,400,370]}
df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
`df` is the complete dataset but with some old values, whereas `df2` only has the updated values. I want to replace the values in `df` with the matching values from `df2`, while keeping the values from `df` that have no counterpart in `df2`.
So for example, in `df`, for `Country` = Japan, `Product` = DEF and `Year` = 2016, the `Price` should be updated from 500 to 400. The same applies to 2017 (470 to 370), while 2018 and 2019 stay the same.
So far I have the following code, which doesn't seem to work:
common_index = ['Region','Country','Product','Year']
df = df.set_index(common_index)
df2 = df2.set_index(common_index)
df.update(df2, overwrite = True)
But this only updates `df` with the values from `df2` and deletes everything else.
The expected output should look like this:
data3 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [200, 100, 30,750,350,120,0,890,400,370,0,415]}
df3 = pd.DataFrame(data3)
Any suggestions on how I can do this?
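A note on the attempt above: `DataFrame.update` modifies the frame in place and returns `None`. The sketch below rebuilds the question's data in a compact form and runs the same index-based update without assigning the return value to anything. The names `keys`, `indexed`, `returned` and `updated` are illustrative and not from the original post. Whether a reassignment such as `df = df.update(...)` caused the reported behaviour is only a guess.

```python
import pandas as pd

keys = ['Region', 'Country', 'Product', 'Year']

# Same data as in the question, written compactly.
df = pd.DataFrame({
    'Region': ['Africa'] * 8 + ['Asia'] * 4,
    'Country': ['South Africa'] * 8 + ['Japan'] * 4,
    'Product': ['ABC'] * 4 + ['XYZ'] * 4 + ['DEF'] * 4,
    'Year': [2016, 2017, 2018, 2019] * 3,
    'Price': [500, 400, 0, 450, 750, 0, 0, 890, 500, 470, 0, 415],
})
df2 = pd.DataFrame({
    'Region': ['Africa'] * 6 + ['Asia'] * 2,
    'Country': ['South Africa'] * 6 + ['Japan'] * 2,
    'Product': ['ABC'] * 4 + ['XYZ'] * 2 + ['DEF'] * 2,
    'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017],
    'Price': [200, 100, 30, 750, 350, 120, 400, 370],
})

# Align both frames on the key columns.
indexed = df.set_index(keys)

# update() changes `indexed` in place; its return value is None.
returned = indexed.update(df2.set_index(keys))
print(returned)

# Turn the key columns back into ordinary columns.
updated = indexed.reset_index()
print(updated)
```

Rows of `df` whose keys do not appear in `df2` keep their original `Price`, because `update` only overwrites where the other frame has non-NA values.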
You can use `merge` and `update`:
df.update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'],
how='left', suffixes=('_old', None)))
NB. `update` works in place.
Output:
Region Country Product Year Price
0 Africa South Africa ABC 2016 200.0
1 Africa South Africa ABC 2017 100.0
2 Africa South Africa ABC 2018 30.0
3 Africa South Africa ABC 2019 750.0
4 Africa South Africa XYZ 2016 350.0
5 Africa South Africa XYZ 2017 120.0
6 Africa South Africa XYZ 2018 0.0
7 Africa South Africa XYZ 2019 890.0
8 Asia Japan DEF 2016 400.0
9 Asia Japan DEF 2017 370.0
10 Asia Japan DEF 2018 0.0
11 Asia Japan DEF 2019 415.0
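To see why the `merge` plus `update` combination works, it can help to look at the intermediate frame. The following is a minimal sketch using the question's data in compact form. The variable names `keys` and `merged` are illustrative and not part of the answer.

```python
import pandas as pd

keys = ['Region', 'Country', 'Product', 'Year']

df = pd.DataFrame({
    'Region': ['Africa'] * 8 + ['Asia'] * 4,
    'Country': ['South Africa'] * 8 + ['Japan'] * 4,
    'Product': ['ABC'] * 4 + ['XYZ'] * 4 + ['DEF'] * 4,
    'Year': [2016, 2017, 2018, 2019] * 3,
    'Price': [500, 400, 0, 450, 750, 0, 0, 890, 500, 470, 0, 415],
})
df2 = pd.DataFrame({
    'Region': ['Africa'] * 6 + ['Asia'] * 2,
    'Country': ['South Africa'] * 6 + ['Japan'] * 2,
    'Product': ['ABC'] * 4 + ['XYZ'] * 2 + ['DEF'] * 2,
    'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017],
    'Price': [200, 100, 30, 750, 350, 120, 400, 370],
})

# The left join keeps every row of df. The suffixes rename df's own price
# column to 'Price_old' and leave df2's column named 'Price'.
merged = df.merge(df2, on=keys, how='left', suffixes=('_old', None))
print(merged.columns.tolist())
# ['Region', 'Country', 'Product', 'Year', 'Price_old', 'Price']

# Rows of df without a match in df2 have NaN in the new 'Price' column.
print(merged['Price'].isna().sum())
# 4

# update() copies only the non-NaN values of columns that exist in df,
# so the four unmatched rows keep their old price.
df.update(merged)
print(df)
```

The `Price_old` column is ignored by `update` because `df` has no column of that name.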
You can use:
df['Price'].update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'], how='left')['Price_y'])
print(df)
Region Country Product Year Price
0 Africa South Africa ABC 2016 200
1 Africa South Africa ABC 2017 100
2 Africa South Africa ABC 2018 30
3 Africa South Africa ABC 2019 750
4 Africa South Africa XYZ 2016 350
5 Africa South Africa XYZ 2017 120
6 Africa South Africa XYZ 2018 0
7 Africa South Africa XYZ 2019 890
8 Asia Japan DEF 2016 400
9 Asia Japan DEF 2017 370
10 Asia Japan DEF 2018 0
11 Asia Japan DEF 2019 415
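One caveat, as far as I understand it: `df['Price'].update(...)` is a chained in-place call, and under pandas' Copy-on-Write mode (the default from pandas 3.0) such calls may not write back to `df`. The sketch below is a variant that assigns the result explicitly instead. It is not the answerer's code, and the names `keys` and `merged` are illustrative.

```python
import pandas as pd

keys = ['Region', 'Country', 'Product', 'Year']

df = pd.DataFrame({
    'Region': ['Africa'] * 8 + ['Asia'] * 4,
    'Country': ['South Africa'] * 8 + ['Japan'] * 4,
    'Product': ['ABC'] * 4 + ['XYZ'] * 4 + ['DEF'] * 4,
    'Year': [2016, 2017, 2018, 2019] * 3,
    'Price': [500, 400, 0, 450, 750, 0, 0, 890, 500, 470, 0, 415],
})
df2 = pd.DataFrame({
    'Region': ['Africa'] * 6 + ['Asia'] * 2,
    'Country': ['South Africa'] * 6 + ['Japan'] * 2,
    'Product': ['ABC'] * 4 + ['XYZ'] * 2 + ['DEF'] * 2,
    'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017],
    'Price': [200, 100, 30, 750, 350, 120, 400, 370],
})

# With the default suffixes, df's price becomes 'Price_x' and df2's 'Price_y'.
merged = df.merge(df2, on=keys, how='left')

# Take the new price where there is one, otherwise fall back to the old price.
# No NaN remains after fillna, so casting back to int is safe.
df['Price'] = merged['Price_y'].fillna(merged['Price_x']).astype(int)
print(df)
```

This relies on `df` having a default `RangeIndex` and on the keys in `df2` being unique, so that the left join returns exactly one row per row of `df`, in the same order.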
I don't know if this is the case, but what if `df2` carries something not listed in `df1`? Here I'm adding a row to `df2` with the data Asia, Japan, DEF, 2020, 400.
import pandas as pd
import numpy as np
data1 = {
'Region': ['Africa','Africa','Africa','Africa',
'Africa','Africa','Africa','Africa',
'Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa',
'South Africa','South Africa','South Africa',
'South Africa','South Africa','South Africa',
'Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ',
'XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018,
2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,500,
470,0,415]}
data2 = {
'Region': ['Africa','Africa','Africa','Africa','Africa',
'Africa','Asia','Asia', 'Asia'],
'Country': ['South Africa','South Africa','South Africa',
'South Africa','South Africa',
'South Africa','Japan','Japan', 'Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF',
'DEF', 'DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017, 2020],
'Price': [200, 100, 30,750,350,120,400,370, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Here I call the first dataframe `df1` instead of `df`. Then I'm adding a few steps so we know exactly what is going on.
First I rename `Price` to `Price_new` in `df2`, then I do an outer join between the two dataframes.
df2 = df2.rename(columns={"Price": "Price_new"})
cols_merge = ['Region', 'Country', 'Product', 'Year']
df = pd.merge(df1, df2, how="outer", on=cols_merge)
which gives:
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 500.0 200.0
1 Africa South Africa ABC 2017 400.0 100.0
2 Africa South Africa ABC 2018 0.0 30.0
3 Africa South Africa ABC 2019 450.0 750.0
4 Africa South Africa XYZ 2016 750.0 350.0
5 Africa South Africa XYZ 2017 0.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 500.0 400.0
9 Asia Japan DEF 2017 470.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 NaN 400.0
Now, wherever `Price_new` is not null, we update the `Price` column:
df["Price"] = np.where(
df["Price_new"].notnull(),
df["Price_new"],
df["Price"])
The output is:
Region Country Product Year Price Price_new
0 Africa South Africa ABC 2016 200.0 200.0
1 Africa South Africa ABC 2017 100.0 100.0
2 Africa South Africa ABC 2018 30.0 30.0
3 Africa South Africa ABC 2019 750.0 750.0
4 Africa South Africa XYZ 2016 350.0 350.0
5 Africa South Africa XYZ 2017 120.0 120.0
6 Africa South Africa XYZ 2018 0.0 NaN
7 Africa South Africa XYZ 2019 890.0 NaN
8 Asia Japan DEF 2016 400.0 400.0
9 Asia Japan DEF 2017 370.0 370.0
10 Asia Japan DEF 2018 0.0 NaN
11 Asia Japan DEF 2019 415.0 NaN
12 Asia Japan DEF 2020 400.0 400.0
And you can eventually remove the extra column with:
df = df.drop(columns=["Price_new"])
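The steps above can also be gathered into one small function. This is a sketch under a few assumptions: the helper name `apply_updates` and its parameters are hypothetical, `Series.fillna` stands in for `np.where`, and `indicator=True` is added so that rows present only in `df2` can be identified afterwards.

```python
import pandas as pd

def apply_updates(old, new, keys, value="Price"):
    """Hypothetical helper: outer-join `new` onto `old` and prefer new values."""
    new_col = value + "_new"
    merged = old.merge(
        new.rename(columns={value: new_col}),
        how="outer",
        on=keys,
        indicator=True,  # adds a '_merge' column: left_only / right_only / both
    )
    # Use the new value where present, otherwise keep the old one.
    merged[value] = merged[new_col].fillna(merged[value])
    return merged.drop(columns=[new_col])

keys = ['Region', 'Country', 'Product', 'Year']

df1 = pd.DataFrame({
    'Region': ['Africa'] * 8 + ['Asia'] * 4,
    'Country': ['South Africa'] * 8 + ['Japan'] * 4,
    'Product': ['ABC'] * 4 + ['XYZ'] * 4 + ['DEF'] * 4,
    'Year': [2016, 2017, 2018, 2019] * 3,
    'Price': [500, 400, 0, 450, 750, 0, 0, 890, 500, 470, 0, 415],
})
# df2 includes the extra row Asia, Japan, DEF, 2020, 400.
df2 = pd.DataFrame({
    'Region': ['Africa'] * 6 + ['Asia'] * 3,
    'Country': ['South Africa'] * 6 + ['Japan'] * 3,
    'Product': ['ABC'] * 4 + ['XYZ'] * 2 + ['DEF'] * 3,
    'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017, 2020],
    'Price': [200, 100, 30, 750, 350, 120, 400, 370, 400],
})

result = apply_updates(df1, df2, keys)
print(result)

# Rows that exist only in df2 are flagged as 'right_only'.
print(result[result['_merge'] == 'right_only'])
```

If rows that exist only in `df2` should not be added, `how="left"` in place of `how="outer"` would drop them.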
The other solutions are great and I upvoted them. I added this to show that sometimes it is better to use less specific code in order to have better control and maintainability.