Replace values from a dataframe with values from another with Pandas

I have two dataframes with identical columns, but different values and a different number of rows.

import pandas as pd

data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [500, 400, 0,450,750,0,0,890,500,470,0,415]}

data2 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017],
         'Price': [200, 100, 30,750,350,120,400,370]}

df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df is the complete dataset but with some old values, whereas df2 only has the updated values. I want to replace every value in df that has a counterpart in df2 with the value from df2, while keeping the values from df that aren't in df2.

So for example, in df, for Country = Japan, Product = DEF and Year = 2016, the Price should be updated from 500 to 400. The same goes for 2017 (470 to 370), while 2018 and 2019 stay the same.

So far I have the following code, which doesn't seem to work:

common_index = ['Region','Country','Product','Year']
df = df.set_index(common_index)
df2 = df2.set_index(common_index)
df.update(df2, overwrite = True)

But this only updates df with the values from df2 and deletes everything else.

Expected output should look like this:

data3 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [200, 100, 30,750,350,120,0,890,400,370,0,415]}

df3 = pd.DataFrame(data3)

Any suggestions on how I can do this?

You can use merge and update:

df.update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'],
                   how='left', suffixes=('_old', None)))

NB. The update is in place.

Output:

    Region       Country Product  Year  Price
0   Africa  South Africa     ABC  2016  200.0
1   Africa  South Africa     ABC  2017  100.0
2   Africa  South Africa     ABC  2018   30.0
3   Africa  South Africa     ABC  2019  750.0
4   Africa  South Africa     XYZ  2016  350.0
5   Africa  South Africa     XYZ  2017  120.0
6   Africa  South Africa     XYZ  2018    0.0
7   Africa  South Africa     XYZ  2019  890.0
8     Asia         Japan     DEF  2016  400.0
9     Asia         Japan     DEF  2017  370.0
10    Asia         Japan     DEF  2018    0.0
11    Asia         Japan     DEF  2019  415.0
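To see why this works, it can help to look at the intermediate merge. The sketch below rebuilds the sample frames in compact form (the `keys` and `merged` names are mine, not from the answer) and shows that the left merge yields NaN in `Price` wherever df2 has no matching row. `update` skips NaN values, which is what leaves those rows untouched. The final `astype(int)` is an optional extra step, assuming you want integer prices back in case the column came out as float.

```python
import pandas as pd

keys = ['Region', 'Country', 'Product', 'Year']

df = pd.DataFrame({
    'Region':  ['Africa'] * 8 + ['Asia'] * 4,
    'Country': ['South Africa'] * 8 + ['Japan'] * 4,
    'Product': ['ABC'] * 4 + ['XYZ'] * 4 + ['DEF'] * 4,
    'Year':    [2016, 2017, 2018, 2019] * 3,
    'Price':   [500, 400, 0, 450, 750, 0, 0, 890, 500, 470, 0, 415],
})
df2 = pd.DataFrame({
    'Region':  ['Africa'] * 6 + ['Asia'] * 2,
    'Country': ['South Africa'] * 6 + ['Japan'] * 2,
    'Product': ['ABC'] * 4 + ['XYZ'] * 2 + ['DEF'] * 2,
    'Year':    [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017],
    'Price':   [200, 100, 30, 750, 350, 120, 400, 370],
})

# Left merge: 'Price_old' comes from df, 'Price' from df2.
# Rows of df without a match in df2 get NaN in 'Price'.
merged = df.merge(df2, on=keys, how='left', suffixes=('_old', None))
print(merged[['Product', 'Year', 'Price_old', 'Price']])

# update() aligns on index and column names and ignores NaN,
# so only the rows present in df2 are overwritten.
df.update(merged)

# Optional: go back to integers if the column was upcast to float.
df['Price'] = df['Price'].astype(int)
print(df)
```

This relies on df having its default RangeIndex and on df2 holding at most one row per key combination; otherwise the merged rows would no longer line up with df.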

You can use

df['Price'].update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'], how='left')['Price_y'])
print(df)

    Region       Country Product  Year  Price
0   Africa  South Africa     ABC  2016    200
1   Africa  South Africa     ABC  2017    100
2   Africa  South Africa     ABC  2018     30
3   Africa  South Africa     ABC  2019    750
4   Africa  South Africa     XYZ  2016    350
5   Africa  South Africa     XYZ  2017    120
6   Africa  South Africa     XYZ  2018      0
7   Africa  South Africa     XYZ  2019    890
8     Asia         Japan     DEF  2016    400
9     Asia         Japan     DEF  2017    370
10    Asia         Japan     DEF  2018      0
11    Asia         Japan     DEF  2019    415
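One caveat, as far as I can tell: `df['Price'].update(...)` modifies the Series returned by `df['Price']`, so on pandas versions where Copy-on-Write is enabled (the default from pandas 3.0) the change may not propagate back to df. A sketch of an equivalent plain assignment follows. It assumes df has its default RangeIndex and df2 has at most one row per key, and the `keys` and `merged` names are only illustrative.

```python
import pandas as pd

keys = ['Region', 'Country', 'Product', 'Year']

df = pd.DataFrame({
    'Region':  ['Africa'] * 8 + ['Asia'] * 4,
    'Country': ['South Africa'] * 8 + ['Japan'] * 4,
    'Product': ['ABC'] * 4 + ['XYZ'] * 4 + ['DEF'] * 4,
    'Year':    [2016, 2017, 2018, 2019] * 3,
    'Price':   [500, 400, 0, 450, 750, 0, 0, 890, 500, 470, 0, 415],
})
df2 = pd.DataFrame({
    'Region':  ['Africa'] * 6 + ['Asia'] * 2,
    'Country': ['South Africa'] * 6 + ['Japan'] * 2,
    'Product': ['ABC'] * 4 + ['XYZ'] * 2 + ['DEF'] * 2,
    'Year':    [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017],
    'Price':   [200, 100, 30, 750, 350, 120, 400, 370],
})

# With the default suffixes the merge yields Price_x (old) and Price_y (new).
merged = df.merge(df2, on=keys, how='left')

# Take the new price where there is one, otherwise keep the old one,
# and assign the result back to df explicitly instead of updating in place.
df['Price'] = merged['Price_y'].fillna(merged['Price_x']).astype(int)
print(df)
```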

I don't know if this is the case, but what if df2 carries something not listed in df1? Here I'm adding a row to df2 with the data Asia, Japan, DEF, 2020, 400.

import pandas as pd
import numpy as np

data1 = {
    'Region': ['Africa','Africa','Africa','Africa',
               'Africa','Africa','Africa','Africa',
               'Asia','Asia','Asia','Asia'],
    'Country': ['South Africa','South Africa',
                'South Africa','South Africa','South Africa',
                'South Africa','South Africa','South Africa',
                'Japan','Japan','Japan','Japan'],
    'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ',
                'XYZ','DEF','DEF','DEF','DEF'],
    'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018,
             2019,2016, 2017, 2018, 2019],
    'Price': [500, 400, 0,450,750,0,0,890,500,
              470,0,415]}

data2 = {
    'Region': ['Africa','Africa','Africa','Africa','Africa',
               'Africa','Asia','Asia', 'Asia'],
    'Country': ['South Africa','South Africa','South Africa',
                'South Africa','South Africa',
                'South Africa','Japan','Japan', 'Japan'],
    'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF',
                'DEF', 'DEF'],
    'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017, 2020],
    'Price': [200, 100, 30,750,350,120,400,370, 400]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

Here I call the first dataframe df1 instead of df. Then I'm adding a few steps so we know exactly what is going on.

First I rename Price to Price_new in df2, then I do an outer join between the two dataframes.

df2 = df2.rename(columns={"Price": "Price_new"})
cols_merge = ['Region', 'Country', 'Product', 'Year']
df = pd.merge(df1, df2, how="outer", on=cols_merge)

which gives

    Region       Country Product  Year  Price  Price_new
0   Africa  South Africa     ABC  2016  500.0      200.0
1   Africa  South Africa     ABC  2017  400.0      100.0
2   Africa  South Africa     ABC  2018    0.0       30.0
3   Africa  South Africa     ABC  2019  450.0      750.0
4   Africa  South Africa     XYZ  2016  750.0      350.0
5   Africa  South Africa     XYZ  2017    0.0      120.0
6   Africa  South Africa     XYZ  2018    0.0        NaN
7   Africa  South Africa     XYZ  2019  890.0        NaN
8     Asia         Japan     DEF  2016  500.0      400.0
9     Asia         Japan     DEF  2017  470.0      370.0
10    Asia         Japan     DEF  2018    0.0        NaN
11    Asia         Japan     DEF  2019  415.0        NaN
12    Asia         Japan     DEF  2020    NaN      400.0
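If you want to check where each row of the outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only` and `both`. A small sketch on the same data follows; the `check` and `counts` names are just for illustration.

```python
import pandas as pd

cols_merge = ['Region', 'Country', 'Product', 'Year']

df1 = pd.DataFrame({
    'Region':  ['Africa'] * 8 + ['Asia'] * 4,
    'Country': ['South Africa'] * 8 + ['Japan'] * 4,
    'Product': ['ABC'] * 4 + ['XYZ'] * 4 + ['DEF'] * 4,
    'Year':    [2016, 2017, 2018, 2019] * 3,
    'Price':   [500, 400, 0, 450, 750, 0, 0, 890, 500, 470, 0, 415],
})
# df2 including the extra 2020 row, with Price already renamed to Price_new
df2 = pd.DataFrame({
    'Region':    ['Africa'] * 6 + ['Asia'] * 3,
    'Country':   ['South Africa'] * 6 + ['Japan'] * 3,
    'Product':   ['ABC'] * 4 + ['XYZ'] * 2 + ['DEF'] * 3,
    'Year':      [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017, 2020],
    'Price_new': [200, 100, 30, 750, 350, 120, 400, 370, 400],
})

# indicator=True adds a '_merge' column describing the origin of each row
check = pd.merge(df1, df2, how='outer', on=cols_merge, indicator=True)
counts = check['_merge'].value_counts()
print(counts)

# Rows that exist only in df2 (here: the new 2020 row)
print(check[check['_merge'] == 'right_only'])
```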

Now, wherever Price_new is not null, we update the Price column:

df["Price"] = np.where(
    df["Price_new"].notnull(),
    df["Price_new"],
    df["Price"])

The output being

    Region       Country Product  Year  Price  Price_new
0   Africa  South Africa     ABC  2016  200.0      200.0
1   Africa  South Africa     ABC  2017  100.0      100.0
2   Africa  South Africa     ABC  2018   30.0       30.0
3   Africa  South Africa     ABC  2019  750.0      750.0
4   Africa  South Africa     XYZ  2016  350.0      350.0
5   Africa  South Africa     XYZ  2017  120.0      120.0
6   Africa  South Africa     XYZ  2018    0.0        NaN
7   Africa  South Africa     XYZ  2019  890.0        NaN
8     Asia         Japan     DEF  2016  400.0      400.0
9     Asia         Japan     DEF  2017  370.0      370.0
10    Asia         Japan     DEF  2018    0.0        NaN
11    Asia         Japan     DEF  2019  415.0        NaN
12    Asia         Japan     DEF  2020  400.0      400.0

And you can eventually remove the extra column with

df = df.drop(columns=["Price_new"])
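For what it's worth, the same "take the new value if there is one, else keep the old one, and also keep rows that exist only in df2" logic can be sketched more compactly with `combine_first` on a shared index. This starts again from df1 and df2 as first defined (that is, before Price was renamed), and the `keys` and `result` names are mine. It swaps in a different technique from the step-by-step version above and is not a drop-in replacement if you need the intermediate columns.

```python
import pandas as pd

keys = ['Region', 'Country', 'Product', 'Year']

df1 = pd.DataFrame({
    'Region':  ['Africa'] * 8 + ['Asia'] * 4,
    'Country': ['South Africa'] * 8 + ['Japan'] * 4,
    'Product': ['ABC'] * 4 + ['XYZ'] * 4 + ['DEF'] * 4,
    'Year':    [2016, 2017, 2018, 2019] * 3,
    'Price':   [500, 400, 0, 450, 750, 0, 0, 890, 500, 470, 0, 415],
})
df2 = pd.DataFrame({
    'Region':  ['Africa'] * 6 + ['Asia'] * 3,
    'Country': ['South Africa'] * 6 + ['Japan'] * 3,
    'Product': ['ABC'] * 4 + ['XYZ'] * 2 + ['DEF'] * 3,
    'Year':    [2016, 2017, 2018, 2019, 2016, 2017, 2016, 2017, 2020],
    'Price':   [200, 100, 30, 750, 350, 120, 400, 370, 400],
})

# df2 takes priority; gaps are filled from df1; the index is the union of both.
result = (
    df2.set_index(keys)
       .combine_first(df1.set_index(keys))
       .reset_index()
       .sort_values(keys, ignore_index=True)
)
print(result)
```

Note that a genuine NaN price in df2 would be treated as missing and replaced by the df1 value, which may or may not be what you want.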

Note

The other solutions are great and I upvoted them. I added this to show you that sometimes it is better to use less specific code in order to have better control and maintainability in your code.
