Python pandas: replace column values based on another DataFrame - a better way?

Question

Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult on Stack Overflow (please let me know if there's an easy way to do this).

Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values in one column to replace all zero values of another column.

df1: 

Name   Nonprofit    Business    Education

X      1             1           0
Y      0             1           0   <- Y and Z have zero values for Nonprofit and Educ
Z      0             0           0
Y      0             1           0

df2:

Name   Nonprofit    Education
Y       1            1     <- this df has the correct values. 
Z       1            1
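
On the side question of sharing dataframes as text: one common approach is to paste the whitespace-separated table into a string and parse it with pd.read_csv. The following is a minimal sketch that rebuilds the two toy frames above; the variable names df1_text and df2_text are only illustrative.

```python
import io

import pandas as pd

# Whitespace-separated text, as it would be pasted into a post
df1_text = """
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0
Z 0 0 0
Y 0 1 0
"""

df2_text = """
Name Nonprofit Education
Y 1 1
Z 1 1
"""

# sep=r"\s+" splits on any run of whitespace; blank lines are skipped by default
df1 = pd.read_csv(io.StringIO(df1_text), sep=r"\s+")
df2 = pd.read_csv(io.StringIO(df2_text), sep=r"\s+")

print(df1)
print(df2)
```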



pd.merge(df1, df2, on='Name', how='outer')

Name   Nonprofit_X    Business    Education_X     Nonprofit_Y     Education_Y
Y       1                1          1                1               1
Y      1                 1          1                1               1
X      1                 1          0               nan             nan   
Z      1                 1          1                1               1

In a previous post, I tried combine_first and dropna(), but these don't do the job.

I want to replace zeros in df1 with the values in df2. Furthermore, I want all rows with the same Name to be changed according to df2.

Name    Nonprofit     Business    Education
Y        1             1           1
Y        1             1           1 
X        1             1           0
Z        1             0           1

(To clarify: the value in the 'Business' column where Name = Z should be 0.)

My existing solution does the following: I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.

pubunis_df = df2
sdf = df1 

regex = str_to_regex(', '.join(pubunis_df.ORGS))

pubunis = searchnamesre(sdf, 'ORGS', regex)

# .ix was removed in pandas 1.0; .loc is the label-based equivalent
sdf.loc[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)

Answer 1

Attention: in recent versions of pandas, the isin-based answers by KSD and EdChum no longer work.

KSD's answer raises an error:

import pandas as pd

df1 = pd.DataFrame([["X",1,1,0],
              ["Y",0,1,0],
              ["Z",0,0,0],
              ["Y",0,1,0]],columns=["Name","Nonprofit","Business", "Education"])

df2 = pd.DataFrame([["Y",1,1],
              ["Z",1,1]],columns=["Name","Nonprofit", "Education"])   

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values

Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)

and EdChum's answer gives the wrong result:

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]

df1
Out[852]: 
  Name  Nonprofit  Business  Education
0    X        1.0         1        0.0
1    Y        1.0         1        1.0
2    Z        NaN         0        NaN
3    Y        NaN         1        NaN

It works reliably only if the values in column 'Name' are unique and sorted the same way in both data frames, because assigning a DataFrame aligns rows on the index, not on 'Name'.
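
The NaN values in that output come from index alignment: when the right-hand side is a DataFrame, pandas matches rows by index label. A small sketch, using the same toy frames, that compares the labels on both sides without performing the assignment:

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 1, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)

mask = df1["Name"].isin(df2["Name"])

# Index labels of the df1 rows selected by the mask
target_labels = df1.index[mask].tolist()
# Index labels available on the right-hand side
source_labels = df2.index.tolist()
# Only labels present on both sides can receive a value; the others get NaN
shared_labels = sorted(set(target_labels) & set(source_labels))

print(target_labels)  # [1, 2, 3]
print(source_labels)  # [0, 1]
print(shared_labels)  # [1]
```

Only label 1 is shared, which is why rows 2 and 3 end up as NaN. Row 1 (a Y row) is actually filled from df2's row 1 (the Z row), so it looks correct only because both df2 rows hold the same values.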

Here is my answer:

Way 1:

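df1 = df1.merge(df2,on='Name',how="left")
# df2 only has Nonprofit and Education, so only those columns get _x/_y suffixes.
# Take df2's value where a match exists, otherwise fall back to df1's value.
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(["Education_x","Nonprofit_x"],inplace=True,axis=1)
df1.rename(columns={'Education_y':'Education','Nonprofit_y':'Nonprofit'},inplace=True)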

Way 2:

df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)

More about update: the columns used as the index do not need to have the same name in both data frames before calling update; you could use 'Name1' and 'Name2'. It also works when df2 contains extra rows that match nothing in df1; those rows simply do not update df1. In other words, df2 doesn't need to be a superset of df1.

Example:

df1 = pd.DataFrame([["X",1,1,0],
              ["Y",0,1,0],
              ["Z",0,0,0],
              ["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])    

df2 = pd.DataFrame([["Y",1,1],
              ["Z",1,1],
              ['U',1,3]],columns=["Name2","Nonprofit", "Education"])   

df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')


df1.update(df2)

Result:

      Nonprofit  Business  Education
Name1                                
X           1.0         1        0.0
Y           1.0         1        1.0
Z           1.0         0        1.0
Y           1.0         1        1.0
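
Note that update overwrites every matching cell in the shared columns, not only the zeros. If only zeros should be replaced, as the question asks, one alternative is to look values up by Name and replace selectively. This is a minimal sketch rather than part of the original answer; it assumes Name is unique in df2, and the names lookup, replacement and use_new are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 1, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1], ["U", 1, 3]],
    columns=["Name", "Nonprofit", "Education"],
)

# Assumption: Name is unique in df2, so it can serve as a lookup index
lookup = df2.set_index("Name")

for col in ["Nonprofit", "Education"]:
    # df2's value for each row's Name; NaN where the Name is absent from df2
    replacement = df1["Name"].map(lookup[col])
    # Replace only cells that are zero and have a value available in df2
    use_new = df1[col].eq(0) & replacement.notna()
    df1[col] = df1[col].where(~use_new, replacement).astype(int)

print(df1)
```

Because the lookup goes through Name rather than the row index, duplicated names in df1 (the two Y rows) and extra names in df2 (U) need no special handling.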

Answer 2

Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df (here df is the question's df1 and df1 is the question's df2):

In [27]:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
  Name  Nonprofit  Business  Education
0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1

[4 rows x 4 columns]

Answer 3

This is the correct version of In [27]:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values

df
Out[27]:
  Name  Nonprofit  Business  Education
0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1

[4 rows x 4 columns]

The above works only when every row in df1 also exists in df. In other words, df should be a superset of df1.

If df1 has some rows that do not match any row in df (that is, df is not a superset of df1), use the following instead:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = \
    df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values

Answer 4

df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
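
The one-liner comes without explanation, so here is a hedged sketch of what it does. combine_first keeps the caller's non-null values and fills the remaining cells from the other frame, so df2's values take priority wherever df2 has a row for that Name. The sketch assumes Name is unique in both frames (unlike the question's df1, which repeats Y); the names left, right and combined are illustrative.

```python
import pandas as pd

# Assumption: unique names in both frames
left = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
right = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)

combined = (
    right.set_index("Name")                 # values that should win
    .combine_first(left.set_index("Name"))  # fill the gaps from left
    .reset_index()
)

print(combined)
```

Cells that exist only in left (row X and the Business column) are carried over unchanged. Like update, this takes df2's value for every matching cell, not only for cells that are zero in df1.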

Answer summary

Answer 1: score 63, 2019-04-11 11:55:15
Answer 2: score 37, accepted, 2014-07-15 21:45:45
Answer 3: score 23, 2017-10-03 18:05:13
Answer 4: score 5, 2020-02-04 15:00:22
