
Replace column values based on another dataframe python pandas - better way?

Note: for simplicity's sake, I'm using a toy example, because copying and pasting dataframes into Stack Overflow is difficult (please let me know if there's an easy way to do this).

Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values in one column to replace all zero values of another column.

df1: 

Name   Nonprofit    Business    Education

X      1             1           0
Y      0             1           0   <- Y and Z have zero values for Nonprofit and Educ
Z      0             0           0
Y      0             1           0

df2:

Name   Nonprofit    Education
Y       1            1     <- this df has the correct values. 
Z       1            1



pd.merge(df1, df2, on='Name', how='outer')

Name   Nonprofit_X    Business    Education_X     Nonprofit_Y     Education_Y
Y       1                1          1                1               1
Y      1                 1          1                1               1
X      1                 1          0               nan             nan   
Z      1                 1          1                1               1

In a previous post, I tried combine_first and dropna(), but these don't do the job.

I want to replace zeros in df1 with the values in df2. Furthermore, I want all rows with the same Name to be changed according to df2.

Name    Nonprofit     Business    Education
Y        1             1           1
Y        1             1           1 
X        1             1           0
Z        1             0           1

(To clarify: the value in the 'Business' column where Name = Z should be 0.)

My existing solution does the following: I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.

pubunis_df = df2
sdf = df1 

regex = str_to_regex(', '.join(pubunis_df.ORGS))

pubunis = searchnamesre(sdf, 'ORGS', regex)

sdf.loc[pubunis.index, ['Education', 'Public']] = 1  # .ix was removed in pandas 1.0; .loc does the same label-based selection
searchnamesre(sdf, 'ORGS', regex)

Attention: in the latest versions of pandas, the two other answers (KSD's and EdChum's) no longer work.

KSD's answer raises an error:

import pandas as pd

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 1, 0]],
                   columns=["Name", "Nonprofit", "Business", "Education"])

df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]],
                   columns=["Name", "Nonprofit", "Education"])

# Either form of the assignment raises the error below:
# three rows of df1 match a Name in df2, but df2 supplies only two rows of values.
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name), ['Nonprofit', 'Education']].values

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values

Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
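
To see where the mismatch comes from, it can help to compare the shape of the selection on the left with the shape of the array on the right. This is only a small diagnostic sketch on the toy frames; the names cols, mask, target_shape and source_shape are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 1, 0]],
                   columns=["Name", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]],
                   columns=["Name", "Nonprofit", "Education"])

cols = ["Nonprofit", "Education"]
mask = df1["Name"].isin(df2["Name"])

# Left-hand side: every df1 row whose Name appears in df2 (Y, Z, Y).
target_shape = df1.loc[mask, cols].shape
# Right-hand side: the raw values of df2 (Y, Z), with no index to align on.
source_shape = df2[cols].to_numpy().shape

print(target_shape, source_shape)
```

Because .values (or to_numpy()) drops the index, pandas can only match rows by position, so the row counts have to agree.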

EdChum's answer gives the wrong result:

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]

df1
Out[852]: 
  Name  Nonprofit  Business  Education
0    X        1.0         1        0.0
1    Y        1.0         1        1.0
2    Z        NaN         0        NaN
3    Y        NaN         1        NaN

It works safely only if the values in column 'Name' are unique and sorted in both data frames, because assigning a DataFrame aligns on the row index rather than on 'Name'.
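
The index alignment can be made visible without doing the assignment. The sketch below reindexes df2 to the row labels that the mask selects in df1, which approximates what the right-hand side looks like after alignment; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 1, 0]],
                   columns=["Name", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]],
                   columns=["Name", "Nonprofit", "Education"])

cols = ["Nonprofit", "Education"]
mask = df1["Name"].isin(df2["Name"])

# Row labels selected in df1 are 1, 2, 3; df2 only has labels 0 and 1.
selected_labels = df1.index[mask.to_numpy()]
aligned = df2[cols].reindex(selected_labels)

print(aligned)
# Label 1 picks up df2's row 1 (the values for Z, not Y);
# labels 2 and 3 do not exist in df2, so they become NaN.
```

Here the Y row at label 1 only looks right because Y and Z happen to carry the same values in df2.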

Here is my answer:

Way 1:

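# df2 shares 'Nonprofit' and 'Education' with df1, so those are the columns that get _x/_y suffixes.
df1 = df1.merge(df2, on='Name', how='left')
# Keep df2's value where there is one; otherwise fall back to df1's value.
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(['Nonprofit_x', 'Education_x'], inplace=True, axis=1)
df1.rename(columns={'Nonprofit_y': 'Nonprofit', 'Education_y': 'Education'}, inplace=True)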
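
The question asks to replace only the zeros. A minimal sketch of that variant of the merge approach follows, assuming df2 has exactly one row per Name; the '_new' suffix and the variable names are illustrative choices, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 1, 0]],
                   columns=["Name", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]],
                   columns=["Name", "Nonprofit", "Education"])

# Keep df1's column names unchanged and mark df2's columns with '_new'.
merged = df1.merge(df2, on="Name", how="left", suffixes=("", "_new"))

for col in ["Nonprofit", "Education"]:
    new = merged[col + "_new"]
    # Replace only where df1 holds 0 and df2 actually supplied a value.
    replace = merged[col].eq(0) & new.notna()
    merged[col] = merged[col].mask(replace, new).astype(int)

# Drop the helper columns by selecting df1's original columns.
result = merged[df1.columns]
print(result)
```

Non-zero values in df1 are left alone here, which differs from Way 1, where any value present in df2 wins.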

Way 2:

df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)

More about update: the columns that are set as the index do not need to have the same name in both data frames before calling update. You could try 'Name1' and 'Name2'. It also works when df2 contains extra rows that match nothing in df1; those rows simply do not update df1. In other words, df2 doesn't need to be a superset of df1.

Example:

df1 = pd.DataFrame([["X",1,1,0],
              ["Y",0,1,0],
              ["Z",0,0,0],
              ["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])    

df2 = pd.DataFrame([["Y",1,1],
              ["Z",1,1],
              ['U',1,3]],columns=["Name2","Nonprofit", "Education"])   

df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')


df1.update(df2)

Result:

      Nonprofit  Business  Education
Name1                                
X           1.0         1        0.0
Y           1.0         1        1.0
Z           1.0         0        1.0
Y           1.0         1        1.0
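
A related sketch that avoids setting an index at all: look each name up with Series.map. This is a different technique from update, shown only as an alternative under the assumption that the names in df2 are unique; like update, it overwrites df1 with whatever df2 provides and ignores the extra 'U' row.

```python
import pandas as pd

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 1, 0]],
                   columns=["Name1", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1],
                    ["U", 1, 3]],
                   columns=["Name2", "Nonprofit", "Education"])

# One lookup Series per column, keyed by the name in df2.
lookup = df2.set_index("Name2")

for col in ["Nonprofit", "Education"]:
    # map() yields NaN for names that are missing from df2 (here: X),
    # and fillna() keeps df1's original value in those rows.
    df1[col] = df1["Name1"].map(lookup[col]).fillna(df1[col]).astype(int)

print(df1)
```

The key columns can have different names here as well, since map only needs the values to match.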

Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:

In [27]:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
  Name  Nonprofit  Business  Education
0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1

[4 rows x 4 columns]

This is the correct one:

In [27]:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
  Name  Nonprofit  Business  Education
0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1

[4 rows x 4 columns]

The above works only when every row in df1 also exists in df. In other words, df should be a superset of df1.

If df1 has some rows that do not match any row in df (that is, df is not a superset of df1), use the following instead:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = \
    df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values

df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
