Replace column values based on another dataframe python pandas - better way?
Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult on Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values in one column to replace all zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Educ
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_X Business Education_X Nonprofit_Y Education_Y
Y 1 1 1 1 1
Y 1 1 1 1 1
X 1 1 0 nan nan
Z 1 1 1 1 1
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2. Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(To clarify: the value in the 'Business' column where Name = Z should be 0.)
My existing solution does the following: I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.loc[pubunis.index, ['Education', 'Public']] = 1  # .ix was removed in pandas 1.0; .loc is the label-based equivalent
searchnamesre(sdf, 'ORGS', regex)
KSD's answer will raise an error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
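The mismatch can be made visible by comparing the two sides of the assignment before running it. This is a minimal sketch using the same toy frames as above; the names `cols`, `mask`, `target_shape` and `source_shape` are introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 0, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)

cols = ["Nonprofit", "Education"]
mask = df1["Name"].isin(df2["Name"])

# Left-hand side: every df1 row whose Name appears in df2 (Y, Z and the second Y).
target_shape = df1.loc[mask, cols].shape
# Right-hand side: the raw values of df2, one row per Name (Y and Z).
source_shape = df2[cols].to_numpy().shape

print(target_shape, source_shape)
```

Because Y occurs twice in df1, the left-hand side has one more row than the right-hand side, so a positional assignment via .values cannot line the rows up.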
And EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, it will work safely only if values in column 'Name' are unique and are sorted in both data frames.
Here is my answer:
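One way to avoid depending on row order or uniqueness in df1 is to look the replacement values up by Name instead of by position. The following is only a sketch under the assumption that Name is unique in df2; `lookup`, `names` and `cols` are illustrative names, and the frames are the toy data from the question.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 1, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)

cols = ["Nonprofit", "Education"]

# Index df2 by Name so rows can be fetched by key (assumes unique Names in df2).
lookup = df2.set_index("Name")

# Rows of df1 that have a replacement available.
mask = df1["Name"].isin(lookup.index)

# One lookup row per matching df1 row, in df1's order; duplicates are repeated.
names = df1.loc[mask, "Name"].tolist()
df1.loc[mask, cols] = lookup.loc[names, cols].to_numpy()

print(df1)
```

Since the right-hand side is built with one row per matching row of df1, the shapes agree even when a Name is repeated or the two frames are sorted differently. Note that this overwrites all matching values, not only the zeros.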
# Option 1: left merge, then prefer df2's values and fall back to df1's
df1 = df1.merge(df2, on='Name', how='left')
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(['Education_x', 'Nonprofit_x'], inplace=True, axis=1)
df1.rename(columns={'Education_y': 'Education', 'Nonprofit_y': 'Nonprofit'}, inplace=True)

# Option 2: index both frames by Name and let update overwrite matching rows
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
More on update: the key columns that you set as the index do not need to have the same name in both data frames before calling update. You could try 'Name1' and 'Name2'. It also works when df2 contains extra rows that have no match in df1; those rows simply don't update df1. In other words, df2 doesn't need to be the superset of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
Result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
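In the result above the updated columns are shown as floats (1.0 rather than 1). If integer columns are wanted, one option is to remember the dtypes before calling update and cast back afterwards. This is a hedged sketch: it uses a smaller variant of the example with unique names, `original_dtypes` is a name made up here, and whether update changes the dtypes at all may depend on the pandas version.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0]],
    columns=["Name1", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1], ["U", 1, 3]],
    columns=["Name2", "Nonprofit", "Education"],
)

df1 = df1.set_index("Name1")

# Remember the column dtypes so they can be restored after the update.
original_dtypes = df1.dtypes.to_dict()

# Rows are matched on index labels; 'U' has no match in df1 and is ignored.
df1.update(df2.set_index("Name2"))

# Cast back; safe here because no NaN is left in the updated columns.
df1 = df1.astype(original_dtypes)

print(df1)
```

The cast would fail if any NaN remained in a column, so it only makes sense when every cell still holds a valid value after the update.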
Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
In [27]: This is the correct one.
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all rows in df1 exist in df. In other words, df should be a superset of df1.
If df1 has some rows that do not match any row in df, use the version below instead.
In other words, for the case where df is not a superset of df1:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = (
    df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
)
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
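To make the behaviour of the combine_first one-liner easier to follow, here is a small sketch on a variant of the toy data with unique names (behaviour with repeated names such as the two Y rows is not covered here). `combined` is an illustrative name, and the final column selection is an addition that only restores df1's column order.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame(
    [["Y", 1, 1], ["Z", 1, 1]],
    columns=["Name", "Nonprofit", "Education"],
)

# df2 is the caller, so its values win wherever it has one;
# gaps (rows or columns missing from df2) are filled from df1.
combined = (
    df2.set_index("Name")
    .combine_first(df1.set_index("Name"))
    .reset_index()
)

# Put the columns back in df1's original order.
combined = combined[df1.columns]

print(combined)
```

Keep in mind that df2's values take priority for every cell it contains, so a zero in df2 would also overwrite a non-zero value in df1; the dtypes of the result may also differ from the inputs depending on the pandas version.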