Partially update a dataframe based on selected rows and columns from another
I have two dataframes as follows:
df1
Name id c1 c2 c3 c4
---------------------------
asd 101 a b c d
cdf 231 e ? 1
zxs 342 f o
ygg 521 g k p
mlk 432 h m z
abc 343 c x q
xyz 254 1 d 2
fgg 165 c z d mm
mnd 766 2 d v
df2
df2_Name df2_id df2_c2 df2_c4
----------------------------------
asd 101 h d2
ygg 521 x cd
fgg 165 o cm
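I want to match 'Name' and 'id' in df1 against 'df2_Name' and 'df2_id' in df2. Wherever a match is found, the values of 'c2' and 'c4' in df1 should be replaced with the values of 'df2_c2' and 'df2_c4' from df2.
Desired output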
Name id c1 c2 c3 c4
-------------------------------
asd 101 a h c d2
cdf 231 e ? 1
zxs 342 f o
ygg 521 g x p cd
mlk 432 h m z
abc 343 c x q
xyz 254 1 d 2
fgg 165 c o d cm
mnd 766 2 d v
Attempted solution 1
df1[df1.set_index(['Name', 'id']).index.isin(df2.set_index(['df2_Name','df2_id']).index)].iloc[:,[3,5]].update(df2.iloc[:,[2,3]])
Result: df1 is returned unchanged.
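A likely reason attempt 1 has no visible effect: the boolean filter followed by `.iloc` produces a new object, and `update()` modifies that temporary object in place and returns `None`, so df1 itself is never touched. A minimal sketch of calling `update()` on the frame itself follows. The three-row frames are made-up stand-ins for the question's data, and the sketch assumes each (Name, id) pair occurs at most once in df2.

```python
import pandas as pd

# Small made-up frames that reuse the column names from the question
df1 = pd.DataFrame({
    "Name": ["asd", "cdf", "ygg"],
    "id": [101, 231, 521],
    "c1": ["a", "e", "g"],
    "c2": ["b", "?", "k"],
    "c3": ["c", "1", "p"],
    "c4": ["d", "", ""],
})
df2 = pd.DataFrame({
    "df2_Name": ["asd", "ygg"],
    "df2_id": [101, 521],
    "df2_c2": ["h", "x"],
    "df2_c4": ["d2", "cd"],
})

# update() aligns on index and column labels, not on position,
# so give both frames the same index (the key pair) and the same column names
left = df1.set_index(["Name", "id"])
right = df2.rename(
    columns={"df2_Name": "Name", "df2_id": "id", "df2_c2": "c2", "df2_c4": "c4"}
).set_index(["Name", "id"])

# update() changes `left` in place and returns None
left.update(right)
result = left.reset_index()
print(result)
```

Rows of df1 without a counterpart in df2 keep their original values, because `update()` only overwrites with non-NA values.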
Attempted solution 2
df1.loc[df1.set_index(['Name', 'id']).index.isin(df2.set_index(['df2_Name','df2_id']).index), ['c2', 'c4']] = df2[['df2_c2', 'df2_c4']]
Result: NaN values are introduced
Name id c1 c2 c3 c4
----------------------------
asd 101 a NaN c NaN
cdf 231 e ? 1
zxs 342 f o
ygg 521 g NaN p NaN
mlk 432 h m z
abc 343 c x q
xyz 254 1 d 2
fgg 165 c NaN d NaN
mnd 766 2 d v
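The NaN values in attempt 2 are most likely caused by label alignment: when a DataFrame is assigned through `.loc`, pandas aligns it on index and column labels, and 'df2_c2'/'df2_c4' do not match 'c2'/'c4'. One way around this, sketched below with made-up three-row frames, is to look up the df2 values in df1's row order with a left merge and then assign a plain NumPy array, which is positional. The names `looked_up` and `matched` are illustrative, and the sketch assumes the key pairs in df2 are unique.

```python
import pandas as pd

# Small made-up frames that reuse the column names from the question
df1 = pd.DataFrame({
    "Name": ["asd", "cdf", "ygg"],
    "id": [101, 231, 521],
    "c1": ["a", "e", "g"],
    "c2": ["b", "?", "k"],
    "c3": ["c", "1", "p"],
    "c4": ["d", "", ""],
})
df2 = pd.DataFrame({
    "df2_Name": ["asd", "ygg"],
    "df2_id": [101, 521],
    "df2_c2": ["h", "x"],
    "df2_c4": ["d2", "cd"],
})

# A left merge keeps every df1 row in df1's order; unmatched rows get NaN
looked_up = df1[["Name", "id"]].merge(
    df2, how="left", left_on=["Name", "id"], right_on=["df2_Name", "df2_id"]
)
matched = looked_up["df2_Name"].notna().to_numpy()

# to_numpy() drops the labels, so this assignment is by position, not by label
df1.loc[matched, ["c2", "c4"]] = looked_up.loc[matched, ["df2_c2", "df2_c4"]].to_numpy()
print(df1)
```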
Attempted solution 3 (works for c2 only)
merged = df1.merge(df2, left_on=["id", "Name"], right_on=["df2_id", "df2_Name"])
merged["c2"] = merged.apply(lambda x: x["c2"] if pd.isnull(x["df2_c2"]) else x["df2_c2"], axis=1)
Result:
Name id c1 c2 c3 c4 df2_Name df2_id df2_c2 df2_c4
--------------------------------------------------------------
asd 101 a h c d asd 101 h d2
ygg 521 g x p ygg 521 x cd
fgg 165 c o d mm fgg 165 o cm
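Solution 3 does replace the values in the selected column, but it returns the merged dataframe rather than the whole of df1 with the updates applied.
Can anyone help me solve this?
Note:
I am posting this question after trying the following solutions without success: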
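I would use merge to join the two dataframes. You then get columns containing the old values alongside columns containing the new values and NaN values. After that, use apply to combine these columns: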
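import pandas as pd
# how='left' keeps every df1 row in its original order; column names match the question ('Name', 'df2_Name')
merged = df1.merge(df2, how='left', left_on=["id", "Name"], right_on=["df2_id", "df2_Name"])
merged["c2"] = merged.apply(lambda x: x["c2"] if pd.isnull(x["df2_c2"]) else x["df2_c2"], axis=1)
# Same for c4
merged["c4"] = merged.apply(lambda x: x["c4"] if pd.isnull(x["df2_c4"]) else x["df2_c4"], axis=1)
# Drop the columns that came from df2
merged = merged.drop(columns=["df2_Name", "df2_id", "df2_c2", "df2_c4"])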
I can't test this at the moment, so please let me know whether it works for you.
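The merge-then-combine idea can also be written without a row-wise apply. The following is a minimal sketch, not the answer author's code: it uses `Series.fillna` to prefer the df2 value and fall back to the df1 value. The three-row frames are made up, the variable name `updated` is illustrative, and the key pairs in df2 are assumed to be unique.

```python
import pandas as pd

# Small made-up frames that reuse the column names from the question
df1 = pd.DataFrame({
    "Name": ["asd", "cdf", "ygg"],
    "id": [101, 231, 521],
    "c1": ["a", "e", "g"],
    "c2": ["b", "?", "k"],
    "c3": ["c", "1", "p"],
    "c4": ["d", "", ""],
})
df2 = pd.DataFrame({
    "df2_Name": ["asd", "ygg"],
    "df2_id": [101, 521],
    "df2_c2": ["h", "x"],
    "df2_c4": ["d2", "cd"],
})

# Left merge: all df1 rows are kept, df2 columns are NaN where nothing matched
merged = df1.merge(
    df2, how="left", left_on=["id", "Name"], right_on=["df2_id", "df2_Name"]
)

for col in ["c2", "c4"]:
    # Prefer the df2 value; fall back to the df1 value where df2 has none
    merged[col] = merged["df2_" + col].fillna(merged[col])

# Keep only the original df1 columns, in their original order
updated = merged[df1.columns]
print(updated)
```

Working on whole columns avoids calling a Python lambda once per row, which tends to matter on larger frames.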
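# Excel file df1_df2.xlsx contains two sheets named df1 and df2
# In df2 the column names are 'Name' 'Id' 'c_2' 'c_4'
# In df1 the column names are 'Name' 'Id' 'c1' 'c2' 'c3' 'c4'
import pandas as pd  # pandas needs openpyxl installed to read .xlsx files
url = "df1_df2.xlsx"
df = pd.ExcelFile(url)
df1 = df.parse('df1')
df2 = df.parse('df2')
# Merge on both key columns so that 'Name' stays a single column; how='left' keeps df1's rows and order
merged = pd.merge(df1, df2, how='left', on=['Name', 'Id'])
# Use the df2 value when present, otherwise keep the df1 value
merged["c2"] = merged.apply(lambda x: x["c2"] if pd.isnull(x["c_2"]) else x["c_2"], axis=1)
merged["c4"] = merged.apply(lambda x: x["c4"] if pd.isnull(x["c_4"]) else x["c_4"], axis=1)
# reindex returns a new frame, so assign the result; this also drops c_2 and c_4
merged = merged.reindex(['Name','Id','c1','c2','c3','c4'], axis=1)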
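The choice of join keys matters when both sheets share a 'Name' column. The sketch below uses small in-memory frames in place of the Excel sheets (the data is made up) to compare merging on 'Id' alone with merging on both 'Name' and 'Id'.

```python
import pandas as pd

# In-memory stand-ins for the two Excel sheets
df1 = pd.DataFrame({"Name": ["asd", "cdf"], "Id": [101, 231], "c2": ["b", "?"]})
df2 = pd.DataFrame({"Name": ["asd"], "Id": [101], "c_2": ["h"]})

# 'Name' exists in both frames but is not a join key here,
# so pandas disambiguates it with the default suffixes _x and _y
by_id = pd.merge(df1, df2, how="left", on=["Id"])
print(list(by_id.columns))    # ['Name_x', 'Id', 'c2', 'Name_y', 'c_2']

# With both columns as join keys, 'Name' stays a single column
by_both = pd.merge(df1, df2, how="left", on=["Name", "Id"])
print(list(by_both.columns))  # ['Name', 'Id', 'c2', 'c_2']
```

With the first variant, a later `reindex` that asks for a 'Name' column would produce a column of NaN, since only 'Name_x' and 'Name_y' exist.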