Merging Pandas dataframes on column label and overwriting other values in matched rows
I have these dataframes:
import pandas as pd

rec = pd.DataFrame({'batch': ["001","002","003"],
                    'A': [1, 2, 3],
                    'B': [4, 5, 6]})
ing1 = pd.DataFrame({'batch': ["002","003","004"],
                     'C': [12, 13, 14],
                     'D': [15, 16, 17],
                     'E': [18, 19, 10]})
ing2 = pd.DataFrame({'batch': ["001","011","012"],
                     'C': [20, 21, 22],
                     'D': [23, 24, 25],
                     'F': [26, 27, 28]})
What I want is the following merged dataset, where columns with the same label are overwritten by the later merged dataset, and new columns are created for labels that do not exist yet.
batch A B C D E F
0 001 1 4 20 23 NaN 26.0
1 002 2 5 12 15 18.0 NaN
2 003 3 6 13 16 19.0 NaN
I first tried merging rec with ing1:
final = pd.merge(rec, ing1, how ='left', on='batch', sort=False)
Intermediate result:
batch A B C D E
0 001 1 4 NaN NaN NaN
1 002 2 5 12.0 15.0 18.0
2 003 3 6 13.0 16.0 19.0
Then I merged a second time with ing2, to obtain the missing information in columns C, D and E.
final = pd.merge(final, ing2, how ='left', on='batch', sort=False)
Result (not as expected):
batch A B C_x D_x E C_y D_y F
0 001 1 4 NaN NaN NaN 20.0 23.0 26.0
1 002 2 5 12.0 15.0 18.0 NaN NaN NaN
2 003 3 6 13.0 16.0 19.0 NaN NaN NaN
I have also tried merge, concat, and combine_first; however, these seem to append the data from the second table onto the primary table. The only approach I can think of is to split the dataframe into rows that need to pull data from ing1 and rows that need ing2, then append them to each other for the final dataset.
How about just applying np.where() after merging? If the right column (with suffix "_y") is not NA, take the right value; otherwise take the left.
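import numpy as np

final = rec.merge(ing1, how='left', on='batch')\
           .merge(ing2, how='left', on='batch')
# Prefer the ing2 value ("_y") where it is present, otherwise keep the ing1 value ("_x")
final[["C", "D"]] = np.where(final[["C_y", "D_y"]].notna(), final[["C_y", "D_y"]], final[["C_x", "D_x"]])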
Output:
print(final[["A","B","C","D","E","F"]])
A B C D E F
0 1 4 20.0 23.0 NaN 26.0
1 2 5 12.0 15.0 18.0 NaN
2 3 6 13.0 16.0 19.0 NaN
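The same "take the right value if present, else the left" rule can probably be written without hardcoding C and D. The following is only a sketch under some assumptions: it redefines the three frames from the question so that it runs on its own, it derives the overlapping columns from ing1 and ing2, and it swaps np.where() for Series.combine_first(), which fills the gaps of one series from another. The names merged and overlap are made up for this illustration.

```python
import pandas as pd

# Frames from the question, redefined so the sketch is self-contained.
rec = pd.DataFrame({"batch": ["001", "002", "003"],
                    "A": [1, 2, 3],
                    "B": [4, 5, 6]})
ing1 = pd.DataFrame({"batch": ["002", "003", "004"],
                     "C": [12, 13, 14],
                     "D": [15, 16, 17],
                     "E": [18, 19, 10]})
ing2 = pd.DataFrame({"batch": ["001", "011", "012"],
                     "C": [20, 21, 22],
                     "D": [23, 24, 25],
                     "F": [26, 27, 28]})

merged = (rec.merge(ing1, how="left", on="batch")
             .merge(ing2, how="left", on="batch"))

# Columns present in both ing1 and ing2 (apart from the key) get "_x"/"_y" suffixes.
overlap = [c for c in ing1.columns if c in ing2.columns and c != "batch"]

for col in overlap:
    # Take the ing2 value ("_y") where it exists, otherwise fall back to ing1 ("_x").
    merged[col] = merged[col + "_y"].combine_first(merged[col + "_x"])

# Drop the suffixed helper columns once they have been combined.
merged = merged.drop(columns=[col + suffix for col in overlap for suffix in ("_x", "_y")])
```

Note that the combined columns are appended at the end of the frame, so select the columns explicitly if the order matters.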
Actually, df.update() may be conceptually the closest function to what you're asking for. However, you have to set the index and preallocate the output dataframe in advance. This may or may not cause more trouble than .merge().
Code:
# set index
rec.set_index("batch", inplace=True)
ing1.set_index("batch", inplace=True)
ing2.set_index("batch", inplace=True)
# preallocate
final = pd.DataFrame(columns=["A","B","C","D","E","F"], index=rec.index)
# update in order
final.update(rec)
final.update(ing1)
final.update(ing2)
Result:
print(final)
A B C D E F
batch
001 1 4 20 23 NaN 26
002 2 5 12 15 18 NaN
003 3 6 13 16 19 NaN
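If preallocating the output frame feels awkward, a different index-based route may also work: DataFrame.combine_first(), which the question mentions trying. This is a sketch of a swapped-in technique, not the update() method above. It assumes the original, unindexed frames (redefined here so the block runs on its own), and the names rec_i, ing1_i, ing2_i and ingredients are made up for this illustration.

```python
import pandas as pd

# Frames from the question, redefined so the sketch is self-contained.
rec = pd.DataFrame({"batch": ["001", "002", "003"],
                    "A": [1, 2, 3],
                    "B": [4, 5, 6]})
ing1 = pd.DataFrame({"batch": ["002", "003", "004"],
                     "C": [12, 13, 14],
                     "D": [15, 16, 17],
                     "E": [18, 19, 10]})
ing2 = pd.DataFrame({"batch": ["001", "011", "012"],
                     "C": [20, 21, 22],
                     "D": [23, 24, 25],
                     "F": [26, 27, 28]})

# Index by batch without modifying the original frames.
rec_i = rec.set_index("batch")
ing1_i = ing1.set_index("batch")
ing2_i = ing2.set_index("batch")

# ing2 takes priority; cells missing from ing2 are filled from ing1.
# The result covers the union of both indexes and both sets of columns.
ingredients = ing2_i.combine_first(ing1_i)

# A left join on the index keeps only the batches present in rec.
final = rec_i.join(ingredients, how="left")
```

Because the later frame is the caller of combine_first(), its values win wherever both frames have data for the same batch and column, which matches the "overwritten by the later merged dataset" requirement.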