Merging Pandas dataframes on column label and overwriting other values in matched rows

I have these dataframes:

import pandas as pd

rec = pd.DataFrame({'batch': ["001","002","003"], 
                    'A': [1, 2, 3], 
                    'B': [4, 5, 6]})

ing1 = pd.DataFrame({'batch': ["002","003","004"], 
                     'C': [12, 13, 14], 
                     'D': [15, 16, 17], 
                     'E': [18, 19, 10]})

ing2 = pd.DataFrame({'batch': ["001","011","012"],
                     'C': [20, 21, 22], 
                     'D': [23, 24, 25], 
                     'F': [26, 27, 28]})

What I want is the following merged dataset, where columns with the same label are overwritten by the later-merged dataset, and new columns are created for labels that do not exist yet.

  batch  A  B   C   D     E     F
0   001  1  4  20  23   NaN  26.0
1   002  2  5  12  15  18.0   NaN
2   003  3  6  13  16  19.0   NaN

I have tried to merge rec with ing1 first:

final = pd.merge(rec, ing1, how='left', on='batch', sort=False)

Intermediate result:

  batch  A  B     C     D     E
0   001  1  4   NaN   NaN   NaN
1   002  2  5  12.0  15.0  18.0
2   003  3  6  13.0  16.0  19.0

Then I merge a second time with ing2, to obtain the missing information in columns C, D and E.

final = pd.merge(final, ing2, how='left', on='batch', sort=False)

Result (not as expected):

  batch  A  B   C_x   D_x     E   C_y   D_y     F
0   001  1  4   NaN   NaN   NaN  20.0  23.0  26.0
1   002  2  5  12.0  15.0  18.0   NaN   NaN   NaN
2   003  3  6  13.0  16.0  19.0   NaN   NaN   NaN

I have also tried merge, concat, and combine_first, but these all seem to append the data from the second table onto the primary table rather than overwrite it. The only approach I can think of is to split the dataframe into rows that need to pull data from ing1 and rows that need ing2, then append them to each other for the final dataset.

How about just applying np.where() after merging? If the right column (with suffix "_y") is not NA, take the right value; otherwise take the left.

import numpy as np

final = rec.merge(ing1, how='left', on='batch')\
           .merge(ing2, how='left', on='batch')
# prefer ing2's value ("_y") where it is present, otherwise keep ing1's ("_x")
final[["C", "D"]] = np.where(~final[["C_y", "D_y"]].isna(), final[["C_y", "D_y"]], final[["C_x", "D_x"]])

Output:

print(final[["A","B","C","D","E","F"]])

   A  B     C     D     E     F
0  1  4  20.0  23.0   NaN  26.0
1  2  5  12.0  15.0  18.0   NaN
2  3  6  13.0  16.0  19.0   NaN
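
If there are many overlapping columns, listing every "_x"/"_y" pair by hand gets tedious. Below is a minimal sketch of the same idea applied to whatever columns happen to overlap. The helper name merge_overwrite and the "_new" suffix are made up for illustration, and it assumes no existing column label already ends in "_new". It uses Series.combine_first, which keeps the caller's values and fills its NA entries from the other series.

```python
import pandas as pd

rec = pd.DataFrame({"batch": ["001", "002", "003"], "A": [1, 2, 3], "B": [4, 5, 6]})
ing1 = pd.DataFrame({"batch": ["002", "003", "004"],
                     "C": [12, 13, 14], "D": [15, 16, 17], "E": [18, 19, 10]})
ing2 = pd.DataFrame({"batch": ["001", "011", "012"],
                     "C": [20, 21, 22], "D": [23, 24, 25], "F": [26, 27, 28]})


def merge_overwrite(left, right, on="batch"):
    """Left-merge right into left; for shared columns, non-NA values from right win.

    Hypothetical helper: assumes no column label in left or right ends in "_new".
    """
    # Left columns keep their names; overlapping right columns get "_new".
    merged = left.merge(right, how="left", on=on, suffixes=("", "_new"))
    for col in right.columns:
        new_col = f"{col}_new"
        if new_col in merged.columns:
            # Take the newer value where present, else fall back to the old one.
            merged[col] = merged[new_col].combine_first(merged[col])
            merged = merged.drop(columns=new_col)
    return merged


final = merge_overwrite(merge_overwrite(rec, ing1), ing2)
print(final)
```

As with the np.where() version, an NA in the later table never overwrites an existing value, and C and D come out as floats because the intermediate merge introduces NaN.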

Actually, df.update() may be conceptually the closest function to what you're asking for. However, you have to set the index and pre-allocate the output dataframe in advance. This may or may not cause more trouble than .merge().

Code:

# set index
rec.set_index("batch", inplace=True)
ing1.set_index("batch", inplace=True)
ing2.set_index("batch", inplace=True)

# preallocate
final = pd.DataFrame(columns=["A","B","C","D","E","F"], index=rec.index)
# update in order
final.update(rec)
final.update(ing1)
final.update(ing2)

Result:

print(final)

       A  B   C   D    E    F
batch                        
001    1  4  20  23  NaN   26
002    2  5  12  15   18  NaN
003    3  6  13  16   19  NaN
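
Another possibility, not taken from the answers above, is to resolve the overlap between the ingredient tables first and then merge only once, so no "_x"/"_y" columns are ever created. This is only a sketch: it assumes "batch" is the sole key and that "later wins" means the last non-NA value per column. It stacks the tables in priority order with pd.concat and collapses each batch with groupby(...).last(), which returns the last non-null entry of every column within a group. The names combined and final_single are illustrative.

```python
import pandas as pd

rec = pd.DataFrame({"batch": ["001", "002", "003"], "A": [1, 2, 3], "B": [4, 5, 6]})
ing1 = pd.DataFrame({"batch": ["002", "003", "004"],
                     "C": [12, 13, 14], "D": [15, 16, 17], "E": [18, 19, 10]})
ing2 = pd.DataFrame({"batch": ["001", "011", "012"],
                     "C": [20, 21, 22], "D": [23, 24, 25], "F": [26, 27, 28]})

# Stack the tables in priority order (later tables override earlier ones),
# then keep the last non-NA value of each column for every batch.
combined = (
    pd.concat([ing1, ing2], ignore_index=True)
      .groupby("batch", as_index=False)
      .last()
)

# A single left merge keeps only the batches present in rec.
final_single = rec.merge(combined, how="left", on="batch")
print(final_single)
```

Adding a third or fourth table only means extending the list passed to pd.concat, with the highest-priority table last.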
