How to merge Pandas dataframes without duplicating columns

I have data of the form:

frame1 = pd.DataFrame({'supplier1_match0': ['x'], 'id': [1]})
frame2 = pd.DataFrame({'supplier1_match0': ['2x'], 'id': [2]})

and wish to left join multiple such frames onto a frame like this:

base_frame = pd.DataFrame({'id':[1,2,3]})

I merge on id and get:

merged = base_frame.merge(frame1, how='left', left_on='id', right_on='id')
merged = merged.merge(frame2, how='left', left_on='id', right_on='id')

   id supplier1_match0_x supplier1_match0_y
0   1                  x                NaN
1   2                NaN                 2x
2   3                NaN                NaN

The column is duplicated, with '_x' and '_y' suffixes appended. Here is what I need:

id, supplier1_match0, ...
1,  x
2,  2x
3, NaN

Is there a simple way to achieve this? There is a similar question (Nested dictionary to multiindex dataframe where dictionary keys are column labels) but the data has a different shape. Note that I have multiple suppliers and that they have varying numbers of matches, so I can't assume the data will have a "rectangular" shape. Thanks in advance.

Your problem is that you don't really want to merge each frame in turn. Because frame1 and frame2 share the same column name, every extra merge adds a suffixed copy of it. You need to concat your first set of frames, then merge the result once.

import pandas as pd
import numpy as np

base_frame.merge(pd.concat([frame1, frame2]), how='left')

#   id supplier1_match0
#0   1                x
#1   2               2x
#2   3              NaN
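
The question mentions multiple suppliers with varying numbers of matches. In that case the concatenated frame can contain several rows for the same id, and the merge would return more rows than base_frame. Here is a minimal sketch of one way to handle that, assuming a hypothetical third frame with a supplier2_match0 column (not part of the original data): collapse the stacked rows with groupby('id').first(), which keeps the first non-null value per column, and merge afterwards.

```python
import pandas as pd

base_frame = pd.DataFrame({'id': [1, 2, 3]})

frames = [
    pd.DataFrame({'supplier1_match0': ['x'], 'id': [1]}),
    pd.DataFrame({'supplier1_match0': ['2x'], 'id': [2]}),
    # Hypothetical extra frame: a second supplier that also matches id 1.
    pd.DataFrame({'supplier2_match0': ['y'], 'id': [1]}),
]

# Stack all frames; columns missing from a given frame are filled with NaN.
stacked = pd.concat(frames, ignore_index=True)

# Collapse to one row per id, taking the first non-null value in each column.
collapsed = stacked.groupby('id', as_index=False).first()

# A single left merge keeps every id from base_frame exactly once.
result = base_frame.merge(collapsed, on='id', how='left')
print(result)
```

Like the .update approach, this silently keeps only one value if two frames provide different non-null values for the same id and column.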

Alternatively, you could define base_frame so that it already has all of the relevant columns of the other frames, set id as the index, and use .update. This ensures base_frame keeps the same size, which the approach above does not guarantee. Note, though, that data will be overwritten if there are multiple non-null values for a given cell.

# Pre-create the target column (all NaN) so that update() has cells to fill in
base_frame = pd.DataFrame({'id': [1, 2, 3]}).assign(supplier1_match0=np.nan).set_index('id')

for df in [frame1, frame2]:
    base_frame.update(df.set_index('id'))

print(base_frame)

   supplier1_match0
id                 
1                 x
2                2x
3               NaN
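
To illustrate the overwrite caveat, here is a small sketch assuming a hypothetical conflicting frame that supplies a second value for id 1. By default DataFrame.update lets the later frame win; passing overwrite=False only fills cells that are still null, so the first value is kept. The target column is created with object dtype here so that string values fit without a dtype change.

```python
import pandas as pd

# Empty target indexed by id; a list of None gives an object-dtype column.
base = pd.DataFrame(
    {'supplier1_match0': [None, None, None]},
    index=pd.Index([1, 2, 3], name='id'),
)

first = pd.DataFrame({'supplier1_match0': ['x'], 'id': [1]}).set_index('id')
# Hypothetical conflicting frame: a different value for the same id and column.
conflict = pd.DataFrame({'supplier1_match0': ['z'], 'id': [1]}).set_index('id')

# Default behaviour: each update overwrites, so the last frame wins.
overwriting = base.copy()
for df in [first, conflict]:
    overwriting.update(df)

# overwrite=False: only cells that are still null are filled, so the first frame wins.
keeping = base.copy()
for df in [first, conflict]:
    keeping.update(df, overwrite=False)

print(overwriting.loc[1, 'supplier1_match0'])
print(keeping.loc[1, 'supplier1_match0'])
```

Which behaviour is right depends on whether earlier or later frames should take precedence in your data.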

newdf_merge = pd.merge(pd.DataFrame(df1), pd.DataFrame(df2), left_on=['common column name from df1'], right_on=['common column name from df2'], how='left')

It worked for me, so I wanted to share it here.
