[英]Merging dataframes without duplicating columns
I have 3 dataframes with different structures, where one contains the 2 keys to link with the other two:
df1 = id1  id2      df2 = id1  a    b1   c1   c2      df3 = id2  a     b1    b2    c1
      1                   1    1a   1b1  1c1  1c2           11   11a   11b1  11b2  11c1
           11             2    2a   2b1  2c1  2c2           12   12a   12b1  12b2  12c1
           12             3    3a   3b1  3c1  3c2           13   13a   13b1  13b2  13c1
           13                                               14   14a   14b1  14b2  14c1
      2                                                     21   21a   21b1  21b2  21c1
           21                                               22   22a   22b1  22b2  22c1
           22                                               23   23a   23b1  23b2  23c1
                                                            31   31a   31b1  31b2  31c1
Then I merge df1 with df2:
df1 = pd.merge(df1, df2, on='id1', how='left')
df1 = id1  id2   a    b1   c1   c2
      1          1a   1b1  1c1  1c2
           11    nan  nan  nan  nan
           12    nan  nan  nan  nan
           13    nan  nan  nan  nan
      2          2a   2b1  2c1  2c2
           21    nan  nan  nan  nan
           22    nan  nan  nan  nan
But when I merge with df3 I get:
df1 = pd.merge(df1, df3, on='id2', how='left')
df1 = id1  id2   a_x  b1_x  c1_x  c2    a_y  b1_y  b2    c1_y
      1          1a   1b1   1c1   1c2
           11    nan  nan   nan   nan   11a  11b1  11b2  11c1
           12    nan  nan   nan   nan   12a  12b1  12b2  12c1
           13    nan  nan   nan   nan   13a  13b1  13b2  13c1
      2          2a   2b1   2c1   2c2
           21    nan  nan   nan   nan   21a  21b1  21b2  21c1
           22    nan  nan   nan   nan   22a  22b1  22b2  22c1
In a nutshell, when there are overlapping columns between the dataframes being merged, the method creates new columns with the suffixes. However, I want the values to be combined into a single column when the column names coincide.
What I'm trying to get is this:
df1 = id1  id2   a    b1    c1    c2    b2
      1          1a   1b1   1c1   1c2
           11    11a  11b1  11c1        11b2
           12    12a  12b1  12c1        12b2
           13    13a  13b1  13c1        13b2
      2          2a   2b1   2c1   2c2
           21    21a  21b1  21c1        21b2
           22    22a  22b1  22c1        22b2
I also tried fillna('') before the second merge, but I get the same result.
Try like below:
df1 = pd.merge(df1, df3, on='id2', how='left')
df1['a'] = df1['a_y'].fillna(df1['a_x'])
df1['b1'] = df1['b1_y'].fillna(df1['b1_x'])
df1['c1'] = df1['c1_y'].fillna(df1['c1_x'])
df1 = df1.drop(columns=['a_x', 'a_y', 'b1_x', 'b1_y', 'c1_x', 'c1_y'])
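A self-contained sketch of this coalescing pattern, using tiny made-up frames (the single-row data here is hypothetical, just mirroring the question's column layout):

```python
import pandas as pd

# Hypothetical miniature versions of the question's frames.
df1 = pd.DataFrame({'id1': [1, None], 'id2': [None, 11]})
df2 = pd.DataFrame({'id1': [1], 'a': ['1a'], 'b1': ['1b1'], 'c1': ['1c1'], 'c2': ['1c2']})
df3 = pd.DataFrame({'id2': [11], 'a': ['11a'], 'b1': ['11b1'], 'b2': ['11b2'], 'c1': ['11c1']})

df1 = pd.merge(df1, df2, on='id1', how='left')
df1 = pd.merge(df1, df3, on='id2', how='left')

# For each column that got suffixed, prefer the df3 (_y) value,
# fall back to the df2 (_x) value, then drop the suffixed pair.
for col in ['a', 'b1', 'c1']:
    df1[col] = df1[col + '_y'].fillna(df1[col + '_x'])
    df1 = df1.drop(columns=[col + '_x', col + '_y'])

print(df1)
```

Looping over the overlapping column names keeps this from growing one line per column as the frames get wider.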
This is a surprisingly difficult problem in pandas. I've been trying to deal with it as well. One option is to create a separate dataframe for each individual merge, and then concat them together. I don't think that's too "workaround-y":
df_m1 = pd.merge(df1, df2, on='id1', how='inner') # note it's an inner merge
df_m2 = pd.merge(df1, df3, on='id2', how='inner')
df1 = pd.concat([df_m1, df_m2])
However, there will be one problem: if there were rows in df1 that couldn't be merged with df2 or df3 but that you wanted to keep, they won't have survived the inner merges above. You'll have to add them back manually. At this point it would be great if you could just add back the rows whose indexes aren't in df_m1 or df_m2, but the problem is that merging doesn't preserve the index (see: here), which complicates this even further.
So you could modify the above to:
df_m1 = pd.merge(df1, df2, on='id1', how='inner') # note it's an inner merge
df_m2 = pd.merge(df1, df3, on='id2', how='inner')
df1 = pd.concat([df_m1, df_m2, df1[~df1.id1.isin(df2.id1) & ~df1.id2.isin(df3.id2)]])
It would be nice if there were a better way to do that last part. The approach above is also loopable if you need to merge an arbitrary number of dataframes.
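For example, such a loop might look like this (the names lookups, pieces, and unmatched, and the tiny data, are made up for illustration):

```python
import pandas as pd

# Hypothetical spine frame plus any number of (key, lookup frame) pairs.
spine = pd.DataFrame({'id1': [1, None], 'id2': [None, 11]})
lookups = [
    ('id1', pd.DataFrame({'id1': [1], 'a': ['1a']})),
    ('id2', pd.DataFrame({'id2': [11], 'a': ['11a']})),
]

# One inner merge per lookup table ...
pieces = [pd.merge(spine, right, on=key, how='inner') for key, right in lookups]

# ... then keep the spine rows whose keys matched none of them.
unmatched = spine
for key, right in lookups:
    unmatched = unmatched[~unmatched[key].isin(right[key])]

result = pd.concat(pieces + [unmatched], ignore_index=True)
```

The isin filter repeats the answer's last step once per lookup table instead of hard-coding df2 and df3.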
EDIT: Alternatively, since in the general case, when you want to merge more than 3 dataframes, it helps to do the last part with indexes, you can do the following:
df1['old_index'] = df1.index # this will let you keep the index
df_m1 = pd.merge(df1, df2, on='id1', how='inner') # note it's an inner merge
df_m2 = pd.merge(df1, df3, on='id2', how='inner')
df_other = df1[~df1.old_index.isin(pd.concat([df_m1, df_m2]).old_index)]
df1 = pd.concat([df_m1, df_m2, df_other])
This would be much easier to put in a loop.
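A sketch of that loop, again over a hypothetical list of (key, lookup-frame) pairs; old_index plays the same role as above:

```python
import pandas as pd

# Hypothetical spine frame and (key, lookup frame) pairs.
spine = pd.DataFrame({'id1': [1, None, 3], 'id2': [None, 11, None]})
lookups = [
    ('id1', pd.DataFrame({'id1': [1], 'a': ['1a']})),
    ('id2', pd.DataFrame({'id2': [11], 'a': ['11a']})),
]

spine['old_index'] = spine.index  # remember each row across the merges

pieces = [pd.merge(spine, right, on=key, how='inner') for key, right in lookups]

# Rows whose old_index never showed up in any merge get added back as-is.
matched_idx = pd.concat(pieces).old_index
leftover = spine[~spine.old_index.isin(matched_idx)]

result = pd.concat(pieces + [leftover]).drop(columns='old_index')
```

Filtering on old_index sidesteps the per-key isin checks entirely, which is what makes this version scale more cleanly to many lookup tables.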