How to merge two data frames with the same column names horizontally, based on matching values in one column
I have two data frames as shown below:
A | B | C | D
---|---|---|---
Red | 36 | 1 | type-1
Blue | 78 | 2 | type-1
Green | 59 | 3 | type-1

A | B | C | D
---|---|---|---
Orange | 78 | 5 | type-2
Purple | 59 | 7 | type-2
Brown | 36 | 9 | type-2
I want to merge the two data frames above on column B, and after the merge I want to keep the same column names, as shown below:
A | B | C | D | A | B | C | D
---|---|---|---|---|---|---|---
Red | 36 | 1 | type-1 | Brown | 36 | 9 | type-2
Blue | 78 | 2 | type-1 | Orange | 78 | 5 | type-2
Green | 59 | 3 | type-1 | Purple | 59 | 7 | type-2
Is it possible to do this using pandas or any other Python function?
I have tried using the `pd.merge` function, but I needed to change the column names. There is another function called `pd.concat`, but can I pass a column name (column 'B') to it for merging?
Thanks a lot in advance!
You can pass the `B` columns from both DataFrames to the `left_on` and `right_on` parameters of `DataFrame.merge`. This creates a helper column `key_0`, which is dropped after the join:
Note: pandas has problems with duplicated column names, which is why `merge` renames them with the suffixes `_x` and `_y`.
df = df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
print (df)
A_x B_x C_x D_x A_y B_y C_y D_y
0 Red 36 1 type-1 Brown 36 9 type-2
1 Blue 78 2 type-1 Orange 78 5 type-2
2 Green 59 3 type-1 Purple 59 7 type-2
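For anyone who wants to reproduce the result above, here is a minimal self-contained sketch. The frames `df1` and `df2` are rebuilt from the tables in the question; the variable name `merged` is only illustrative.

```python
import pandas as pd

# Rebuild the two frames from the question.
df1 = pd.DataFrame({
    "A": ["Red", "Blue", "Green"],
    "B": [36, 78, 59],
    "C": [1, 2, 3],
    "D": ["type-1"] * 3,
})
df2 = pd.DataFrame({
    "A": ["Orange", "Purple", "Brown"],
    "B": [78, 59, 36],
    "C": [5, 7, 9],
    "D": ["type-2"] * 3,
})

# Passing the key columns as Series (not as names) keeps both B columns
# and makes pandas add a helper column called key_0, which is dropped.
merged = df1.merge(df2, left_on=df1["B"], right_on=df2["B"])
merged = merged.drop(columns="key_0")

# Overlapping names get the default suffixes:
# A_x, B_x, C_x, D_x, A_y, B_y, C_y, D_y
print(merged.columns.tolist())
print(merged)
```

The default `how="inner"` keeps the row order of the left frame, so the rows follow `df1`.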
What is the problem with duplicate column names?
If you select the first `A` column (here `A_x`), the expected output is a Series:
print (df.A_x)
0 Red
1 Blue
2 Green
Name: A_x, dtype: object
But with duplicated names, selecting `A` returns every matching column as a DataFrame, so DON'T DO IT:
df = df.rename(columns=lambda x: x.split('_')[0])
# print (df)
print (df.A)
A A
0 Red Brown
1 Blue Orange
2 Green Purple
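If you do end up with duplicated names, the sketch below shows one way to detect them and still pick a single column by position. The small frame `dup` is made up for illustration and is not the full data from the question.

```python
import pandas as pd

# A small frame with deliberately duplicated column names.
dup = pd.DataFrame(
    [["Red", 36, "Brown", 36], ["Blue", 78, "Orange", 78]],
    columns=["A", "B", "A", "B"],
)

# Detect the duplication before relying on label-based selection.
print(dup.columns.is_unique)     # False
print(dup.columns.duplicated())  # marks the second A and the second B

# Label selection returns a DataFrame holding both "A" columns ...
both_a = dup["A"]
print(type(both_a).__name__)     # DataFrame

# ... so a single column has to be picked by position instead.
second_a = dup.iloc[:, 2]
print(second_a.tolist())         # ['Brown', 'Orange']
```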
Apply `rename` to jezrael's answer and you will get the desired output:
out = (df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
.rename(columns=lambda x: x.split('_')[0]))
out
A B C D A B C D
0 Red 36 1 type-1 Brown 36 9 type-2
1 Blue 78 2 type-1 Orange 78 5 type-2
2 Green 59 3 type-1 Purple 59 7 type-2
It's really not a good idea to have duplicated column names, but we can use a MultiIndex, which to me makes more sense:
# initial column names after join
Index(['A_x', 'B_x', 'C_x', 'D_x', 'A_y', 'B_y', 'C_y', 'D_y'], dtype='object')
# convert to multiindex
d = df.columns.groupby(df.columns.str.extract('_(.+)')[0])
df.columns = pd.MultiIndex.from_tuples([(k,c.split('_')[0]) for k,v in d.items() for c in v])
# the result
x y
A B C D A B C D
0 Red 36 1 type-1 Brown 36 9 type-2
1 Blue 78 2 type-1 Orange 78 5 type-2
2 Green 59 3 type-1 Purple 59 7 type-2
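The question also asks whether `pd.concat` can take column `B` as a key. As far as I know it cannot, because `concat` aligns on the index rather than on a column. One possible sketch is to move `B` into the index first and let `keys=` build the same two-level header. It assumes that the values in `B` are unique in both frames and that every value in `df1` has a match in `df2`; the names `left` and `right` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["Red", "Blue", "Green"],
    "B": [36, 78, 59],
    "C": [1, 2, 3],
    "D": ["type-1"] * 3,
})
df2 = pd.DataFrame({
    "A": ["Orange", "Purple", "Brown"],
    "B": [78, 59, 36],
    "C": [5, 7, 9],
    "D": ["type-2"] * 3,
})

# Use B as the index but keep it as a regular column too (drop=False).
left = df1.set_index("B", drop=False)
# Reorder df2 so its rows follow the B order of df1 (needs unique B values).
right = df2.set_index("B", drop=False).reindex(left.index)

# concat aligns on the index; keys= adds the outer column level x / y.
out = pd.concat([left, right], axis=1, keys=["x", "y"]).reset_index(drop=True)
print(out)
```

Each half can then be selected on its own, for example `out["x"]` or `out[("y", "A")]`, without running into duplicated flat names.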