How to horizontally merge two data frames with the same column names based on matching values in one column

Question

I have two data frames as shown below:

A	B	C	D
Red	36	1	type-1
Blue	78	2	type-1
Green	59	3	type-1

A	B	C	D
Orange	78	5	type-2
Purple	59	7	type-2
Brown	36	9	type-2

I want to merge the two data frames above on column B, and after the merge I want to keep the same column names, as shown below:

A	B	C	D	A	B	C	D
Red	36	1	type-1	Brown	36	9	type-2
Blue	78	2	type-1	Orange	78	5	type-2
Green	59	3	type-1	Purple	59	7	type-2

Is it possible to do this using pandas or any other Python function?

I have tried using the pd.merge function, but I needed to change the column names. There is another function called pd.concat, but can I pass it the column name (column 'B') to merge on?

Thanks a lot in advance!

Answer 1

You can pass the B columns of both DataFrames to the left_on and right_on parameters of DataFrame.merge. The join then creates a helper column key_0, which is dropped afterwards:

Note: pandas has problems with duplicated column names, which is why merge renames them with the suffixes _x and _y.

df = df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
print(df)
     A_x  B_x  C_x     D_x     A_y  B_y  C_y     D_y
0    Red   36    1  type-1   Brown   36    9  type-2
1   Blue   78    2  type-1  Orange   78    5  type-2
2  Green   59    3  type-1  Purple   59    7  type-2
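
To reproduce this, here is a minimal self-contained sketch. The frames df1 and df2 are rebuilt from the tables in the question (how the original data was constructed is an assumption); the merge line is the one shown above.

```python
import pandas as pd

# Sample frames rebuilt from the question's tables (assumed construction).
df1 = pd.DataFrame({
    "A": ["Red", "Blue", "Green"],
    "B": [36, 78, 59],
    "C": [1, 2, 3],
    "D": ["type-1"] * 3,
})
df2 = pd.DataFrame({
    "A": ["Orange", "Purple", "Brown"],
    "B": [78, 59, 36],
    "C": [5, 7, 9],
    "D": ["type-2"] * 3,
})

# Passing the B columns as Series (not as the label 'B') keeps both B columns.
# pandas adds a helper column key_0 for the join key, which is dropped here.
df = df1.merge(df2, left_on=df1.B, right_on=df2.B).drop("key_0", axis=1)
print(df)
```

The default inner join keeps the row order of the left frame, so the rows follow df1.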

What is the problem with identical column names?

With unique names, selecting the first A column (A_x) returns a Series, as expected:

print(df.A_x)
0      Red
1     Blue
2    Green
Name: A_x, dtype: object

But with duplicated names, the same selection returns every matching column as a DataFrame, so DON'T DO IT:

df = df.rename(columns=lambda x: x.split('_')[0])
# print (df)

print(df.A)
       A       A
0    Red   Brown
1   Blue  Orange
2  Green  Purple
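
A small sketch of the same pitfall in isolation. The frame dup is a hypothetical two-row example, not data from the question. It shows that a label lookup on a repeated name returns a DataFrame, while a positional lookup with iloc still returns a single Series.

```python
import pandas as pd

# Hypothetical two-row frame with a repeated column label 'A'.
dup = pd.DataFrame([["Red", "Brown"], ["Blue", "Orange"]], columns=["A", "A"])

by_label = dup["A"]           # label lookup matches both columns -> DataFrame
by_position = dup.iloc[:, 0]  # positional lookup picks exactly one -> Series

print(type(by_label).__name__)
print(type(by_position).__name__)
```

Code that expects a Series from df.A will therefore behave differently once names are duplicated, which is the reason for the warning above.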

Answer 2

Apply rename to jezrael's answer and you will get the desired output:

out = (df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
       .rename(columns=lambda x: x.split('_')[0]))

out

       A   B  C       D       A   B  C       D
0    Red  36  1  type-1   Brown  36  9  type-2
1   Blue  78  2  type-1  Orange  78  5  type-2
2  Green  59  3  type-1  Purple  59  7  type-2
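
The question also asks about pd.concat. It has no parameter for naming a key column, since it aligns on the index. One possible alternative, which is not taken from the answers here, is to move B into the index first and concatenate along axis=1. This sketch assumes that B values are unique within each frame and present in both; the frames are rebuilt from the question's tables.

```python
import pandas as pd

# Sample frames rebuilt from the question's tables (assumed construction).
df1 = pd.DataFrame({"A": ["Red", "Blue", "Green"], "B": [36, 78, 59],
                    "C": [1, 2, 3], "D": ["type-1"] * 3})
df2 = pd.DataFrame({"A": ["Orange", "Purple", "Brown"], "B": [78, 59, 36],
                    "C": [5, 7, 9], "D": ["type-2"] * 3})

# Use B as the index (keeping it as a column too) so rows can be aligned on it.
left = df1.set_index("B", drop=False)
# Reorder df2's rows to follow df1's B order; assumes B is unique in both frames.
right = df2.set_index("B", drop=False).reindex(left.index)

# Concatenate side by side, then discard the temporary B index.
out = pd.concat([left, right], axis=1).reset_index(drop=True)
print(out)
```

The result has the duplicated column names A, B, C, D twice, so the caveat about selecting columns by label applies here as well.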

Answer 3

It's really not a good idea to have duplicated column names. We can use a MultiIndex instead, which makes more sense to me:

# initial column names after join
Index(['A_x', 'B_x', 'C_x', 'D_x', 'A_y', 'B_y', 'C_y', 'D_y'], dtype='object')

# convert to multiindex
d = df.columns.groupby(df.columns.str.extract('_(.+)')[0])
df.columns = pd.MultiIndex.from_tuples([(k,c.split('_')[0]) for k,v in d.items() for c in v])

# the result
       x                      y               
       A   B  C       D       A   B  C       D
0    Red  36  1  type-1   Brown  36  9  type-2
1   Blue  78  2  type-1  Orange  78  5  type-2
2  Green  59  3  type-1  Purple  59  7  type-2
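
For reference, a self-contained sketch of the same idea. It builds the tuples with rsplit instead of the groupby shown above (a swapped-in simplification, not the answer's exact code) and assumes the merged columns carry the default _x and _y suffixes. The frames are rebuilt from the question's tables.

```python
import pandas as pd

# Sample frames rebuilt from the question's tables (assumed construction).
df1 = pd.DataFrame({"A": ["Red", "Blue", "Green"], "B": [36, 78, 59],
                    "C": [1, 2, 3], "D": ["type-1"] * 3})
df2 = pd.DataFrame({"A": ["Orange", "Purple", "Brown"], "B": [78, 59, 36],
                    "C": [5, 7, 9], "D": ["type-2"] * 3})

df = df1.merge(df2, left_on=df1.B, right_on=df2.B).drop("key_0", axis=1)

# Split each name such as 'A_x' at its last underscore and put the suffix first,
# giving tuples like ('x', 'A') for the two column levels.
df.columns = pd.MultiIndex.from_tuples(
    [tuple(reversed(name.rsplit("_", 1))) for name in df.columns]
)

print(df["x"])         # the four columns that came from df1
print(df[("y", "A")])  # a single column, returned as a Series
```

With two levels every column has a unique key, so df["x"] returns the left-hand block and df[("y", "A")] returns exactly one Series.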

Answer 1
1 2022-12-20 11:04:23

Answer 2
1 2022-12-20 11:24:59

Answer 3
0 2022-12-20 12:54:05
