Join dataframes and rename resulting columns with same names
Shortened example:
vals1 = [(1, "a"),
         (2, "b"),
         ]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)
vals2 = [(1, "k"),
         ]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)
df1 = df1.alias('df1').join(df2.alias('df2'), 'id', 'full')
df1.show()
The result has one column named id and two columns named name. How do I rename the columns with duplicate names, assuming that the real dataframes have tens of such columns?
You can rename the columns before the join, except for the columns required for the join:
import pyspark.sql.functions as F

def add_prefix(df, prefix, exclude=()):
    # Prefix every column name except those listed in `exclude`.
    return df.select(*[F.col(c).alias(c if c in exclude else prefix + c) for c in df.columns])

def add_suffix(df, suffix, exclude=()):
    # Suffix every column name except those listed in `exclude`.
    return df.select(*[F.col(c).alias(c if c in exclude else c + suffix) for c in df.columns])

join_cols = ['id']
df1 = add_prefix(df1, 'x_', join_cols)
df2 = add_suffix(df2, '_y', join_cols)
# Pass the key list itself so that more than one join column also works.
df3 = df1.join(df2, join_cols, 'full')
df3.show()
+---+------+------+
| id|x_name|name_y|
+---+------+------+
|  1|     a|     k|
|  2|     b|  null|
+---+------+------+
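Because the renaming depends only on the column names, it can be checked on plain Python lists before touching Spark. The sketch below is an illustration under assumptions, not part of the answer above: `renamed_columns` is a hypothetical helper that builds the new names, which could then be applied by position with `df.toDF(*new_names)`.

```python
from typing import Iterable, List


def renamed_columns(columns: List[str], prefix: str = "", suffix: str = "",
                    exclude: Iterable[str] = ()) -> List[str]:
    """Return new names for `columns`, leaving names in `exclude` untouched."""
    keep = set(exclude)
    return [c if c in keep else f"{prefix}{c}{suffix}" for c in columns]


# Stand-ins for df1.columns and df2.columns from the example above.
left_names = renamed_columns(["id", "name"], prefix="x_", exclude=["id"])
right_names = renamed_columns(["id", "name"], suffix="_y", exclude=["id"])
print(left_names)   # ['id', 'x_name']
print(right_names)  # ['id', 'name_y']

# With a Spark session available, the names would be applied with, e.g.:
#   df1 = df1.toDF(*left_names)
#   df2 = df2.toDF(*right_names)
```

Keeping the name logic separate from Spark makes it easy to verify the result for dataframes with tens of columns.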
@quaziqarta proposed a method to rename the columns before the join. Note that you can also rename them after the join:
join_column = 'id'
# Keep the joined dataframe in its own variable; show() returns None.
df3 = df1.join(df2, join_column, 'full') \
    .select(
        [join_column] +
        [df1.alias('df1')['df1.'+c].alias(c+"_1") for c in df1.columns if c != join_column] +
        [df2.alias('df2')['df2.'+c].alias(c+"_2") for c in df2.columns if c != join_column]
    )
df3.show()
+---+------+------+
| id|name_1|name_2|
+---+------+------+
|  1|     a|     k|
|  2|     b|  null|
+---+------+------+
You only need to alias the dataframes (as you did in your example) so that you can tell Spark which of the two "name" columns you are referring to.
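If the join has already produced duplicate names, the columns can also be renamed by position, since `toDF` takes the full list of new names in order. The sketch below only computes that list. `dedupe_names` is a hypothetical helper, and the numbering scheme (`_1`, `_2`, ... for every repeated name) is an assumption chosen to match the output above.

```python
from collections import Counter
from typing import List


def dedupe_names(columns: List[str]) -> List[str]:
    """Append _1, _2, ... to every name that occurs more than once."""
    totals = Counter(columns)   # how often each name occurs overall
    seen = Counter()            # how many occurrences we have passed so far
    result = []
    for c in columns:
        if totals[c] > 1:
            seen[c] += 1
            result.append(f"{c}_{seen[c]}")
        else:
            result.append(c)
    return result


# Stand-in for the columns of the joined dataframe from the question.
joined_columns = ["id", "name", "name"]
new_names = dedupe_names(joined_columns)
print(new_names)  # ['id', 'name_1', 'name_2']

# With a Spark session available, the names would be applied with, e.g.:
#   df_joined = df_joined.toDF(*new_names)
```

This sketch does not check whether a generated name such as `name_1` collides with a column that already exists, so that case would need separate handling.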
Another method, which renames only the intersecting columns:
from typing import List
from pyspark.sql import DataFrame

def join_intersect(df_left: DataFrame, df_right: DataFrame, join_cols: List[str], how: str = 'inner'):
    # Use the function arguments here, not the global df1/df2.
    intersected_cols = set(df_left.columns).intersection(set(df_right.columns))
    cols_to_rename = [c for c in intersected_cols if c not in join_cols]
    for c in cols_to_rename:
        df_left = df_left.withColumnRenamed(c, f"{c}__1")
        df_right = df_right.withColumnRenamed(c, f"{c}__2")
    return df_left.join(df_right, on=join_cols, how=how)

vals1 = [(1, "a"), (2, "b")]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)
vals2 = [(1, "k")]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)
# Join on the key column; the shared non-key column `name` becomes name__1 / name__2.
df_joined = join_intersect(df1, df2, ['id'])
df_joined.show()
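A Python `set` does not preserve order, so the loop above may process the shared columns in a different sequence from run to run (the final names are the same either way). If a stable order is preferred, the rename mappings can be built from the left dataframe's column list instead. A minimal sketch, with `intersect_renames` as a hypothetical helper and the `__1`/`__2` suffixes taken from the function above:

```python
from typing import Dict, List, Sequence, Tuple


def intersect_renames(left_cols: List[str], right_cols: List[str],
                      join_cols: Sequence[str],
                      suffixes: Tuple[str, str] = ("__1", "__2")
                      ) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Build old-name -> new-name mappings for columns shared by both sides."""
    right = set(right_cols)
    keys = set(join_cols)
    # Iterate over the left column list so the order is deterministic.
    shared = [c for c in left_cols if c in right and c not in keys]
    left_map = {c: c + suffixes[0] for c in shared}
    right_map = {c: c + suffixes[1] for c in shared}
    return left_map, right_map


# Stand-ins for two dataframes' column lists; `city` and `age` are made up.
left_map, right_map = intersect_renames(
    ["id", "name", "city"], ["id", "name", "age"], ["id"])
print(left_map)   # {'name': 'name__1'}
print(right_map)  # {'name': 'name__2'}
```

Each mapping can then be applied with a loop of `withColumnRenamed` calls, or in a single call with `DataFrame.withColumnsRenamed(mapping)` on Spark 3.4 or later.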
You can just use a for loop to rename every column except the join column in the second dataframe:
vals1 = [(1, "a"),
         (2, "b"),
         ]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)
vals2 = [(1, "k"),
         ]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)
# Rename every column of df2 except the join key.
for i in df2.columns:
    if i != 'id':
        df2 = df2.withColumnRenamed(i, i + '_1')
df1 = df1.alias('df1').join(df2.alias('df2'), 'id', 'full')
df1.show()