How to join multiple dataframes (reduce function), rename columns to original data frame names?
Join dataframes and rename resulting columns with same names
A shortened example:

vals1 = [(1, "a"),
         (2, "b"),
        ]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)

vals2 = [(1, "k"),
        ]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)

df1 = df1.alias('df1').join(df2.alias('df2'), 'id', 'full')
df1.show()

The result has one column named id and two columns named name. Given that the real dataframes have dozens of such columns, how can the columns with duplicate names be renamed?
You can rename the columns before joining, except for the columns you join on:
import pyspark.sql.functions as F

def add_prefix(df, prefix, exclude=[]):
    return df.select(*[F.col(c).alias(prefix + c if c not in exclude else c)
                       for c in df.columns])

def add_suffix(df, suffix, exclude=[]):
    return df.select(*[F.col(c).alias(c + suffix if c not in exclude else c)
                       for c in df.columns])
join_cols = ['id']
df1 = add_prefix(df1, 'x_', join_cols)
df2 = add_suffix(df2, '_y', join_cols)
df3 = df1.join(df2, join_cols, 'full')
df3.show()
+---+------+------+
| id|x_name|name_y|
+---+------+------+
| 1| a| k|
| 2| b| null|
+---+------+------+
@quaziqarta showed a way to rename the columns before the join; note that you can also rename them after the join:
import pyspark.sql.functions as F

join_column = 'id'
df1.alias('df1').join(df2.alias('df2'), join_column, 'full') \
    .select(
        [join_column] +
        [F.col('df1.' + c).alias(c + "_1") for c in df1.columns if c != join_column] +
        [F.col('df2.' + c).alias(c + "_2") for c in df2.columns if c != join_column]
    ) \
    .show()
+---+------+------+
| id|name_1|name_2|
+---+------+------+
| 1| a| k|
| 2| b| null|
+---+------+------+
You only need to alias the dataframes (as you already did in your example) so that, when you ask Spark for the column "name", you can specify which dataframe's column you mean.
Another approach that renames only the intersecting columns:
from typing import List
from pyspark.sql import DataFrame

def join_intersect(df_left: DataFrame, df_right: DataFrame,
                   join_cols: List[str], how: str = 'inner'):
    # use the columns of the arguments, not the global df1/df2
    intersected_cols = set(df_left.columns).intersection(set(df_right.columns))
    cols_to_rename = [c for c in intersected_cols if c not in join_cols]
    for c in cols_to_rename:
        df_left = df_left.withColumnRenamed(c, f"{c}__1")
        df_right = df_right.withColumnRenamed(c, f"{c}__2")
    return df_left.join(df_right, on=join_cols, how=how)
vals1 = [(1, "a"), (2, "b")]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)
vals2 = [(1, "k")]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)
df_joined = join_intersect(df1, df2, ['name'])
df_joined.show()
You can simply use a for loop to rename the columns in the second dataframe, except for the join column:
vals1 = [(1, "a"),
         (2, "b"),
        ]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)

vals2 = [(1, "k"),
        ]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)

for i in df2.columns:
    if i != 'id':
        df2 = df2.withColumnRenamed(i, i + '_1')

df1 = df1.alias('df1').join(df2.alias('df2'), 'id', 'full')
df1.show()