How to join multiple dataframes (reduce function), rename columns to original data frame names?
Join dataframes and rename resulting columns with same names
A shortened example:

vals1 = [(1, "a"),
         (2, "b"),
        ]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)

vals2 = [(1, "k"),
        ]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)

df1 = df1.alias('df1').join(df2.alias('df2'), 'id', 'full')
df1.show()

The result has one column named id and two columns named name. Given that the real dataframes have dozens of such columns, how can the columns with duplicate names be renamed?
You can rename the columns before joining, except for the columns you join on:
import pyspark.sql.functions as F

def add_prefix(df, prefix, exclude=[]):
    return df.select(*[F.col(c).alias(prefix + c if c not in exclude else c)
                       for c in df.columns])

def add_suffix(df, suffix, exclude=[]):
    return df.select(*[F.col(c).alias(c + suffix if c not in exclude else c)
                       for c in df.columns])
join_cols = ['id']
df1 = add_prefix(df1, 'x_', join_cols)
df2 = add_suffix(df2, '_y', join_cols)
df3 = df1.join(df2, join_cols, 'full')
df3.show()
+---+------+------+
| id|x_name|name_y|
+---+------+------+
| 1| a| k|
| 2| b| null|
+---+------+------+
@quaziqarta showed a way to rename the columns before the join; note that you can also rename them after the join:
import pyspark.sql.functions as F

join_column = 'id'
df1.alias('df1').join(df2.alias('df2'), join_column, 'full') \
    .select(
        [join_column] +
        [F.col('df1.' + c).alias(c + "_1") for c in df1.columns if c != join_column] +
        [F.col('df2.' + c).alias(c + "_2") for c in df2.columns if c != join_column]
    ) \
    .show()
+---+------+------+
| id|name_1|name_2|
+---+------+------+
| 1| a| k|
| 2| b| null|
+---+------+------+
You only need to alias the dataframes (as you already did in your example) so that, when you ask Spark for the column "name", you can specify which dataframe's column you mean.
Another approach that renames only the intersecting columns:
from typing import List
from pyspark.sql import DataFrame

def join_intersect(df_left: DataFrame, df_right: DataFrame,
                   join_cols: List[str], how: str = 'inner'):
    # use the columns of the arguments, not the global df1/df2
    intersected_cols = set(df_left.columns).intersection(set(df_right.columns))
    cols_to_rename = [c for c in intersected_cols if c not in join_cols]
    for c in cols_to_rename:
        df_left = df_left.withColumnRenamed(c, f"{c}__1")
        df_right = df_right.withColumnRenamed(c, f"{c}__2")
    return df_left.join(df_right, on=join_cols, how=how)
vals1 = [(1, "a"), (2, "b")]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)
vals2 = [(1, "k")]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)
df_joined = join_intersect(df1, df2, ['name'])
df_joined.show()
You can simply use a for loop to rename the columns in the second dataframe, except for the join column:
vals1 = [(1, "a"),
         (2, "b"),
        ]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)

vals2 = [(1, "k"),
        ]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)

for i in df2.columns:
    if i != 'id':
        df2 = df2.withColumnRenamed(i, i + '_1')

df1 = df1.alias('df1').join(df2.alias('df2'), 'id', 'full')
df1.show()