
Join dataframes and rename resulting columns with same names

Shortened example:

vals1 = [(1, "a"), 
        (2, "b"), 
      ]
columns1 = ["id","name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)

vals2 = [(1, "k"), 
      ]
columns2 = ["id","name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)

df1 = df1.alias('df1').join(df2.alias('df2'), 'id', 'full')
df1.show()

The result has one column named `id` and two columns named `name`. How do I rename the columns with duplicate names, assuming that the real dataframes have tens of such columns?

You can rename the columns before the join, except for the columns required for the join:

import pyspark.sql.functions as F

def add_prefix(df, prefix, exclude=()):
    # Alias every column not in `exclude` with the given prefix
    return df.select(*[F.col(c).alias(prefix + c if c not in exclude else c) for c in df.columns])

def add_suffix(df, suffix, exclude=()):
    # Alias every column not in `exclude` with the given suffix
    return df.select(*[F.col(c).alias(c + suffix if c not in exclude else c) for c in df.columns])

join_cols = ['id']
df1 = add_prefix(df1, 'x_', join_cols)
df2 = add_suffix(df2, '_y', join_cols)
df3 = df1.join(df2, join_cols, 'full')
df3.show()
+---+------+------+
| id|x_name|name_y|
+---+------+------+
|  1|     a|     k|
|  2|     b|  null|
+---+------+------+
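The renaming rule itself is independent of Spark, so it can be checked in plain Python. Below is a minimal sketch of the same prefix/suffix logic that `add_prefix`/`add_suffix` apply to `df.columns` (the helper name `affixed` is hypothetical, used only for illustration):

```python
def affixed(cols, affix, exclude=(), prefix=True):
    # Mirrors add_prefix/add_suffix: affix every column name not in `exclude`
    return [(affix + c if prefix else c + affix) if c not in exclude else c
            for c in cols]

print(affixed(["id", "name"], "x_", exclude=["id"]))                # ['id', 'x_name']
print(affixed(["id", "name"], "_y", exclude=["id"], prefix=False))  # ['id', 'name_y']
```

The output matches the column names in the joined result above: `id` is left untouched because it is the join key, while both `name` columns become unambiguous.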

@quaziqarta proposed a method to rename columns before the join; note that you can also rename them after the join:

import pyspark.sql.functions as F

join_column = 'id'
df3 = df1.alias('df1').join(df2.alias('df2'), join_column, 'full') \
         .select(
             [join_column] +
             [F.col('df1.' + c).alias(c + "_1") for c in df1.columns if c != join_column] +
             [F.col('df2.' + c).alias(c + "_2") for c in df2.columns if c != join_column]
         )
df3.show()

+---+------+------+
| id|name_1|name_2|
+---+------+------+
|  1|     a|     k|
|  2|     b|  null|
+---+------+------+

You only need to alias the dataframes (as you did in your example) so that you can tell Spark which of the two you are referring to when you ask it for the column "name".

Another method to rename only the intersecting columns

from typing import List

from pyspark.sql import DataFrame


def join_intersect(df_left: DataFrame, df_right: DataFrame, join_cols: List[str], how: str = 'inner'):
    # Use the function arguments, not the global df1/df2
    intersected_cols = set(df_left.columns).intersection(df_right.columns)
    cols_to_rename = [c for c in intersected_cols if c not in join_cols]

    for c in cols_to_rename:
        df_left = df_left.withColumnRenamed(c, f"{c}__1")
        df_right = df_right.withColumnRenamed(c, f"{c}__2")

    return df_left.join(df_right, on=join_cols, how=how)


vals1 = [(1, "a"), (2, "b")]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)
vals2 = [(1, "k")]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)

df_joined = join_intersect(df1, df2, ['name'])
df_joined.show()
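The core of `join_intersect` is a set computation over the two column lists; it can be sketched without Spark (the helper name `shared_non_join_cols` is hypothetical):

```python
def shared_non_join_cols(left_cols, right_cols, join_cols):
    # Columns present in both frames that are NOT join keys -- these collide
    return sorted(set(left_cols) & set(right_cols) - set(join_cols))

print(shared_non_join_cols(["id", "name"], ["id", "name"], ["name"]))  # ['id']
```

In the example above, joining on `name` means `id` is the only colliding column, so it is the one that receives the `__1`/`__2` suffixes.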

You can just use a for loop to rename every column except the join column in the second dataframe:

vals1 = [(1, "a"),
         (2, "b"),
         ]
columns1 = ["id", "name"]
df1 = spark.createDataFrame(data=vals1, schema=columns1)

vals2 = [(1, "k"),
         ]
columns2 = ["id", "name"]
df2 = spark.createDataFrame(data=vals2, schema=columns2)
for c in df2.columns:
    if c != 'id':
        df2 = df2.withColumnRenamed(c, c + '_1')
df1 = df1.alias('df1').join(df2.alias('df2'), 'id', 'full')
df1.show()
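The loop's effect on the column names can be expressed as a plain-Python rename map, which makes it easy to verify before touching the dataframe (the helper name `rename_map` is hypothetical):

```python
def rename_map(cols, join_col, suffix='_1'):
    # Map each non-join column to its suffixed replacement
    return {c: c + suffix for c in cols if c != join_col}

print(rename_map(["id", "name"], "id"))  # {'name': 'name_1'}
```

Only the entries of this map are renamed, so the join column keeps its original name and the join on `'id'` still resolves without ambiguity.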
