Joining multiple PySpark dataframes on the same column name

Question

I am new to PySpark and I am stuck on the following:

I have 5 dataframes, and each dataframe has the same primary key, called concern_code. I need to outer join all of these dataframes together and drop the 4 duplicate concern_code columns that come from the other 4 dataframes.

For example: Df1 outer joins Df2 on concern_code, Df1 outer joins Df3 on concern_code, and so on.

My current PySpark syntax looks like this:

df1.join(df2,["concern_code"])\
  .join(df3,df1["concern_code"] == df3["concern_code"])\
  .join(df4,df1["concern_code"] == df4["concern_code"])\
  .join(df5,df1["concern_code"] == df5["concern_code"])\
   .show()

How do I fix the syntax to perform an outer join and end up with a new dataframe that has only one concern_code column?

Answer 1

You are close. Let's say you have the following dataframes:

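from pyspark.sql import SparkSession

# Reuse the active session if there is one (e.g. in a notebook or the pyspark shell).
spark = SparkSession.builder.getOrCreate()

# Sample rows: (concern_code, some numeric value)
d = [
    ("a", 5.2),
    ("b", 10.4),
    ("c", 7.8),
    ("d", 11.2),
]

# Five dataframes that share the key column but differ in the value column name.
df1 = spark.createDataFrame(d, ['concern_code', 'value'])
df2 = spark.createDataFrame(d, ['concern_code', 'value1'])
df3 = spark.createDataFrame(d, ['concern_code', 'value2'])
df4 = spark.createDataFrame(d, ['concern_code', 'value3'])
df5 = spark.createDataFrame(d, ['concern_code', 'value4'])

df1.show()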

# output
+------------+-----+
|concern_code|value|
+------------+-----+
|           a|  5.2|
|           b| 10.4|
|           c|  7.8|
|           d| 11.2|
+------------+-----+


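Pass the column name to on and set how="outer" on every join. Joining on the name, rather than on an equality expression such as df1["concern_code"] == df3["concern_code"], keeps a single concern_code column in the result:

(
    df1
    .join(df2, on="concern_code", how="outer")
    .join(df3, on="concern_code", how="outer")
    .join(df4, on="concern_code", how="outer")
    .join(df5, on="concern_code", how="outer")
    .show()
)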

# output
+------------+-----+------+------+------+------+
|concern_code|value|value1|value2|value3|value4|
+------------+-----+------+------+------+------+
|           c|  7.8|   7.8|   7.8|   7.8|   7.8|
|           d| 11.2|  11.2|  11.2|  11.2|  11.2|
|           a|  5.2|   5.2|   5.2|   5.2|   5.2|
|           b| 10.4|  10.4|  10.4|  10.4|  10.4|
+------------+-----+------+------+------+------+

Answer 2

If you join two data frames on a column expression (for example df1["concern_code"] == df3["concern_code"]), the key column from both sides is kept, so it appears twice in the result, as in your case. So I would suggest passing a list of strings, or just a single string such as 'id', as the join key when joining two or more data frames.

The code below should not duplicate the key column:

df1.join(df2, on='id', how='outer')\
   .join(df3, on='id', how='outer')\
   .join(df4, on='id', how='outer')\
   .join(df5, on='id', how='outer')\
   .show()
