Join three DF - Scala Spark
I am joining three DataFrames, and I have a problem: when I do the join, I get the same column twice.
I have this DF to join with df_c2 and df_c1:
DF1
+-------------+----------+-------------+-------------+
| Date        | NumToJ   | Type        | Sport       |
+-------------+----------+-------------+-------------+
| 31/11/2020  | 8211     | N2          | football    |
| 11/01/2020  | 0192     | L2          | tennis      |
+-------------+----------+-------------+-------------+
If Type is N2, I join DF1 with df_c1, where NumToJ must be equal to Number.
df_c1
+---------+-------------+
| Number  | Description |
+---------+-------------+
| 1131    | DATAquality |
| 9103    | DataToRevise|
| 0192    | NoData      |
+---------+-------------+
If Type is L2, I join DF1 with df_c2, where NumToJ must be equal to Number.
df_c2
+---------+-------------+
| Number  | Description |
+---------+-------------+
| 8211    | ReviseAll   |
| 2111    | CancelOperat|
| 9199    | NoData      |
+---------+-------------+
In the end I want this:
final_df
+-------------+----------+-------------+-------------+-------------+
| Date        | NumToJ   | Type        | Sport       | Description |
+-------------+----------+-------------+-------------+-------------+
| 31/11/2020  | 8211     | N2          | football    | ReviseAll   |
| 11/01/2020  | 0192     | L2          | tennis      | NoData      |
+-------------+----------+-------------+-------------+-------------+
But instead I get this:
+-------------+----------+-------------+-------------+-------------+-------------+
| Date        | NumToJ   | Type        | Sport       | Description | Description |
+-------------+----------+-------------+-------------+-------------+-------------+
| 31/11/2020  | 8211     | N2          | football    | ReviseAll   | Null        |
| 11/01/2020  | 0192     | L2          | tennis      | Null        | NoData      |
+-------------+----------+-------------+-------------+-------------+-------------+
How can I get the Description in a single column instead of two?
My code:
DF1.join(df_c1, col("Type").equalTo("N2") && col("NumToJ") === df_c1("Number"), "left")
  .join(df_c2, col("Type").equalTo("L2") && col("NumToJ") === df_c2("Number"), "left")
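Each of these left joins appends the Description column from its own lookup table, so the joined result carries that name twice, with null on the side that did not match. A minimal illustration, assuming the result of the two joins above is bound to a hypothetical joined value:
import org.apache.spark.sql.functions.col

// Hypothetical binding, for illustration only: the two-join result from above
val joined = DF1
  .join(df_c1, col("Type").equalTo("N2") && col("NumToJ") === df_c1("Number"), "left")
  .join(df_c2, col("Type").equalTo("L2") && col("NumToJ") === df_c2("Number"), "left")

joined.printSchema()
// root
//  |-- Date: string (nullable = true)
//  |-- NumToJ: string (nullable = true)
//  |-- Type: string (nullable = true)
//  |-- Sport: string (nullable = true)
//  |-- Description: string (nullable = true)   <-- from df_c1
//  |-- Description: string (nullable = true)   <-- from df_c2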
You can split the DataFrame in two, join each part with df_cond1 and df_cond2 respectively, and union the results to get all the data:
import spark.implicits._

val df = Seq(
  ("31/11/2020", "8211", "N2", "football"),
  ("1/01/2020", "0192", "L2", "tennis")
).toDF("Date", "NumToJ", "Type", "Sport")

val df_cond1 = Seq(
  ("1131", "DATAquality"),
  ("9103", "DataToRevise"),
  ("0192", "NoData")
).toDF("Number", "Description")

val df_cond2 = Seq(
  ("8211", "ReviseAll"),
  ("2111", "CancelOperat"),
  ("9199", "NoData")
).toDF("Number", "Description")

// Split the source by Type: df1 keeps the L2 rows, df2 keeps everything else
val df1 = df.filter($"Type" === "L2")
val df2 = df.except(df1)

// Join each part against its own lookup table, then union the two results
df1.join(df_cond1, $"NumToJ" === $"Number")
  .union(df2.join(df_cond2, $"NumToJ" === $"Number"))
  .show(false)
Output:
+----------+------+----+--------+------+-----------+
|Date |NumToJ|Type|Sport |Number|Description|
+----------+------+----+--------+------+-----------+
|1/01/2020 |0192 |L2 |tennis |0192 |NoData |
|31/11/2020|8211 |N2 |football|8211 |ReviseAll |
+----------+------+----+--------+------+-----------+
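If you do not want the redundant Number column in the result (the desired final_df above omits it, and NumToJ already carries the same value), you can drop it after the union; a minimal sketch reusing the values defined above:
df1.join(df_cond1, $"NumToJ" === $"Number")
  .union(df2.join(df_cond2, $"NumToJ" === $"Number"))
  .drop("Number") // NumToJ duplicates this value after the join
  .show(false)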
You can combine all possible matches in one dataset by joining df with both df_c1 and df_c2 after renaming their Description columns, and then build a single new Description column with a conditional:
import org.apache.spark.sql.functions.when
import spark.implicits._

// Rename the lookup columns up front so the two joins do not both produce "Description"
val c1 = df_c1.withColumnRenamed("Description", "D1")
val c2 = df_c2.withColumnRenamed("Description", "D2")

df.join(c1, c1("Number") === df("NumToJ"), "left") // qualify Number: both lookups have it
  .join(c2, c2("Number") === df("NumToJ"), "left")
  .select('Date,
    'NumToJ,
    'Type,
    'Sport,
    // pick the description that matches the row's Type
    when('Type === "N2", 'D1).when('Type === "L2", 'D2).otherwise(null)
      .as("Description"))
Both joins are left joins, so rows that have no match in df_c1 or df_c2 respectively are not excluded from the result.
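Since in this data a NumToJ value only ever matches one of the two lookup tables, coalesce would work in place of the when chain; it ignores Type entirely, so it is only equivalent under that assumption. A sketch reusing c1 and c2 from above:
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

// Assumes a NumToJ value never appears in both lookup tables,
// so at most one of D1/D2 is non-null on any row
df.join(c1, c1("Number") === df("NumToJ"), "left")
  .join(c2, c2("Number") === df("NumToJ"), "left")
  .select('Date, 'NumToJ, 'Type, 'Sport,
    coalesce('D1, 'D2).as("Description"))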