
Join three DF - Scala Spark

I am joining three DFs. The problem is that when I do the join, I end up with the same column twice.

I have this DF, which I join with df_c2 and df_c1:

DF1
+-------------+----------+-------------+-------------+
|  Date       | NumToJ   |    Type     |   Sport     |
+-------------+----------+-------------+-------------+
|  31/11/2020 |   1911   |     N2      |  football   | 
|  11/01/2020 |   1891   |     L2      |  tennis     |
+-------------+----------+-------------+-------------+

If Type is N2, I join DF1 with df_c1, where NumToJ must equal Number.

df_c1
+---------+-------------+
|  Number | Description |
+---------+-------------+
|  1131   | DATAquality |
|  9103   | DataToRevise|
|  0192   | NoData      |
+---------+-------------+

If Type is L2, I join DF1 with df_c2, where NumToJ must equal Number.

df_c2
+---------+-------------+
|  Number | Description |
+---------+-------------+
|  8211   | ReviseAll   |
|  2111   | CancelOperat|
|  9199   | NoData      |
+---------+-------------+

In the end I want this:

final_df
+-------------+----------+-------------+-------------+-------------+
|  Date       | NumToJ   |    Type     |   Sport     | Description |
+-------------+----------+-------------+-------------+-------------+
|  31/11/2020 |   8211   |     N2      |  football   | ReviseAll   |
|  11/01/2020 |   0192   |     L2      |  tennis     | NoData      |
+-------------+----------+-------------+-------------+-------------+

But instead I get this:

+-------------+----------+-------------+-------------+-------------+-------------+
|  Date       | NumToJ   |    Type     |   Sport     | Description | Description |
+-------------+----------+-------------+-------------+-------------+-------------+
|  31/11/2020 |   8211   |     N2      |  football   | ReviseAll   | Null        |
|  11/01/2020 |   0192   |     L2      |  tennis     | Null        | NoData      |
+-------------+----------+-------------+-------------+-------------+-------------+

How can I get the Description into a single column instead of two?

My code:

DF1.join(df_c1, col("Type").equalTo("N2") && col("NumToJ") === df_c1("Number"))
.join(df_c2, col("Type").equalTo("L2") && col("NumToJ") === df_c2("Number"))
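Both df_c1 and df_c2 carry a column literally named Description, which is why the joined result shows that name twice. A minimal sketch of the combined result, assuming DF1, df_c1 and df_c2 are the DataFrames shown above and that the joins are left joins (which is what the null values in the result above suggest), addressing each Description copy through its parent DataFrame:

import org.apache.spark.sql.functions.col

// Hypothetical reconstruction of the problematic double join as left joins;
// each copy of "Description" can still be referenced through its parent DataFrame.
val joined = DF1
  .join(df_c1, col("Type") === "N2" && col("NumToJ") === df_c1("Number"), "left")
  .join(df_c2, col("Type") === "L2" && col("NumToJ") === df_c2("Number"), "left")

joined.select(col("Date"), df_c1("Description"), df_c2("Description")).show(false)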

You can split the dataframe in two, join each part with df_cond1 and df_cond2 respectively, and union the results to get all the data:

import spark.implicits._

val df = Seq(
  ("31/11/2020", "8211", "N2", "football"),
  ("1/01/2020", "0192", "L2", "tennis")
).toDF("Date", "NumToJ", "Type", "Sport")

val df_cond1 = Seq(
  ("1131", "DATAquality"),
  ("9103", "DataToRevise"),
  ("0192", "NoData")
).toDF("Number", "Description")

val df_cond2 = Seq(
  ("8211", "ReviseAll"),
  ("2111", "CancelOperat"),
  ("9199", "NoData")
).toDF("Number", "Description")

// Split the input by Type, join each half with its lookup table, then union the two results
val df1 = df.filter($"Type" === "L2")
val df2 = df.except(df1)

df1.join(df_cond1, $"NumToJ" === $"Number")
  .union(df2.join(df_cond2, $"NumToJ" === $"Number"))
  .show(false)

Output:

+----------+------+----+--------+------+-----------+
|Date      |NumToJ|Type|Sport   |Number|Description|
+----------+------+----+--------+------+-----------+
|1/01/2020 |0192  |L2  |tennis  |0192  |NoData     |
|31/11/2020|8211  |N2  |football|8211  |ReviseAll  |
+----------+------+----+--------+------+-----------+
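If the extra Number column pulled in from the lookup tables is not wanted in the final result, it can be dropped after the union; a small follow-up to the snippet above:

// Same split-and-union as above, dropping the join key copied in from the lookup tables
df1.join(df_cond1, $"NumToJ" === $"Number")
  .union(df2.join(df_cond2, $"NumToJ" === $"Number"))
  .drop("Number")
  .show(false)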

You can combine all possible matches in a single dataset by renaming the Description columns of df_c1 and df_c2, and then build a new Description column using a conditional:

import spark.implicits._
import org.apache.spark.sql.functions.when

// Rename the Description columns up front; referencing the lookup tables'
// "Number" column through c1/c2 keeps the second join condition unambiguous,
// since the intermediate result already carries a "Number" column after the first join.
val c1 = df_c1.withColumnRenamed("Description", "D1")
val c2 = df_c2.withColumnRenamed("Description", "D2")

df.join(c1, c1("Number") === 'NumToJ, "left")
  .join(c2, c2("Number") === 'NumToJ, "left")
  .select('Date,
          'NumToJ,
          'Type,
          'Sport,
          when('Type === "N2", 'D1).when('Type === "L2", 'D2).otherwise(null)
            .as("Description"))

Both joins are left joins, so rows that have no match in df_c1 or df_c2 respectively are not excluded from the result.
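An alternative to the when chain, assuming a given NumToJ value never appears in both df_c1 and df_c2, is coalesce, which simply keeps whichever of the two renamed columns is non-null; a sketch building on the snippet above:

import org.apache.spark.sql.functions.coalesce

// Same left joins as above, but merging the two renamed Description columns with
// coalesce instead of a Type-based when chain (assumes no NumToJ matches both tables)
df.join(c1, c1("Number") === 'NumToJ, "left")
  .join(c2, c2("Number") === 'NumToJ, "left")
  .select('Date, 'NumToJ, 'Type, 'Sport, coalesce('D1, 'D2).as("Description"))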
