Spark DataFrame: join and fill by ID

Question

I'm having trouble joining two DataFrames grouped by ID.

// toDF needs spark.implicits._ (already imported in spark-shell)
val df1 = Seq(
  (1, 1, 100),
  (1, 3, 20),
  (2, 5, 5),
  (2, 2, 10)
).toDF("id", "index", "value")

val df2 = Seq(
  (1, 0),
  (2, 0),
  (3, 0),
  (4, 0),
  (5, 0)
).toDF("index", "value")

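df1 should be joined with df2 on the index column, separately for each id, so that every id gets a row for every index in df2.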

Expected result

id	index	value
1	1	100
1	2	0
1	3	20
1	4	0
1	5	0
2	1	0
2	2	10
2	3	0
2	4	0
2	5	5

Please help me solve this.

Answer 1

First, I would replace your df2 table with this:

var df2 = Seq(
  (Array(1, 2), Array(1, 2, 3, 4, 5))
).toDF("id", "index")

This lets us use explode on each array column in turn, which generates every (id, index) combination and gives us a helper table:

import org.apache.spark.sql.functions.{col, explode}

df2 = df2
  .withColumn("id", explode(col("id")))
  .withColumn("index", explode(col("index")))

This gives:

+---+-----+
|id |index|
+---+-----+
|1  |1    |
|1  |2    |
|1  |3    |
|1  |4    |
|1  |5    |
|2  |1    |
|2  |2    |
|2  |3    |
|2  |4    |
|2  |5    |
+---+-----+
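If you would rather not hard-code the ids, the same helper table can probably be built with a cross join. The following is a minimal sketch, not part of the original answer: it assumes Spark is on the classpath, and the names `indexes` and `grid` and the local `SparkSession` settings are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell, getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder().master("local[1]").appName("grid-sketch").getOrCreate()
import spark.implicits._

// Same df1 as in the question.
val df1 = Seq((1, 1, 100), (1, 3, 20), (2, 5, 5), (2, 2, 10)).toDF("id", "index", "value")

// Hypothetical helper: the full list of indexes every id should have.
val indexes = Seq(1, 2, 3, 4, 5).toDF("index")

// Every distinct id from df1 paired with every index.
val grid = df1.select("id").distinct().crossJoin(indexes).orderBy("id", "index")

grid.show(false)
```

The trade-off is that the ids now come from the data itself, so an id that never appears in df1 will not appear in the grid either.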

Now all we need to do is left-join your df1 onto it and replace the missing values with 0, like this:

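import org.apache.spark.sql.functions.{col, when}

df2 = df2
  .join(df1, Seq("id", "index"), "left")
  // rows with no match in df1 have a null value; turn those into 0
  .withColumn("value", when(col("value").isNull, 0).otherwise(col("value")))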

We get this final output:

+---+-----+-----+
|id |index|value|
+---+-----+-----+
|1  |1    |100  |
|1  |2    |0    |
|1  |3    |20   |
|1  |4    |0    |
|1  |5    |0    |
|2  |1    |0    |
|2  |2    |10   |
|2  |3    |0    |
|2  |4    |0    |
|2  |5    |5    |
+---+-----+-----+
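The null replacement after a left join can also be written more compactly. This is a hedged sketch rather than the answerer's code: the small `grid` and `values` frames are illustrative stand-ins for the tables above, and it assumes Spark is on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[1]").appName("fill-sketch").getOrCreate()
import spark.implicits._

// Illustrative data: a 2 x 2 grid and two known values.
val grid = Seq((1, 1), (1, 2), (2, 1), (2, 2)).toDF("id", "index")
val values = Seq((1, 1, 100), (2, 2, 10)).toDF("id", "index", "value")

// Unmatched grid rows come back with a null value.
val joined = grid.join(values, Seq("id", "index"), "left")

// Option 1: coalesce picks the first non-null argument.
val viaCoalesce = joined.withColumn("value", coalesce(col("value"), lit(0)))

// Option 2: DataFrameNaFunctions.fill, restricted to the value column.
val viaNaFill = joined.na.fill(0L, Seq("value"))

viaCoalesce.orderBy("id", "index").show(false)
```

Both variants should behave like the when/otherwise expression; which one reads better is mostly a matter of taste.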

This should be what you want. Good luck!

Solution 1
1 accepted 2023-01-07 14:27:46
