
How to join 2 RDDs in Spark Scala

I have 2 RDDs as below:

val rdd1 = spark.sparkContext.parallelize(Seq((123, List(("000000011119",20),("000000011120",30),("000000011121",50))),(234, List(("000000011119",20),("000000011120",30),("000000011121",50)))))
val rdd2 = spark.sparkContext.parallelize(Seq((123, List("000000011119","000000011120")),(234, List("000000011121","000000011120"))))

I want to sum the values in rdd1 based on the key pairs in rdd2: for each key, only the codes listed in rdd2 should contribute to the sum.

Required output:

RDD[(123,50),(234,80)]

(For key 123, the codes 000000011119 and 000000011120 map to 20 and 30, totalling 50; for key 234, the codes 000000011121 and 000000011120 map to 50 and 30, totalling 80.)

Any help will be appreciated.

Really this is a join on the first element of the row, and on the first element of each entry inside the lists.

So I'd explode it into multiple rows and join that way:

val flat1 = rdd1.flatMap(r => r._2.map(e => ((r._1, e._1), e._2))) // looks like ((234,000000011119),20)
val flat2 = rdd2.flatMap(r => r._2.map(e => ((r._1, e), true))) // dummy true payload, only the key matters; looks like ((234,000000011121),true)

val res =  flat1.join(flat2)
  .map(r => (r._1._1, r._2._1))  // looks like (123, 30)
  .reduceByKey(_ + _)  // total each key group

Result with a .foreach(println):

scala> :pas
// Entering paste mode (ctrl-D to finish)

flat1.join(flat2)
  .map(r => (r._1._1, r._2._1))  // looks like (123, 30)
  .reduceByKey(_ + _)  // total each key group
  .foreach(println)

// Exiting paste mode, now interpreting.

(123,50)
(234,80)

As usual, this sort of thing is much simpler using the Dataset API, so that would be my recommendation going forward.
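For reference, here is a minimal sketch of that Dataset approach, assuming the same rdd1 and rdd2 and a SparkSession named spark in scope (as in the RDD code above); the column names id, code, value and total are illustrative choices, not anything from the original answer.

import spark.implicits._
import org.apache.spark.sql.functions.{explode, sum}

// One row per (id, code, value) triple from rdd1.
val df1 = rdd1.toDF("id", "pairs")
  .select($"id", explode($"pairs").as("pair"))
  .select($"id", $"pair._1".as("code"), $"pair._2".as("value"))

// One row per (id, code) pair from rdd2.
val df2 = rdd2.toDF("id", "codes")
  .select($"id", explode($"codes").as("code"))

df1.join(df2, Seq("id", "code"))      // keep only the codes listed in rdd2
  .groupBy("id")
  .agg(sum("value").as("total"))      // should yield (123,50) and (234,80), as in the RDD version
  .show()

The join on Seq("id", "code") plays the same role as the composite-key join in the RDD version, but Catalyst gets to plan the explode, join, and aggregation as a single optimized pipeline.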
