
How to join 2 RDDs in Spark Scala

I have 2 RDDs as below:

val rdd1 = spark.sparkContext.parallelize(Seq((123, List(("000000011119",20),("000000011120",30),("000000011121",50))),(234, List(("000000011119",20),("000000011120",30),("000000011121",50)))))
val rdd2 = spark.sparkContext.parallelize(Seq((123, List("000000011119","000000011120")),(234, List("000000011121","000000011120"))))

I want to sum the values in rdd1 based on the key pairs in rdd2: for each key, only the codes listed in rdd2 should contribute to the sum.

Required output:

RDD[(123,50),(234,80)]

(For key 123, the codes 000000011119 and 000000011120 map to 20 and 30, totalling 50; for key 234, the codes 000000011121 and 000000011120 map to 50 and 30, totalling 80.)

Any help will be appreciated.

Really this is a join on the first element of the row, and on the first element of each entry inside the lists.

So I'd explode it into multiple rows and join that way:

val flat1 = rdd1.flatMap(r => r._2.map(e => ((r._1, e._1), e._2))) // looks like ((234,000000011119),20)
val flat2 = rdd2.flatMap(r => r._2.map(e => ((r._1, e), true))) // dummy true payload, only the key matters; looks like ((234,000000011121),true)

val res =  flat1.join(flat2)
  .map(r => (r._1._1, r._2._1))  // looks like (123, 30)
  .reduceByKey(_ + _)  // total each key group

Result with a .foreach(println):

scala> :pas
// Entering paste mode (ctrl-D to finish)

flat1.join(flat2)
  .map(r => (r._1._1, r._2._1))  // looks like (123, 30)
  .reduceByKey(_ + _)  // total each key group
  .foreach(println)

// Exiting paste mode, now interpreting.

(123,50)
(234,80)

As usual, this sort of thing is much simpler using the Dataset API, so that would be my recommendation going forward.
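For reference, here is a minimal sketch of that Dataset approach, assuming the same rdd1 and rdd2 and a SparkSession named spark in scope (as in the RDD code above); the column names id, code, value and total are illustrative choices, not anything from the original answer.

import spark.implicits._
import org.apache.spark.sql.functions.{explode, sum}

// One row per (id, code, value) triple from rdd1.
val df1 = rdd1.toDF("id", "pairs")
  .select($"id", explode($"pairs").as("pair"))
  .select($"id", $"pair._1".as("code"), $"pair._2".as("value"))

// One row per (id, code) pair from rdd2.
val df2 = rdd2.toDF("id", "codes")
  .select($"id", explode($"codes").as("code"))

df1.join(df2, Seq("id", "code"))      // keep only the codes listed in rdd2
  .groupBy("id")
  .agg(sum("value").as("total"))      // should yield (123,50) and (234,80), as in the RDD version
  .show()

The join on Seq("id", "code") plays the same role as the composite-key join in the RDD version, but Catalyst gets to plan the explode, join, and aggregation as a single optimized pipeline.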
