
Combine two different RDDs with different keys in Scala

I have two text files that have already been loaded as RDDs by the SparkContext.

One of them (rdd1) stores related words:

apple,apples
car,cars
computer,computers

The other one (rdd2) stores the number of items for each word:

(apple,12)
(apples, 50)
(car,5)
(cars,40)
(computer,77)
(computers,11)

I want to combine these two RDDs.

Desired output:

(apple, 62)
(car,45)
(computer,88)

How can I code this?

The meat of the work is to pick a key for each group of related words. Here I just select the first word of each pair, but you could do something more intelligent than picking an arbitrary word.

Explanation:

  1. Create the data
  2. Pick a key for each group of related words
  3. Flatmap the tuples so that we can join on the key we picked
  4. Join the RDDs
  5. Map the joined RDD back into a (key, count) tuple
  6. Reduce by key
// sc is the SparkContext, e.g. the one provided by spark-shell
val s = Seq(("apple", "apples"), ("car", "cars")) // create data: related words
val rdd = sc.parallelize(s)
val t = Seq(("apple", 12), ("apples", 50), ("car", 5), ("cars", 40)) // create data: item counts
val rdd2 = sc.parallelize(t)
val keyed = rdd
  .flatMap { case (a, b) => Seq((a, a), (b, a)) } // (word, key); could be replaced with any function that selects the key for all of the related words
  .join(rdd2) // complete the join: (word, (key, count))
  .map { case (_, (key, count)) => (key, count) } // recreate a tuple and throw away the related word
  .reduceByKey(_ + _) // sum the counts per key
keyed.collect().foreach(println) // bring the result to the driver to show it works
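
To check the logic of the steps above without a Spark cluster, the same pipeline can be sketched with plain Scala collections. This is only an illustration: the collections stand in for the RDDs, the names `relatedWords`, `itemCounts` and `combineCounts` are made up for this sketch, and the top-level definitions are meant to be pasted into the Scala REPL or a script.

```scala
// Plain Scala collections stand in for the two RDDs (an assumption made so
// the logic can be checked without Spark).
val relatedWords: Seq[(String, String)] =
  Seq(("apple", "apples"), ("car", "cars"), ("computer", "computers"))

val itemCounts: Seq[(String, Int)] = Seq(
  ("apple", 12), ("apples", 50), ("car", 5),
  ("cars", 40), ("computer", 77), ("computers", 11)
)

// Hypothetical helper: maps every word to the key chosen for its group,
// joins the counts on the word, and sums the counts per key.
def combineCounts(
    related: Seq[(String, String)],
    counts: Seq[(String, Int)]
): Map[String, Int] = {
  // Same idea as the flatMap step: both words of a pair point at the first word.
  val keyOf: Map[String, String] =
    related.flatMap { case (a, b) => Seq(a -> a, b -> a) }.toMap

  counts
    // Inner join on the word: words without a known key are dropped.
    .flatMap { case (word, n) => keyOf.get(word).map(key => key -> n) }
    // Equivalent of reduceByKey(_ + _): group by key and sum the counts.
    .groupBy { case (key, _) => key }
    .map { case (key, pairs) => key -> pairs.map(_._2).sum }
}

val totals: Map[String, Int] = combineCounts(relatedWords, itemCounts)
```

With the sample data from the question, `totals` maps apple to 62, car to 45 and computer to 88. Like the RDD `join`, this sketch silently drops any counted word that has no entry in the related-words data.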

Even though this solves your problem, there are more elegant solutions using DataFrames that you may wish to look into. You could also use reduce directly on the RDD and skip the step of mapping back to a tuple. I think that would be a better solution, but I wanted to keep it simple so that it better illustrates what I did.
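
As a rough idea of what a DataFrame version might look like, here is a minimal sketch. It is not part of the original answer: it assumes Spark SQL is on the classpath, that a local session is acceptable, and that it is run in spark-shell or as a script. The column names (`key`, `variant`, `word`, `items`, `total`) and the value names are invented for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Assumption: a local session is fine for trying this out.
val spark = SparkSession.builder()
  .appName("combine-related-words")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Related words: the first word of each pair is used as the key.
val related = Seq(("apple", "apples"), ("car", "cars"), ("computer", "computers"))
  .toDF("key", "variant")

// Item counts per word.
val counts = Seq(
  ("apple", 12), ("apples", 50), ("car", 5),
  ("cars", 40), ("computer", 77), ("computers", 11)
).toDF("word", "items")

// Lookup table (key, word): every word, including the key itself, points at its key.
val lookup = related.select(col("key"), col("key").as("word"))
  .union(related.select(col("key"), col("variant").as("word")))

// Join on the word, then sum the item counts per key.
val totalsDf = lookup
  .join(counts, "word")
  .groupBy("key")
  .agg(sum("items").as("total"))

totalsDf.show()
```

The lookup table plays the role of the flatMap step, and `groupBy` with `agg` replaces the map and `reduceByKey` steps, so no intermediate tuple has to be rebuilt by hand.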
