
Combine two different RDDs with different keys in Scala

I have two text files that have already been loaded as RDDs through the SparkContext.

One of them (rdd1) holds pairs of related words:

apple,apples
car,cars
computer,computers

The other one (rdd2) holds the number of items for each word:

(apple,12)
(apples, 50)
(car,5)
(cars,40)
(computer,77)
(computers,11)

I want to combine these two RDDs.

Desired output:

(apple, 62)
(car,45)
(computer,88)

How can I code this?

The meat of the work is to pick a key for each group of related words. Here I simply use the first word of each pair, but you could do something more intelligent than that.

Explanation:

  1. Create the data
  2. Pick a key for each group of related words
  3. flatMap the word pairs into (word, key) tuples so that we can join on the word
  4. Join the RDDs
  5. Map each joined record back into a (key, count) tuple
  6. Reduce by key to sum the counts
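// `sc` is the SparkContext, e.g. the one provided by spark-shell
val s = Seq(("apple", "apples"), ("car", "cars"), ("computer", "computers")) // create data: related words
val rdd = sc.parallelize(s)
val t = Seq(("apple", 12), ("apples", 50), ("car", 5), ("cars", 40), ("computer", 77), ("computers", 11)) // create data: item counts
val rdd2 = sc.parallelize(t)
val keyed = rdd.flatMap { case (a, b) => Seq((a, a), (b, a)) } // (word, key); could be replaced with any function that selects the key to use for all of the related words
  .join(rdd2) // complete the join: (word, (key, count))
  .map { case (_, (key, count)) => (key, count) } // recreate a tuple and throw away the related word
  .reduceByKey(_ + _) // sum the counts per key
keyed.collect().foreach(println) // to show it works; collect() so that the output is printed on the driver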
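
To sanity-check the logic without a Spark cluster, the same steps can be mirrored on plain Scala collections. This is only a sketch: the names `related`, `counts`, `toKey` and `totals` are made up for illustration, the first word of each pair is assumed to be the key (as above), and it says nothing about how Spark actually executes the job.

```scala
// Related words; the first word of each pair is assumed to be the group key.
val related = Seq(("apple", "apples"), ("car", "cars"), ("computer", "computers"))
val counts = Seq(("apple", 12), ("apples", 50), ("car", 5), ("cars", 40), ("computer", 77), ("computers", 11))

// Mirrors the flatMap step: every word points to the key of its group.
val toKey: Map[String, String] = related.flatMap { case (a, b) => Seq(a -> a, b -> a) }.toMap

// Mirrors join + map + reduceByKey: relabel each count with its key, then sum per key.
val totals: Map[String, Int] = counts
  .flatMap { case (word, n) => toKey.get(word).map(key => key -> n) } // like an inner join: words without a key are dropped
  .groupBy(_._1)
  .map { case (key, pairs) => key -> pairs.map(_._2).sum }

totals.toSeq.sortBy(_._1).foreach(println)
```

Pasted into the Scala REPL or spark-shell, this prints (apple,62), (car,45) and (computer,88), matching the desired output in the question.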

Even though this solves your problem, there are more elegant solutions using DataFrames that you may wish to look into. You could also use reduce directly on the RDD and skip the step of mapping back to a tuple. I think that would be a better solution, but I wanted to keep it simple so that it is more illustrative of what I did.
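
For reference, one way the DataFrame route might look is sketched below. It is not taken from the original answer: the column names (`key`, `variant`, `word`, `n`, `total`), the app name and the local master setting are all assumptions chosen for illustration, and the first word of each pair is again assumed to be the key.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Local session for illustration; in spark-shell getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("related-words").getOrCreate()
import spark.implicits._

// Hypothetical column names: `key` is the first word of each pair, `variant` the second.
val words = Seq(("apple", "apples"), ("car", "cars"), ("computer", "computers")).toDF("key", "variant")
val counts = Seq(("apple", 12), ("apples", 50), ("car", 5), ("cars", 40), ("computer", 77), ("computers", 11)).toDF("word", "n")

// One (word, key) row per word, the DataFrame counterpart of the flatMap step.
val lookup = words.select($"key".as("word"), $"key")
  .union(words.select($"variant".as("word"), $"key"))

// Inner join on the word, then sum the counts per key.
val totals = counts.join(lookup, "word").groupBy("key").agg(sum("n").as("total"))
totals.show()
```

Note that `sum` over an integer column yields a long column, so `total` is a `Long` here rather than an `Int`.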


 