
Creating a pair RDD in Spark using Scala

I'm new to Spark, and I need to create an RDD whose values each contain just two elements.

Array1 = ((1,1),(1,2),(1,3),(2,1),(2,2),(2,3))

When I execute groupByKey, the output is ((1,(1,2,3)),(2,(1,2,3)))

But I need the output to have just two values paired with each key. I'm not sure how to get that.

Expected output = ((1,(1,2)),(1,(1,3)),(1,(2,3)),(2,(1,2)),(2,(1,3)),(2,(2,3)))

Each pair of values should appear only once. For example, there should be (1,2) but not also (2,1), and (2,3) but not also (3,2).

Thanks

You can get the result you require as follows:

// `sc` is the SparkContext that spark-shell provides.
import org.apache.spark.rdd.RDD

// Prior to doing the `groupBy`, you have an RDD[(Int, Int)], x, containing:
//   (1,1),(1,2),(1,3),(2,1),(2,2),(2,3)
//
// You can simply map the values as below. The result is an RDD[(Int, (Int, Int))].
val x: RDD[(Int, Int)] = sc.parallelize(Seq((1,1),(1,2),(1,3),(2,1),(2,2),(2,3)))
val y: RDD[(Int, (Int, Int))] = x.map(t => (t._1, t)) // Key each pair by its first element
y.collect // Get result as an array
// res0: Array[(Int, (Int, Int))] = Array((1,(1,1)), (1,(1,2)), (1,(1,3)), (2,(2,1)), (2,(2,2)), (2,(2,3)))

That is, the result is a pair RDD that relates the key (the first value of each pair) to the pair (as a tuple). Do not use groupBy, since in this case it will not give you what you want.

If I understand your requirement correctly, you can use groupByKey and flatMapValues to flatten the 2-combinations of the grouped values, as shown below:

val rdd = sc.parallelize(Seq(
  (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)
))

rdd.groupByKey.flatMapValues(_.toList.combinations(2)).
  map{ case (k, v) => (k, (v(0), v(1))) }.
  collect
// res1: Array[(Int, (Int, Int))] =
//   Array((1,(1,2)), (1,(1,3)), (1,(2,3)), (2,(1,2)), (2,(1,3)), (2,(2,3)))
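
To see what combinations(2) does without starting Spark, here is a minimal sketch of the same logic on plain Scala collections. The names `pairs` and `pairsPerKey` are illustrative and not from the question. The values are sorted before combining, on the assumption that you want the smaller value first in every pair. Spark's groupByKey does not guarantee the order of values within a group, so in the Spark version the equivalent precaution would be `_.toList.sorted.combinations(2)`.

```scala
// Plain Scala collections, no SparkContext needed.
// `pairs` and `pairsPerKey` are hypothetical names used only for this sketch.
val pairs = Seq((1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3))

val pairsPerKey: Seq[(Int, (Int, Int))] =
  pairs
    .groupBy(_._1)              // Map(key -> all (key, value) tuples with that key)
    .toSeq
    .sortBy(_._1)               // Map iteration order is unspecified, so order by key
    .flatMap { case (k, kvs) =>
      kvs.map(_._2).sorted      // sort values so each pair comes out smaller-first
        .combinations(2)        // every 2-element selection, ignoring order
        .map { case Seq(a, b) => (k, (a, b)) }
    }

pairsPerKey.foreach(println)
```

Note that combinations treats repeated values as indistinguishable, so if a key can carry the same value more than once, a given pair is still emitted only once.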
