
Convert RDD[(K,V)] to Map[K,List[V]]

How can I convert an RDD of Tuple2 (key, value) pairs with duplicate keys into a Map[K,List[V]]?

Input example:

val list = List((1,"a"),(1,"b"),(2,"c"),(2,"d"))
val rdd = sparkContext.parallelize(list)

Output expected:

Map(1 -> List(a, b), 2 -> List(c, d))

Just use groupByKey, then collectAsMap:

val rdd = sc.parallelize(List((1,"a"),(1,"b"),(2,"c"),(2,"d")))

rdd.groupByKey.collectAsMap
// res1: scala.collection.Map[Int,Iterable[String]] =
//   Map(2 -> CompactBuffer(c, d), 1 -> CompactBuffer(a, b))
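For intuition, the same grouping can be sketched with plain Scala collections, no Spark required; `groupBy` plays the role of `groupByKey` here, and mapping each group to its values mirrors what `collectAsMap` returns. The object and method names below are mine, not from the answer:

```scala
object GroupBySketch {
  // Local analogue of rdd.groupByKey.collectAsMap on a List:
  // group the pairs by key, then keep only the values in each group.
  def groupValues[K, V](pairs: List[(K, V)]): Map[K, List[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    val list = List((1, "a"), (1, "b"), (2, "c"), (2, "d"))
    println(groupValues(list))
  }
}
```

One difference worth noting: `groupByKey` yields an `Iterable[V]` per key (a `CompactBuffer` at runtime), whereas this local sketch yields a `List[V]` directly.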

Alternatively, use map/reduceByKey, then collectAsMap:

rdd.map { case (k, v) => (k, Seq(v)) }
  .reduceByKey(_ ++ _)
  .collectAsMap
// res2: scala.collection.Map[Int,Seq[String]] =
//   Map(2 -> List(c, d), 1 -> List(a, b))
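The map/reduceByKey variant has the same shape as a local fold: wrap each value, then append values that share a key. A plain-Scala sketch of that idea (names are mine, not from the answer):

```scala
object ReduceByKeySketch {
  // Local analogue of rdd.map { case (k, v) => (k, Seq(v)) }.reduceByKey(_ ++ _):
  // start from an empty map and append each value to its key's list.
  def reduceValues[K, V](pairs: List[(K, V)]): Map[K, List[V]] =
    pairs.foldLeft(Map.empty[K, List[V]]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, List.empty[V]) :+ v)
    }

  def main(args: Array[String]): Unit = {
    val list = List((1, "a"), (1, "b"), (2, "c"), (2, "d"))
    println(reduceValues(list))
  }
}
```

In Spark, the reduceByKey form shuffles whole per-key lists, so for large groups groupByKey followed by a local `toList` is usually no worse; the fold above is only meant to show the combining logic.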

You can use groupByKey, collectAsMap and map to achieve this, as below:

val rdd = sc.parallelize(List((1,"a"),(1,"b"),(2,"c"),(2,"d")))
val map = rdd.groupByKey.collectAsMap.map { case (k, vs) => (k, vs.toList) }

Sample output:

Map(2 -> List(c, d), 1 -> List(a, b))
