How to sort an RDD of tuples with 5 elements in Spark Scala?

Question

I have an RDD of tuples with 5 elements, e.g. RDD[(Double, String, Int, Double, Double)].

How can I sort this RDD efficiently using the fifth element?

I tried mapping this RDD into key-value pairs and using sortByKey, but sortByKey seems quite slow: it is slower than collecting the RDD and calling sortWith on the collected array. Why is that?

Thank you very much.

Answer 1

You can do this with sortBy, which acts directly on the RDD:

myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple

sortBy also takes two optional parameters: the sort order (ascending, which defaults to true) and the number of partitions of the result (numPartitions).
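As a rough illustration of those optional parameters, here is a minimal sketch that sorts a small 5-tuple RDD in both directions. The object name SortByFifth, the sample rows, and the local[*] master are assumptions made for the example, not part of the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortByFifth {
  // Shape of the tuples from the question.
  type Row = (Double, String, Int, Double, Double)

  // Ascending sort on the 5th field (ascending = true is the default).
  def sortAscending(rdd: RDD[Row]): RDD[Row] =
    rdd.sortBy(_._5)

  // Descending sort on the 5th field, with an explicit partition count.
  def sortDescending(rdd: RDD[Row], parts: Int): RDD[Row] =
    rdd.sortBy(_._5, ascending = false, numPartitions = parts)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, only for trying the example out.
    val conf = new SparkConf().setAppName("sort-by-fifth").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data.
      val myRdd: RDD[Row] = sc.parallelize(Seq(
        (1.0, "a", 1, 0.5, 3.2),
        (2.0, "b", 2, 1.5, 0.7),
        (3.0, "c", 3, 2.5, 1.9)
      ))
      sortAscending(myRdd).collect().foreach(println)
      sortDescending(myRdd, 2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Named arguments keep the call readable when only one of the optional parameters is changed.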

Answer 2

If you want to sort in descending order and the sort key is numeric (for example an Int), you can negate the key with the "-" sign.

For example:

I have an RDD of (String, Int) tuples. To sort this RDD by its 2nd element in descending order:

rdd.sortBy(x => -x._2).collect().foreach(println)

I have an RDD of (String, String) tuples. A String key cannot be negated, so to sort this RDD by its 2nd element in descending order, pass false for the ascending parameter:

rdd.sortBy(x => x._2, false).collect().foreach(println)
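The two descending-sort variants above can be put side by side in a small sketch. The object name DescendingSort and the sample data are hypothetical, and a local master is assumed. One caveat worth keeping in mind: negating Int.MinValue overflows back to Int.MinValue, so the ascending = false form is the safer choice when such values can occur.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DescendingSort {
  // Numeric key: negate it so that the default ascending sort yields descending order.
  def byNegation(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.sortBy(x => -x._2)

  // Any ordered key (here a String): ask sortBy for descending order directly.
  def byFlag(rdd: RDD[(String, String)]): RDD[(String, String)] =
    rdd.sortBy(_._2, ascending = false)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, only for trying the example out.
    val conf = new SparkConf().setAppName("descending-sort").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data.
      val counts = sc.parallelize(Seq(("a", 3), ("b", 10), ("c", 7)))
      val names = sc.parallelize(Seq(("x", "pear"), ("y", "apple"), ("z", "fig")))
      byNegation(counts).collect().foreach(println) // b, c, a
      byFlag(names).collect().foreach(println)      // pear, fig, apple
    } finally {
      sc.stop()
    }
  }
}
```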

Answer 3

sortByKey is the only distributed sorting API for Spark 1.0.

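How much data are you trying to sort? For a small amount, local (centralized) sorting will be faster, because a distributed sort has to sample the keys and shuffle data between partitions, and that overhead dominates on small inputs. Spark shines when you need to sort many gigabytes of data that may not even fit on a single node.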
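For reference, the two approaches compared in the question might look roughly like the sketch below: a distributed sort via key-value pairs and sortByKey, and a local sort of the collected array with sortWith. The object name SortComparison and the sample rows are hypothetical. On Spark versions before 1.3, the pair-RDD implicits also need `import org.apache.spark.SparkContext._`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortComparison {
  type Row = (Double, String, Int, Double, Double)

  // Distributed: key each row by its 5th field, sort by key, then drop the key.
  def viaSortByKey(rdd: RDD[Row]): RDD[Row] =
    rdd.keyBy(_._5).sortByKey().values

  // Local: pull everything to the driver and sort there.
  // Only reasonable when the whole data set fits in driver memory.
  def locally(rdd: RDD[Row]): Array[Row] =
    rdd.collect().sortWith(_._5 < _._5)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, only for trying the example out.
    val conf = new SparkConf().setAppName("sort-comparison").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data with distinct 5th fields.
      val myRdd: RDD[Row] = sc.parallelize(Seq(
        (1.0, "a", 1, 0.5, 3.2),
        (2.0, "b", 2, 1.5, 0.7),
        (3.0, "c", 3, 2.5, 1.9)
      ))
      viaSortByKey(myRdd).collect().foreach(println)
      locally(myRdd).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With distinct keys both paths produce the same order; which one is faster depends on the data volume, as the answer above notes.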

3 answers

solution1
9 ACCEPTED 2015-10-13 07:24:47

solution2
3 2016-06-14 14:54:31

solution3
1 2015-10-14 07:15:30
