
How to sort an RDD of tuples with 5 elements in Spark Scala?

Suppose I have an RDD of tuples with 5 elements, e.g., RDD[(Double, String, Int, Double, Double)].

How can I sort this RDD efficiently by the fifth element?

I tried mapping this RDD into key-value pairs and using sortByKey, but sortByKey seems quite slow. It is slower than collecting the RDD and calling sortWith on the collected array. Why is that?

Thank you very much.

You can do this with sortBy, which acts directly on the RDD:

myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple

sortBy also takes two optional parameters: the sort order ("ascending", true by default) and the number of partitions ("numPartitions").
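As a minimal sketch of those optional parameters, the snippet below sorts a small 5-tuple RDD by its fifth field in descending order and asks for two output partitions. The object name, the sample rows, the app name, and the local[*] master are all made up for illustration; only sortBy and its ascending and numPartitions parameters come from the RDD API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByFifthField {
  type Row = (Double, String, Int, Double, Double)

  // Hypothetical sample rows shaped like the question's 5-tuples.
  val rows: Seq[Row] = Seq(
    (1.0, "a", 1, 0.5, 3.0),
    (2.0, "b", 2, 0.1, 9.5),
    (3.0, "c", 3, 0.9, 1.25)
  )

  def sortedDescending(sc: SparkContext): Array[Row] = {
    val myRdd = sc.parallelize(rows)
    // ascending = false reverses the order;
    // numPartitions sets how many partitions the sorted RDD has.
    myRdd.sortBy(_._5, ascending = false, numPartitions = 2).collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to try this on one machine.
    val conf = new SparkConf().setAppName("sort-by-fifth").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try sortedDescending(sc).foreach(println)
    finally sc.stop()
  }
}
```

The collect() at the end is only there to print the result; on a real dataset you would normally keep working with the sorted RDD instead.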

If you want to sort in descending order and the sort field is numeric (for example, an Int), you can negate it with the "-" sign.

For example:

Given an RDD of (String, Int) tuples, to sort it by the 2nd element in descending order:

rdd.sortBy(x => -x._2).collect().foreach(println);

Given an RDD of (String, String) tuples, where negation is not available, pass false as the second ("ascending") argument to sort by the 2nd element in descending order:

rdd.sortBy(x => x._2, false).collect().foreach(println);

sortByKey is the only distributed sorting API for Spark 1.0.

How much data are you trying to sort? For a small amount, sorting locally on a single machine will be faster, because a distributed sort has to shuffle data between nodes. Spark shines when you need to sort many gigabytes of data that may not even fit on a single node.
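To make the two options concrete, here is a rough sketch of both paths side by side. The object and function names are hypothetical, and the sample rows and local[*] master are assumptions for a quick local run; where the cut-over point lies depends on your data and cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LocalVsDistributedSort {
  type Row = (Double, String, Int, Double, Double)

  // Small data: pull everything to the driver and sort the in-memory array.
  // This only works if the whole dataset fits in driver memory.
  def sortOnDriver(rdd: RDD[Row]): Array[Row] =
    rdd.collect().sortBy(_._5)

  // Large data: sort across the cluster; the result stays distributed.
  def sortOnCluster(rdd: RDD[Row]): RDD[Row] =
    rdd.sortBy(_._5)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("local-vs-distributed").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample rows with distinct fifth fields.
      val rdd = sc.parallelize(Seq[Row](
        (1.0, "a", 1, 0.5, 3.0),
        (2.0, "b", 2, 0.1, 9.5),
        (3.0, "c", 3, 0.9, 1.25)
      ))
      sortOnDriver(rdd).foreach(println)
      sortOnCluster(rdd).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Both calls give the same ordering here; the difference is where the work happens and whether the result has to fit on one machine.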
