
Spark: sort an RDD and join it with its rank

I have an RDD[(VertexId, Double)], and I want to sort it by _._2 and then join each element with its index (rank). That way I can retrieve an element and its rank with filter.

Currently I sort the RDD with sortBy, but I do not know how to join an RDD with its rank. So I collect it as a sequence and zip it with its indices, but this is not efficient. I am wondering whether there is a more elegant way to do it.

The code I'm using right now is:

val tmpRes = graph.vertices.sortBy(_._2, ascending = false) // sort all nodes by their PR scores in descending order
  .collect() // collect to the driver; this may be very expensive

tmpRes.zip(tmpRes.indices) // zip with index to get the rank

If, by any chance, you only want to bring the first n tuples back to the driver, you could use takeOrdered(n, [ordering]), where n is the number of results to bring back and ordering is the comparator you'd like to use.
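For instance, a minimal sketch of that approach against the RDD from the question, assuming only the top-n ranks are needed on the driver (the value n = 10 is illustrative):

// Take the n highest-scoring vertices; Ordering.by with a negated score
// sorts in descending order. Assumes graph.vertices: RDD[(VertexId, Double)].
val n = 10
val topN = graph.vertices.takeOrdered(n)(Ordering.by[(VertexId, Double), Double](-_._2))

// topN is a small Array already sorted by descending score, so zipping
// with indices on the driver is cheap and yields the rank of each element.
val ranked = topN.zipWithIndex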

Otherwise, you can use the zipWithIndex transformation, which will turn your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] carrying the proper index (of course, you should apply it after your sort).

For example:

scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))
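Applied to the RDD from the question, that would look roughly like this (a sketch, assuming graph.vertices as in the original code; someId is a hypothetical vertex id used for the lookup):

// Sort by score in descending order, then attach a 0-based rank to every
// element without collecting the whole RDD to the driver.
val ranked = graph.vertices
  .sortBy(_._2, ascending = false)
  .zipWithIndex() // RDD[((VertexId, Double), Long)]

// Look up a single vertex and its rank with filter, as the question asks.
val rankOfVertex = ranked.filter { case ((id, _), _) => id == someId }.collect()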

Regards,
