
Spark: sort an RDD and join it with its rank

I have an RDD[(VertexId, Double)], and I want to sort it by _._2 and then join each element with its index (rank). That way I can retrieve an element and its rank with filter.

Currently I sort the RDD with sortBy, but I do not know how to join an RDD with its rank. So I collect it as a sequence and zip it with its indices, but this is not efficient. I am wondering whether there is a more elegant way to do it.

The code I'm using right now is:

val tmpRes = graph.vertices.sortBy(_._2, ascending = false) // sort all nodes by their PR scores in descending order
  .collect() // collect to the driver; this may be very expensive

tmpRes.zip(tmpRes.indices) // zip with index to get the rank

If, by any chance, you only want to bring the first n tuples back to the driver, you could use takeOrdered(n, [ordering]), where n is the number of results to bring back and ordering is the comparator you'd like to use.
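For instance, a minimal sketch of that approach against the RDD from the question, assuming only the top-n ranks are needed on the driver (the value n = 10 is illustrative):

// Take the n highest-scoring vertices; Ordering.by with a negated score
// sorts in descending order. Assumes graph.vertices: RDD[(VertexId, Double)].
val n = 10
val topN = graph.vertices.takeOrdered(n)(Ordering.by[(VertexId, Double), Double](-_._2))

// topN is a small Array already sorted by descending score, so zipping
// with indices on the driver is cheap and yields the rank of each element.
val ranked = topN.zipWithIndex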

Otherwise, you can use the zipWithIndex transformation, which will turn your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] carrying the proper index (of course, you should apply it after your sort).

For example:

scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))
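Applied to the RDD from the question, that would look roughly like this (a sketch, assuming graph.vertices as in the original code; someId is a hypothetical vertex id used for the lookup):

// Sort by score in descending order, then attach a 0-based rank to every
// element without collecting the whole RDD to the driver.
val ranked = graph.vertices
  .sortBy(_._2, ascending = false)
  .zipWithIndex() // RDD[((VertexId, Double), Long)]

// Look up a single vertex and its rank with filter, as the question asks.
val rankOfVertex = ranked.filter { case ((id, _), _) => id == someId }.collect()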

Regards,
