Spark: sort an RDD and join each element with its rank
I have an RDD[(VertexId, Double)], and I want to sort it by _._2 and join each element with its index (rank), so that I can then look up an element and its rank with filter.
Currently I sort the RDD with sortBy, but I do not know how to join the RDD with its ranks. So I collect it as a sequence and zip it with its indices, but this is not efficient. I am wondering if there is a more elegant way to do that.
The code I'm using right now is:
val tmpRes = graph.vertices
  .sortBy(_._2, ascending = false) // sort all vertices by PR score, descending
  .collect()                       // collect to the driver; this may be very expensive
tmpRes.zip(tmpRes.indices)         // zip each element with its rank
If, by any chance, you'd like to bring back to the driver only the first n tuples, then you could use takeOrdered(n, [ordering]), where n is the number of results to bring back and ordering is the comparator you'd like to use.
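As a sketch of that approach (the sample data here is hypothetical, and the scores are assumed to be the second tuple element):

```scala
// takeOrdered avoids sorting the whole RDD when you only need the top n.
val data = sc.parallelize(List(("A", 1.0), ("B", 3.0), ("C", 2.0)))

// takeOrdered returns the n *smallest* elements under the given Ordering,
// so order by the negated score to get the n highest-scoring tuples.
val top2 = data.takeOrdered(2)(Ordering.by[(String, Double), Double](-_._2))
// top2: Array((B,3.0), (C,2.0))
```

You could then zip top2 with its indices locally, since only n elements are on the driver.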
Otherwise, you can use the zipWithIndex transformation, which will transform your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] with the proper index (of course you should do that after your sort).
For example:
scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))
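To tie this back to the original question, here is a sketch (using hypothetical sample data) of fetching one element together with its rank via filter, without collecting the whole RDD:

```scala
// Rank all elements, then look up a single key and its rank.
val data = sc.parallelize(List(("A", 1), ("B", 2)))
val ranked = data.sortBy(_._2, ascending = false).zipWithIndex()

// filter keeps only the matching key; collect brings back just that pair
val rankOfA = ranked.filter { case ((id, _), _) => id == "A" }.collect()
// rankOfA: Array(((A,1),1))  -- (B,2) ranks 0, (A,1) ranks 1
```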
Regards,