
Spark: sort an RDD and join each element with its rank

I have an RDD[(VertexId, Double)], and I want to sort it by _._2 and join each element with its index (rank), so that I can retrieve an element and its rank with filter.

Currently I sort the RDD with sortBy, but I do not know how to join the RDD with its rank. So I collect it as a sequence and zip it with its index, but this is not efficient. I am wondering if there is a more elegant way to do that.

The code I am using right now is:

val tmpRes = graph.vertices.sortBy(_._2, ascending = false) // sort all vertices by PageRank score, descending
  .collect() // collect to the driver; this may be very expensive

tmpRes.zip(tmpRes.indices) // zip with index

If, by any chance, you'd like to bring back to the driver only the first n tuples, then you could use takeOrdered(n, [ordering]), where n is the number of results to bring back and ordering is the comparator you'd like to use.
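A minimal sketch of takeOrdered for a top-n query, assuming a local SparkContext; the vertex IDs and scores here are made up for illustration. takeOrdered avoids sorting and collecting the whole RDD when only the first n results are needed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TakeOrderedDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("takeOrderedDemo"))

    // Hypothetical (vertexId, score) pairs standing in for graph.vertices
    val scores = sc.parallelize(Seq((1L, 0.3), (2L, 0.9), (3L, 0.5)))

    // Top 2 by score, descending: order by the negated Double component
    val top2 = scores.takeOrdered(2)(Ordering.by[(Long, Double), Double](-_._2))

    top2.foreach(println)
    sc.stop()
  }
}
```

Note that takeOrdered returns a plain Array on the driver, so it only makes sense when n is small.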

Otherwise, you can use the zipWithIndex transformation, which will turn your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] with the proper index (of course, you should do that after your sort).

For example:

scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))

Regards,
