
Spark: sort an RDD and join each element with its rank

I have an RDD[(VertexId, Double)], and I want to sort it by _._2 and join each element with its index (rank), so that I can retrieve an element and its rank with filter.

Currently I sort the RDD with sortBy, but I do not know how to join the RDD with its rank. So I collect it as a sequence and zip it with its index, but this is not efficient. I am wondering if there is a more elegant way to do that.

The code I am using right now is:

val tmpRes = graph.vertices.sortBy(_._2, ascending = false) // sort all vertices by PageRank score, descending
  .collect() // collect to the driver; this may be very expensive

tmpRes.zip(tmpRes.indices) // zip with index

If, by any chance, you'd like to bring back to the driver only the first n tuples, then you could use takeOrdered(n, [ordering]), where n is the number of results to bring back and ordering is the comparator you'd like to use.
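A minimal sketch of takeOrdered for a top-n query, assuming a local SparkContext; the vertex IDs and scores here are made up for illustration. takeOrdered avoids sorting and collecting the whole RDD when only the first n results are needed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TakeOrderedDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("takeOrderedDemo"))

    // Hypothetical (vertexId, score) pairs standing in for graph.vertices
    val scores = sc.parallelize(Seq((1L, 0.3), (2L, 0.9), (3L, 0.5)))

    // Top 2 by score, descending: order by the negated Double component
    val top2 = scores.takeOrdered(2)(Ordering.by[(Long, Double), Double](-_._2))

    top2.foreach(println)
    sc.stop()
  }
}
```

Note that takeOrdered returns a plain Array on the driver, so it only makes sense when n is small.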

Otherwise, you can use the zipWithIndex transformation, which will turn your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] with the proper index (of course, you should do that after your sort).

For example:

scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))

Regards,
