
Converting a Scala method to Spark

The Scala method below returns the k nearest neighbours of an Array:

  def getNearestNeighbours(distances: Array[((String, String), Double)], k: Int, label: String) = {
    // keep only the pairs that involve `label`, then take the k smallest distances
    distances.filter(v => v._1._1.equals(label) || v._1._2.equals(label)).sortBy(_._2).take(k)
  }

I want to run this function in parallel. I can try converting the Array to an RDD, but the RDD type does not support the chained calls .sortBy(_._2).take(k). Is there a way to emulate this method in Spark/Scala?

A possible solution is to modify the method so that the RDD is converted to an Array every time the method is called, but I assume this is computationally expensive for large RDDs:

  def getNearestNeighbours(distances: RDD[((String, String), Double)], k: Int, label: String) = {
    // collect pulls the entire RDD back to the driver before filtering locally
    distances.collect.filter(v => v._1._1.equals(label) || v._1._2.equals(label)).sortBy(_._2).take(k)
  }

Do not collect the RDD. It pulls all the data to one machine. Change your input so it is keyed by the negative distance (RDD[(Double, (String, String))]) and then use RDD.top(k).
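A minimal sketch of that suggestion, keeping the original method's signature (the filtering by label is carried over from the question; top keeps at most k elements per partition, so only a bounded amount of data reaches the driver):

  import org.apache.spark.rdd.RDD

  // Key each record by the *negative* distance so that top(k), which returns
  // the k largest elements, yields the k smallest distances.
  def getNearestNeighbours(distances: RDD[((String, String), Double)],
                           k: Int, label: String): Array[((String, String), Double)] =
    distances
      .filter { case ((a, b), _) => a == label || b == label }
      .map { case (pair, d) => (-d, pair) }        // key by negative distance
      .top(k)                                      // k largest keys == k nearest
      .map { case (negD, pair) => (pair, -negD) }  // restore the original shape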

RDD does have a sortByKey method, which sorts RDDs of pairs by the first element, so if you can create RDD[(Double, (String, String))] instead of RDD[((String, String), Double)] (or just call rdd.map(p => (p._2, p._1))), you can translate the algorithm directly. It also has take, but the documentation says:

Return an array with the first n elements of the dataset. Note that this is currently not executed in parallel. Instead, the driver program computes all the elements.

So I wouldn't expect this to work well.
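For reference, that direct translation would look something like the sketch below (the method name is illustrative). It preserves the original logic, but sortByKey shuffles the whole filtered RDD and, per the quoted documentation, take is computed by the driver:

  import org.apache.spark.rdd.RDD

  // Direct translation: swap the pair so the distance becomes the key,
  // sort by key, and take the first k.
  def nearestBySort(distances: RDD[((String, String), Double)],
                    k: Int, label: String): Array[(Double, (String, String))] =
    distances
      .filter { case ((a, b), _) => a == label || b == label }
      .map(_.swap)                  // (distance, pair): sortByKey sorts by distance
      .sortByKey(ascending = true)
      .take(k)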

Besides, if the data fits on one machine, just working with Arrays (or parallel collections) is likely to be faster. Spark does what it can to minimize overhead, but distributed sorting is going to have some overhead anyway!
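As a sketch of that local route (assuming Scala 2.12, where parallel collections ship with the standard library; on 2.13+ they need the scala-parallel-collections module and an extra import), the expensive filtering scan can run in parallel while the presumably small filtered result is sorted sequentially:

  // import scala.collection.parallel.CollectionConverters._  // needed on Scala 2.13+

  def getNearestNeighboursLocal(distances: Array[((String, String), Double)],
                                k: Int, label: String): Array[((String, String), Double)] =
    distances.par                  // filter in parallel across cores
      .filter { case ((a, b), _) => a == label || b == label }
      .toArray                     // back to a sequential Array
      .sortBy(_._2)                // small filtered result: cheap to sort
      .take(k)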

In addition, sorting the entire array/RDD/other collection when you just need the least n elements is a bad idea (again, especially in cases where you'd want to use Spark). You need a selection algorithm like the ones described in "Worst-case O(n) algorithm for doing k-selection" or "In an integer array with N elements, find the minimum k elements?". Unfortunately, they aren't available in the Scala standard library or in Spark (that I know of).
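Absent a library routine, a bounded max-heap is a simple stand-in (a sketch; kSmallest is a made-up helper name): a single O(n log k) pass keeps only the k smallest distances instead of sorting everything.

  import scala.collection.mutable

  def kSmallest(distances: Array[((String, String), Double)],
                k: Int): Array[((String, String), Double)] = {
    // Max-heap ordered by distance: the head is the largest of the k kept so far.
    val heap = mutable.PriorityQueue.empty[((String, String), Double)](Ordering.by(_._2))
    for (d <- distances) {
      if (heap.size < k) heap.enqueue(d)
      else if (heap.nonEmpty && d._2 < heap.head._2) { heap.dequeue(); heap.enqueue(d) }
    }
    heap.dequeueAll.reverse.toArray  // ascending by distance
  }

Spark's own top and takeOrdered use the same bounded-heap idea internally, which is another reason to prefer them over a full distributed sort.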
