Operate on neighbor elements in RDD in Spark

Suppose I have a collection:

List(1, 3,-1, 0, 2, -4, 6)

It's easy to sort it as:

List(-4, -1, 0, 1, 2, 3, 6)

Then I can construct a new collection by computing 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on, like this:

val list = List(1, 3, -1, 0, 2, -4, 6).sorted
for (i <- 0 to list.length - 2) yield {
    list(i + 1) - list(i)   // subtract the current element from the next one
}

and get a vector:

Vector(3, 1, 1, 1, 1, 3)

That is, I want to subtract the current element from the next one.

But how can I implement this with an RDD in Spark?

I know that for the collection:

List(-4, -1, 0, 1, 2, 3, 6)

there will be some partitions of the collection, each of which is ordered. Can I do a similar operation on each partition and then collect the results from all the partitions together?
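For example, a rough sketch of that per-partition idea (using a hypothetical sortedRdd for the already-sorted RDD) would only compute differences within each partition and would miss the pairs that straddle partition boundaries:

// hypothetical sketch: differences computed inside each partition only,
// so the boundary pairs between partitions are not produced
val withinPartitions = sortedRdd.mapPartitions { it =>
  it.sliding(2).collect { case Seq(x, y) => y - x }
}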

The most efficient solution is to use the sliding method:

import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)                    // sort ascending
  .sliding(2)                          // group each element with its successor
  .map { case Array(x, y) => y - x }   // difference of each adjacent pair
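Collecting the result (the dataset is tiny here, so collecting to the driver is fine) should give the same differences as the local version:

println(rdd.collect.toSeq)   // 3, 1, 1, 1, 1, 3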

Suppose you have something like:

val seq = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)

Let's first create a collection with the index as the key, as Ton Torres suggested:

val original = seq.zipWithIndex.map(_.swap)   // RDD of (index, value) pairs

Now we can build a collection shifted by one element:

// re-key each element to the previous index, so index i now holds the element originally at i + 1
val shifted = original.map { case (idx, v) => (idx - 1, v) }.filter(_._1 >= 0)

Next we can calculate the needed differences, ordered by index descending:

val diffs = original.join(shifted)
      .sortBy(_._1, ascending = false)
      .map { case (idx, (v1, v2)) => v2 - v1 }

So

 println(diffs.collect.toSeq)

shows

WrappedArray(3, 1, 1, 1, 1, 3)

Note that you can skip the sortBy step if the reversed ordering is not critical.
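For example, a minimal variant of the same pipeline that drops the ordering step (the result order then depends on how the join distributes the keys):

val diffsUnordered = original.join(shifted)
  .map { case (_, (v1, v2)) => v2 - v1 }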

Also note that for a local collection this could be computed much more simply:

val elems = List(1, 3, -1, 0, 2, -4, 6).sorted

(elems.tail, elems).zipped.map(_ - _).reverse   // List(3, 1, 1, 1, 1, 3)

But in the case of an RDD, the zip method requires that both collections contain the same number of elements in each partition. So if you were to implement tail like:

val tail = seq.zipWithIndex().filter(_._2 > 0).map(_._1)   // drop the first element

tail.zip(seq) would not work, since both collections need an equal number of elements in each partition, and here one partition has an element that would need to travel to the previous partition.
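A rough way to see the mismatch is to count the elements per partition of both RDDs (the exact error text may vary between Spark versions):

// the counts differ in one partition because the filter removed one element,
// so tail.zip(seq) fails at runtime with an error along the lines of
// "Can only zip RDDs with same number of elements in each partition"
val tailCounts = tail.mapPartitions(it => Iterator(it.size)).collect()
val seqCounts  = seq.mapPartitions(it => Iterator(it.size)).collect()
println(s"${tailCounts.toSeq} vs ${seqCounts.toSeq}")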
