
Apache Spark filter elements

I have two RDDs: points and pointsWithinEps. Their content is shown in the figures below:

[figures: contents of points and of pointsWithinEps]

Each Vector represents an (x, y) coordinate. pointsWithinEps represents two points and the distance between them. I want to loop over all points and, for every point, keep only the elements of pointsWithinEps whose x (first) coordinate is that point. So for the first point it should give the [0] and [1] vectors from pointsWithinEps. I have the following code:

for (i <- 0 until points.count.toInt) {
  val p = points.take(i + 1).drop(i)
  val currentPointNeighbours = pointsWithinEps.filter {
    case ((x, y), distance) =>
      x == p
  }
  currentPointNeighbours.foreach(println)
  println("----")
}

It does not work correctly. What is wrong with the code?

You can do this efficiently if you transform your RDDs into key-value RDDs and then join on the key. For points the key is the point itself; for distances the key is the first point. (In your code, points.take(i + 1).drop(i) returns an Array[Point], so x == p compares a point against an array and never matches; the driver-side loop also rescans pointsWithinEps once per point, which is slow.)

  import org.apache.spark.SparkContext._           // pair-RDD implicits (pre-Spark-1.3 style)
  import org.apache.spark.mllib.linalg.DenseVector // assuming MLlib's DenseVector
  import org.apache.spark.rdd.RDD

  type Point = DenseVector
  type Distance = ((Point, Point), Double)

  val points: RDD[Point] = ???
  val pointsWithinEps: RDD[Distance] = ???

  // Prepare Tuple2 RDDs to enable Spark's pair-RDD functions
  val pointsToKV: RDD[(Point, Unit)] = points.map(p => p -> ())
  val distance: RDD[(Point, Distance)] = pointsWithinEps.map(d => d._1._1 -> d)

  // Join points with distances; keep only the Distance part of each match
  val filtered: RDD[Distance] = pointsToKV.join(distance).map(_._2._2)
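
For reference, here is a minimal, self-contained sketch of the same join approach, assuming MLlib's DenseVector as the point type; the sample coordinates and distances below are invented for illustration, not taken from the figures:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.linalg.DenseVector
  import org.apache.spark.rdd.RDD

  object NeighbourJoinSketch {
    type Point = DenseVector
    type Distance = ((Point, Point), Double)

    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("neighbour-join").setMaster("local[*]"))

      // Hypothetical sample data: three points and the pairs already found within eps
      val a = new DenseVector(Array(0.0, 0.0))
      val b = new DenseVector(Array(1.0, 1.0))
      val c = new DenseVector(Array(1.5, 1.0))

      val points: RDD[Point] = sc.parallelize(Seq(a, b, c))
      val pointsWithinEps: RDD[Distance] = sc.parallelize(Seq(
        ((a, b), 1.4142), // neighbours of a
        ((a, c), 1.8028),
        ((b, c), 0.5)     // neighbour of b
      ))

      // Key points by themselves and distances by their first point, then join
      val pointsToKV: RDD[(Point, Unit)] = points.map(p => p -> ())
      val distance: RDD[(Point, Distance)] = pointsWithinEps.map(d => d._1._1 -> d)
      val filtered: RDD[Distance] = pointsToKV.join(distance).map(_._2._2)

      // Group by the first point so each point's neighbours print together
      filtered.groupBy(_._1._1).collect().foreach { case (p, ds) =>
        println(s"$p -> ${ds.mkString(", ")}")
      }

      sc.stop()
    }
  }

Unlike the loop in the question, the join runs as a single distributed operation instead of rescanning pointsWithinEps once per point.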
