
Apache Spark filter elements

I have two RDDs: points and pointsWithinEps. Their contents are shown in the figures below:

[figure: contents of the points RDD] [figure: contents of the pointsWithinEps RDD]

Vector represents an x, y coordinate. pointsWithinEps represents two points and the distance between them. I want to loop over all points and, for every point, filter only those elements of pointsWithinEps that have that point as the x (first) coordinate. So for the first point it should give the [0] and [1] vectors from pointsWithinEps. I have the following code:

for (i <- 0 until points.count.toInt) {
  val p = points.take(i + 1).drop(i)
  val currentPointNeighbours = pointsWithinEps.filter {
    case ((x, y), distance) =>
      x == p
  }
  currentPointNeighbours.foreach(println)
  println("----")
}

It does not work correctly. What is wrong with the code?

The immediate problem is that points.take(i + 1).drop(i) returns an Array[Point], not a single point, so x == p compares a point with an array and never matches; looping over points.count on the driver and calling take for every index is also very expensive.

You can do this efficiently if you transform your RDDs into key-value RDDs and then join on the key. For points the key is the point itself; for distances the key is the first point:

  import org.apache.spark.SparkContext._ // pair-RDD functions such as join (needed for Spark < 1.3)
  import org.apache.spark.rdd.RDD
  import org.apache.spark.mllib.linalg.DenseVector // or whichever DenseVector type your points use

  type Point = DenseVector
  type Distance = ((Point, Point), Double)

  val points: RDD[Point] = ???
  val pointsWithinEps: RDD[Distance] = ???

  // Prepare Tuple2 RDDs to enable Spark's pair-RDD functions
  val pointsToKV: RDD[(Point, Unit)] = points.map(p => p -> ())
  // Key each distance record by its first point
  val distances: RDD[(Point, Distance)] =
    pointsWithinEps.map(distance => distance._1._1 -> distance)

  // Join points with distances and keep only the distance part
  val filtered: RDD[Distance] = pointsToKV.join(distances).map(_._2._2)
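If you also want the result grouped per point, as in the original loop, a minimal sketch (building on the filtered RDD above; the name neighboursByPoint is just illustrative) is to key the joined distances by their first point and group them:

  // Sketch, not part of the original answer: group the surviving distances by
  // their first point so that each point is paired with all of its neighbours.
  val neighboursByPoint: RDD[(Point, Iterable[Distance])] =
    filtered.map(d => d._1._1 -> d).groupByKey()

  // Print each point followed by its neighbouring distance records,
  // mirroring what the per-point loop in the question tried to do.
  neighboursByPoint.foreach { case (p, neighbours) =>
    println(p)
    neighbours.foreach(println)
    println("----")
  }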
