
Apache Spark filter elements

I have two RDDs: points and pointsWithinEps. Their contents are shown in the figures below:

[figure: contents of the points RDD] [figure: contents of the pointsWithinEps RDD]

Vector represents an x, y coordinate. pointsWithinEps represents two points and the distance between them. I want to loop over all points and, for every point, filter only those elements of pointsWithinEps that have that point as the x (first) coordinate. So for the first point it should give the [0] and [1] vectors from pointsWithinEps. I have the following code:

for (i <- 0 until points.count.toInt) {
  val p = points.take(i + 1).drop(i)
  val currentPointNeighbours = pointsWithinEps.filter {
    case ((x, y), distance) =>
      x == p
  }
  currentPointNeighbours.foreach(println)
  println("----")
}

It does not work correctly. What is wrong with the code?

The immediate problem is that points.take(i + 1).drop(i) returns an Array[Point], not a single point, so x == p compares a point with an array and never matches; looping over points.count on the driver and calling take for every index is also very expensive.

You can do this efficiently if you transform your RDDs into key-value RDDs and then join on the key. For points the key is the point itself; for distances the key is the first point:

  import org.apache.spark.SparkContext._ // pair-RDD functions such as join (needed for Spark < 1.3)
  import org.apache.spark.rdd.RDD
  import org.apache.spark.mllib.linalg.DenseVector // or whichever DenseVector type your points use

  type Point = DenseVector
  type Distance = ((Point, Point), Double)

  val points: RDD[Point] = ???
  val pointsWithinEps: RDD[Distance] = ???

  // Prepare Tuple2 RDDs to enable Spark's pair-RDD functions
  val pointsToKV: RDD[(Point, Unit)] = points.map(p => p -> ())
  // Key each distance record by its first point
  val distances: RDD[(Point, Distance)] =
    pointsWithinEps.map(distance => distance._1._1 -> distance)

  // Join points with distances and keep only the distance part
  val filtered: RDD[Distance] = pointsToKV.join(distances).map(_._2._2)
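If you also want the result grouped per point, as in the original loop, a minimal sketch (building on the filtered RDD above; the name neighboursByPoint is just illustrative) is to key the joined distances by their first point and group them:

  // Sketch, not part of the original answer: group the surviving distances by
  // their first point so that each point is paired with all of its neighbours.
  val neighboursByPoint: RDD[(Point, Iterable[Distance])] =
    filtered.map(d => d._1._1 -> d).groupByKey()

  // Print each point followed by its neighbouring distance records,
  // mirroring what the per-point loop in the question tried to do.
  neighboursByPoint.foreach { case (p, neighbours) =>
    println(p)
    neighbours.foreach(println)
    println("----")
  }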
