Apache Spark filter elements

I have two RDDs: points and pointsWithinEps. Their content is in the figures below. Vector represents an x, y coordinate. pointsWithinEps represents two points and the distance between them. I want to loop over all points and, for every point, filter only the elements of pointsWithinEps that have that point as the x (first) coordinate. So for the first point it should give the [0] and [1] vectors from pointsWithinEps. I have the following code:
for (i <- 0 until points.count.toInt) {
  val p = points.take(i + 1).drop(i)
  val currentPointNeighbours = pointsWithinEps.filter {
    case ((x, y), distance) => x == p
  }
  currentPointNeighbours.foreach(println)
  println("----")
}
It does not work correctly. What is wrong with the code?
You can do this efficiently if you transform both RDDs into key-value (KV) RDDs and then join on the key. For points, the key is the point itself; for distances, the key is the first point.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.DenseVector

type Point = DenseVector
type Distance = ((Point, Point), Double)

val points: RDD[Point] = ???
val pointsWithinEps: RDD[Distance] = ???

// Key each point by itself (the Unit value is unused) to enable Spark's pair-RDD functions
val pointsToKV: RDD[(Point, Unit)] = points.map(p => p -> ())
// Key each distance record by its first point
val distance: RDD[(Point, Distance)] = pointsWithinEps.map(d => d._1._1 -> d)
// Join points with distances; keep only the Distance part of each matched pair
val filtered: RDD[Distance] = pointsToKV.join(distance).map(_._2._2)
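To see the join semantics without spinning up a Spark cluster, here is a minimal plain-Scala sketch of the same idea: group the distance records by their first point, then look up each point's group. Points are modeled as (Double, Double) tuples instead of DenseVector, and the names (JoinSketch, neighboursByPoint) are illustrative assumptions, not part of the Spark API.

```scala
object JoinSketch {
  type Point = (Double, Double)
  type Distance = ((Point, Point), Double)

  def neighboursByPoint(points: Seq[Point],
                        distances: Seq[Distance]): Map[Point, Seq[Distance]] = {
    // Key each distance record by its first point, mirroring the RDD map above
    val byFirstPoint: Map[Point, Seq[Distance]] =
      distances.groupBy { case ((p1, _), _) => p1 }
    // "Join": for every point, look up the distance records keyed by it
    points.map(p => p -> byFirstPoint.getOrElse(p, Seq.empty)).toMap
  }

  def main(args: Array[String]): Unit = {
    val points: Seq[Point] = Seq((0.0, 0.0), (1.0, 1.0))
    val distances: Seq[Distance] = Seq(
      (((0.0, 0.0), (1.0, 1.0)), 1.41),
      (((0.0, 0.0), (2.0, 2.0)), 2.83),
      (((1.0, 1.0), (0.0, 0.0)), 1.41)
    )
    val result = neighboursByPoint(points, distances)
    // The first point matches the two records keyed by (0.0, 0.0)
    println(result((0.0, 0.0)).size)
  }
}
```

In Spark the same lookup is distributed: join shuffles both RDDs by key, so each point meets only the distance records that share its key.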