Spark: How to map an RDD when access to another RDD is required
Given two large key-value pair RDDs (d1 and d2), both composed of unique ID keys and vector values (e.g. RDD[(Int, DenseVector[Double])]), I need to map d1 so that, for each of its elements, I obtain the ID of the closest element in d2, using the Euclidean distance between vectors.
I have not found a way to do it using standard RDD transformations. I understand that nested RDDs are not allowed in Spark; however, if they were, an easy solution would be:
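// Hypothetical only: Spark does not allow d2 to be used inside a transformation on d1.
d1.map { case (k, v) =>
  (k, d2.map { case (k2, v2) => val diff = v - v2; (k2, sqrt(diff dot diff)) }
        .takeOrdered(1)(Ordering.by[(Int, Double), Double](_._2))
        .head._1)
}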
Moreover, if d1 were small, I could work with a Map (e.g. d1.collectAsMap()) and loop over each of its elements, but this is not an option due to the dataset size.
Is there any alternative to this transformation in Spark?
EDIT 1:
Following the suggestions from @holden and @david-griffin, I solved the issue using cartesian() and reduceByKey(). This is the script (assuming sc is the SparkContext and the Breeze library is available).
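import breeze.linalg.{DenseVector, sum}
import breeze.numerics.{pow, sqrt}

val d1 = sc.parallelize(List((1, DenseVector(0.0, 0.0)), (2, DenseVector(1.0, 0.0)), (3, DenseVector(0.0, 1.0))))
val d2 = sc.parallelize(List((1, DenseVector(0.0, 0.75)), (2, DenseVector(0.0, 0.25)), (3, DenseVector(1.0, 1.0)), (4, DenseVector(0.75, 0.0))))
// Pair every element of d1 with every element of d2.
val d1Xd2 = d1.cartesian(d2)
// Key each pair by the d1 ID and keep (d2 ID, Euclidean distance) as the value.
val pairDistances = d1Xd2.map { case ((k1, v1), (k2, v2)) => (k1, (k2, sqrt(sum(pow(v1 - v2, 2))))) }
// For each d1 ID, keep only the candidate with the smallest distance.
val closestPoints = pairDistances.reduceByKey { case (x, y) => if (x._2 < y._2) x else y }
closestPoints.foreach(s => println(s._1 + " -> " + s._2._1))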
The output obtained is:
1 -> 2
2 -> 4
3 -> 1
Transformations on RDDs can only be invoked on the driver side, so nesting one map inside another won't work. As @davidgriffin points out, you can use cartesian. For your use case you probably want to follow that up with reduceByKey, and inside the reduce function you can keep track of the minimum distance.
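To make the "keep track of the minimum distance" step concrete without a cluster, here is a minimal sketch in plain Scala collections that mirrors the cartesian-then-reduce-by-key idea. It is only an illustration under stated assumptions: Vector[Double] stands in for Breeze's DenseVector, and the names ClosestPointSketch, distance, closer and closestIds are hypothetical, not part of Spark or of the original post. The closer function is the kind of function you would pass to reduceByKey.

```scala
// Plain Scala stand-in for the Spark pipeline: no SparkContext or Breeze required.
// All names here are hypothetical and chosen for illustration only.
object ClosestPointSketch {
  type Point = (Int, Vector[Double])

  // Euclidean distance between two equal-length vectors.
  def distance(a: Vector[Double], b: Vector[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Reduce function: of two (id, distance) candidates, keep the closer one.
  def closer(x: (Int, Double), y: (Int, Double)): (Int, Double) =
    if (x._2 < y._2) x else y

  def closestIds(d1: Seq[Point], d2: Seq[Point]): Map[Int, Int] = {
    // Local equivalent of d1.cartesian(d2) followed by the distance map.
    val pairDistances = for {
      (k1, v1) <- d1
      (k2, v2) <- d2
    } yield (k1, (k2, distance(v1, v2)))

    // Local equivalent of reduceByKey(closer), then dropping the distance.
    pairDistances
      .groupBy(_._1)
      .map { case (k1, candidates) => k1 -> candidates.map(_._2).reduce(closer)._1 }
  }
}
```

Calling ClosestPointSketch.closestIds with the sample d1 and d2 data from the question gives Map(1 -> 2, 2 -> 4, 3 -> 1). Because closer is associative and commutative (ties aside), it is safe to use as a reduceByKey function, which combines values in no particular order.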