
Spark: How to map an RDD when access to another RDD is required

Given two large key-value pair RDDs (d1 and d2), both composed of unique ID keys and vector values (e.g. RDD[(Int, DenseVector)]), I need to map d1 in order to obtain, for each of its elements, the ID of the closest element in d2, using a euclidean distance metric between the vectors.

I have not found a way to do it using standard RDD transformations. I understand that nested RDDs are not allowed in Spark; however, if it were possible, an easy solution would be:

d1.map { case (k, v) =>
  (k, d2.map { case (k2, v2) => val diff = v - v2; (k2, sqrt(diff dot diff)) }
        .takeOrdered(1)(Ordering.by[(Int, Double), Double](_._2))
        .head._1)
}

Moreover, if d1 were small, I could work with a Map (e.g. d1.collectAsMap()) and loop over each of its elements, but this is not an option due to the dataset size.
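
For reference, this is a minimal sketch of that collect-and-broadcast pattern, shown here with d2 collected so that d1 can be mapped directly; it only applies when the collected RDD fits in driver memory, and the names d2Local and bestId are illustrative:

import breeze.linalg.{DenseVector, sum}
import breeze.numerics.{pow, sqrt}

// Sketch only: assumes d2 is small enough to collect and broadcast.
val d2Local = sc.broadcast(d2.collectAsMap())   // Map[Int, DenseVector[Double]]

val closest = d1.map { case (k1, v1) =>
  val bestId = d2Local.value
    .map { case (id2, v2) => (id2, sqrt(sum(pow(v1 - v2, 2)))) }
    .minBy(_._2)._1                             // ID of the nearest vector in d2
  (k1, bestId)
}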

Is there any alternative to this transformation in Spark?

EDIT 1:

Using @holden's and @david-griffin's suggestions, I solved the issue using cartesian() and reduceByKey(). This is the script (assuming sc is the SparkContext and that the Breeze library is used).

import breeze.linalg.{DenseVector, sum}
import breeze.numerics.{pow, sqrt}

val d1 = sc.parallelize(List((1, DenseVector(0.0, 0.0)), (2, DenseVector(1.0, 0.0)), (3, DenseVector(0.0, 1.0))))
val d2 = sc.parallelize(List((1, DenseVector(0.0, 0.75)), (2, DenseVector(0.0, 0.25)), (3, DenseVector(1.0, 1.0)), (4, DenseVector(0.75, 0.0))))

// All (d1, d2) pairs, then the euclidean distance for each pair, keyed by d1's ID
val d1Xd2 = d1.cartesian(d2)
val pairDistances = d1Xd2.map { case ((k1, v1), (k2, v2)) => (k1, (k2, sqrt(sum(pow(v1 - v2, 2))))) }

// For each d1 ID, keep the d2 entry with the smallest distance
val closestPoints = pairDistances.reduceByKey { case (x, y) => if (x._2 < y._2) x else y }

closestPoints.foreach(s => println(s._1 + " -> " + s._2._1))

The output obtained is:

1 -> 2
2 -> 4
3 -> 1

Transformations on RDDs can only be applied on the driver side, so nesting of maps won't work. As @davidgriffin points out, you can use cartesian. For your use case you probably want to follow that up with reduceByKey, and inside of your reduce by key you can keep track of the minimum distance.
