
Spark: Mapping elements of an RDD using other elements from the same RDD

Suppose I have this RDD:

val r = sc.parallelize(Array(1,4,2,3))

What I want to do is create a mapping, e.g.:

r.map(x => x + func(all other elements in r))

Is this even possible?

It's very likely that you will get an exception, e.g. the one below.

rdd = sc.parallelize(range(100))
# Fails: rdd itself is referenced inside the map closure
rdd = rdd.map(lambda x: x + sum(rdd.collect()))

i.e. you are trying to broadcast the RDD, which triggers:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

To achieve this you would have to do something like this:

# Aggregate on the driver once, then broadcast the result to the executors
res = sc.broadcast(rdd.reduce(lambda a, b: a + b))
rdd = rdd.map(lambda x: x + res.value)
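
For the Scala RDD from the question, a minimal sketch of the same broadcast approach (assuming a local SparkContext, e.g. the sc that spark-shell already provides) could look like this:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sum").setMaster("local[*]"))
val r = sc.parallelize(Array(1, 4, 2, 3))

// Compute the aggregate once on the driver, then broadcast it to the executors
val total = sc.broadcast(r.reduce(_ + _))

// Each element can now use the precomputed aggregate; use total.value - x
// inside the closure if "all other elements" should exclude x itself
val mapped = r.map(x => x + total.value)
mapped.collect()  // Array(11, 14, 12, 13)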

Spark already supports Gradient Descent (in MLlib). Maybe you can take a look at how they implemented it.
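
The core pattern there is the same one shown above, repeated per iteration: broadcast the current state, aggregate over the RDD with a reduce, update on the driver. A rough sketch of that loop (not MLlib's actual code, just the idea, fitting a trivial model y = w to the data by minimizing squared error; sc and the data are hypothetical, as before):

var w = 0.0
val data = sc.parallelize(Array(1.0, 4.0, 2.0, 3.0))
for (_ <- 1 to 10) {
  val wB = sc.broadcast(w)                              // ship the current weight to the executors
  val grad = data.map(y => wB.value - y).reduce(_ + _)  // gradient of 0.5 * (w - y)^2, summed over the data
  w -= 0.1 * grad                                       // update on the driver
}
// w approaches the mean of the data (2.5)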

I don't know if there is a more efficient alternative, but I would first create a structure like this:

rdd = sc.parallelize([(1, [4, 2, 3]), (4, [1, 2, 3]), (2, [1, 4, 3]), (3, [1, 4, 2])])
rdd = rdd.map(lambda pair: pair[0] + func(pair[1]))
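
In the Scala of the question, a sketch of how such pairs could be built from the original RDD instead of by hand (this uses cartesian, which materializes all n² pairs and assumes distinct elements, so it is only practical for small RDDs):

val r = sc.parallelize(Array(1, 4, 2, 3))

val withOthers = r.cartesian(r)
  .filter { case (x, y) => x != y }  // drop the pairing of each element with itself
  .groupByKey()                      // (x, Iterable of all the other elements)

// Example func: the maximum of the other elements
val result = withOthers.map { case (x, others) => x + others.max }
result.collect()  // Array(5, 7, 6, 7), in some order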
