
Spark: Mapping elements of an RDD using other elements from the same RDD

Suppose I have this RDD:

val r = sc.parallelize(Array(1,4,2,3))

What I want to do is create a mapping, e.g.:

r.map(x => x + func(all other elements in r))

Is this even possible?

It's very likely that you will get an exception, e.g. the one below.

rdd = sc.parallelize(range(100))
# Fails: rdd itself is referenced inside the map closure
rdd = rdd.map(lambda x: x + sum(rdd.collect()))

i.e. you are trying to broadcast the RDD, which triggers:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

To achieve this you would have to do something like this:

# Aggregate on the driver once, then broadcast the result to the executors
res = sc.broadcast(rdd.reduce(lambda a, b: a + b))
rdd = rdd.map(lambda x: x + res.value)
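
For the Scala RDD from the question, a minimal sketch of the same broadcast approach (assuming a local SparkContext, e.g. the sc that spark-shell already provides) could look like this:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sum").setMaster("local[*]"))
val r = sc.parallelize(Array(1, 4, 2, 3))

// Compute the aggregate once on the driver, then broadcast it to the executors
val total = sc.broadcast(r.reduce(_ + _))

// Each element can now use the precomputed aggregate; use total.value - x
// inside the closure if "all other elements" should exclude x itself
val mapped = r.map(x => x + total.value)
mapped.collect()  // Array(11, 14, 12, 13)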

Spark already supports Gradient Descent (in MLlib). Maybe you can take a look at how they implemented it.
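
The core pattern there is the same one shown above, repeated per iteration: broadcast the current state, aggregate over the RDD with a reduce, update on the driver. A rough sketch of that loop (not MLlib's actual code, just the idea, fitting a trivial model y = w to the data by minimizing squared error; sc and the data are hypothetical, as before):

var w = 0.0
val data = sc.parallelize(Array(1.0, 4.0, 2.0, 3.0))
for (_ <- 1 to 10) {
  val wB = sc.broadcast(w)                              // ship the current weight to the executors
  val grad = data.map(y => wB.value - y).reduce(_ + _)  // gradient of 0.5 * (w - y)^2, summed over the data
  w -= 0.1 * grad                                       // update on the driver
}
// w approaches the mean of the data (2.5)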

I don't know if there is a more efficient alternative, but I would first create a structure like this:

rdd = sc.parallelize([(1, [4, 2, 3]), (4, [1, 2, 3]), (2, [1, 4, 3]), (3, [1, 4, 2])])
rdd = rdd.map(lambda pair: pair[0] + func(pair[1]))
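
In the Scala of the question, a sketch of how such pairs could be built from the original RDD instead of by hand (this uses cartesian, which materializes all n² pairs and assumes distinct elements, so it is only practical for small RDDs):

val r = sc.parallelize(Array(1, 4, 2, 3))

val withOthers = r.cartesian(r)
  .filter { case (x, y) => x != y }  // drop the pairing of each element with itself
  .groupByKey()                      // (x, Iterable of all the other elements)

// Example func: the maximum of the other elements
val result = withOthers.map { case (x, others) => x + others.max }
result.collect()  // Array(5, 7, 6, 7), in some order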
