
Is it possible to create nested RDDs in Apache Spark?

I am trying to implement the K-nearest neighbor algorithm in Spark. I was wondering if it is possible to work with nested RDDs, which would make my life a lot easier. Consider the following code snippet.

public static void main(String[] args) {
    //blah blah code
    JavaRDD<Double> temp1 = testData.map(
        new Function<Vector, Double>() {
            public Double call(final Vector z) throws Exception {
                // Problematic part: a transformation on trainData (an RDD)
                // is attempted inside a transformation on testData,
                // i.e. a nested RDD operation.
                JavaRDD<Double> temp2 = trainData.map(
                    new Function<Vector, Double>() {
                        public Double call(Vector vector) throws Exception {
                            return (double) vector.length();
                        }
                    }
                );
                return (double) z.length();
            }
        }
    );
}

Currently I am getting an error with this nested setup (I can post the full log here). Is it allowed in the first place? Thanks.

No, it is not possible, because the items of an RDD must be serializable and an RDD is not serializable. And this makes sense: otherwise you might transfer a whole RDD over the network, which is a problem if it contains a lot of data. And if it does not contain a lot of data, you might as well, and should, use an array or something like it.

However, I don't know how you are implementing the K-nearest neighbor... but be careful: if you do something like calculating the distance between every pair of points, this is actually not scalable in the dataset size, because it's O(n²).
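For reference, the all-pairs formulation can be expressed without nesting by using cartesian(), but the result still has |testData| × |trainData| elements, which is exactly the O(n²) cost mentioned above. A minimal sketch, assuming trainData and testData are JavaRDD<Vector> of org.apache.spark.mllib.linalg.Vector (the question does not show which Vector type is used):

// Assumed imports for this sketch:
// import scala.Tuple2;
// import org.apache.spark.api.java.JavaRDD;
// import org.apache.spark.api.java.function.Function;
// import org.apache.spark.mllib.linalg.Vector;
// import org.apache.spark.mllib.linalg.Vectors;

// One element per (test point, train point) pair -- n * m in total.
JavaRDD<Double> allPairDistances = testData.cartesian(trainData).map(
    new Function<Tuple2<Vector, Vector>, Double>() {
        public Double call(Tuple2<Vector, Vector> pair) throws Exception {
            // Squared Euclidean distance between the two points.
            return Vectors.sqdist(pair._1(), pair._2());
        }
    }
);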

I ran into a NullPointerException while trying something of this sort, as we can't perform operations on RDDs within an RDD.

Spark doesn't support nesting of RDDs. The reason is that to perform an operation or create a new RDD, the Spark runtime requires access to the SparkContext object, which is available only on the driver machine.

Hence if you want to operate on nested RDDs, you may collect the parent RDD on the driver node and then iterate over its items using an array or something similar.
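A minimal sketch of that approach for the question's setup, again assuming JavaRDD<Vector> of org.apache.spark.mllib.linalg.Vector and a training set small enough to fit on the driver (the names trainLocal and nearestDistances are illustrative):

// Assumed imports for this sketch:
// import java.util.List;
// import org.apache.spark.api.java.JavaRDD;
// import org.apache.spark.api.java.function.Function;
// import org.apache.spark.mllib.linalg.Vector;
// import org.apache.spark.mllib.linalg.Vectors;

// Bring the training set to the driver; the resulting plain List is
// shipped to the executors inside the closure instead of as a nested RDD.
final List<Vector> trainLocal = trainData.collect();

JavaRDD<Double> nearestDistances = testData.map(
    new Function<Vector, Double>() {
        public Double call(Vector z) throws Exception {
            double best = Double.MAX_VALUE;
            for (Vector t : trainLocal) {
                double d = Vectors.sqdist(z, t); // squared Euclidean distance
                if (d < best) {
                    best = d;
                }
            }
            return best; // distance to the nearest training point
        }
    }
);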

Note: the RDD class itself is serializable. Please see below.

[image: RDD class declaration]
