
Is it possible to create nested RDDs in Apache Spark?

I am trying to implement the K-nearest neighbor algorithm in Spark. I was wondering if it is possible to work with nested RDDs, which would make my life a lot easier. Consider the following code snippet.

public static void main(String[] args) {
    //blah blah code
    JavaRDD<Double> temp1 = testData.map(
        new Function<Vector, Double>() {
            public Double call(final Vector z) throws Exception {
                // Problematic part: a transformation on trainData (an RDD)
                // is attempted inside a transformation on testData,
                // i.e. a nested RDD operation.
                JavaRDD<Double> temp2 = trainData.map(
                    new Function<Vector, Double>() {
                        public Double call(Vector vector) throws Exception {
                            return (double) vector.length();
                        }
                    }
                );
                return (double) z.length();
            }
        }
    );
}

Currently I am getting an error with this nested setup (I can post the full log here). Is it allowed in the first place? Thanks.

No, it is not possible, because the items of an RDD must be serializable and an RDD is not serializable. And this makes sense: otherwise you might transfer a whole RDD over the network, which is a problem if it contains a lot of data. And if it does not contain a lot of data, you might as well, and should, use an array or something like it.

However, I don't know how you are implementing the K-nearest neighbor... but be careful: if you do something like calculating the distance between every pair of points, this is actually not scalable in the dataset size, because it's O(n²).
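For reference, the all-pairs formulation can be expressed without nesting by using cartesian(), but the result still has |testData| × |trainData| elements, which is exactly the O(n²) cost mentioned above. A minimal sketch, assuming trainData and testData are JavaRDD<Vector> of org.apache.spark.mllib.linalg.Vector (the question does not show which Vector type is used):

// Assumed imports for this sketch:
// import scala.Tuple2;
// import org.apache.spark.api.java.JavaRDD;
// import org.apache.spark.api.java.function.Function;
// import org.apache.spark.mllib.linalg.Vector;
// import org.apache.spark.mllib.linalg.Vectors;

// One element per (test point, train point) pair -- n * m in total.
JavaRDD<Double> allPairDistances = testData.cartesian(trainData).map(
    new Function<Tuple2<Vector, Vector>, Double>() {
        public Double call(Tuple2<Vector, Vector> pair) throws Exception {
            // Squared Euclidean distance between the two points.
            return Vectors.sqdist(pair._1(), pair._2());
        }
    }
);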

I ran into a NullPointerException while trying something of this sort, as we can't perform operations on RDDs within an RDD.

Spark doesn't support nesting of RDDs. The reason is that to perform an operation or create a new RDD, the Spark runtime requires access to the SparkContext object, which is available only on the driver machine.

Hence if you want to operate on nested RDDs, you may collect the parent RDD on the driver node and then iterate over its items using an array or something similar.
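A minimal sketch of that approach for the question's setup, again assuming JavaRDD<Vector> of org.apache.spark.mllib.linalg.Vector and a training set small enough to fit on the driver (the names trainLocal and nearestDistances are illustrative):

// Assumed imports for this sketch:
// import java.util.List;
// import org.apache.spark.api.java.JavaRDD;
// import org.apache.spark.api.java.function.Function;
// import org.apache.spark.mllib.linalg.Vector;
// import org.apache.spark.mllib.linalg.Vectors;

// Bring the training set to the driver; the resulting plain List is
// shipped to the executors inside the closure instead of as a nested RDD.
final List<Vector> trainLocal = trainData.collect();

JavaRDD<Double> nearestDistances = testData.map(
    new Function<Vector, Double>() {
        public Double call(Vector z) throws Exception {
            double best = Double.MAX_VALUE;
            for (Vector t : trainLocal) {
                double d = Vectors.sqdist(z, t); // squared Euclidean distance
                if (d < best) {
                    best = d;
                }
            }
            return best; // distance to the nearest training point
        }
    }
);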

Note: the RDD class itself is serializable. Please see below.

[image: RDD class declaration]
