Apache Spark: Update global variables in workers

I am curious whether the following simple code will work in a distributed environment (it does work properly in standalone mode):

import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;

public class TestClass {
    private static double[][] testArray = new double[4][];

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            testArray[i] = new double[10];
        }
        ...
        JavaRDD<String> testRDD = sc.textFile("testfile", 4).mapPartitionsWithIndex(
            new Function2<Integer, Iterator<String>, Iterator<String>>() {
                @Override
                public Iterator<String> call(Integer ind, Iterator<String> s) {
                    /* Update testArray[ind] */
                    return s; // pass the partition through unchanged
                }
            }, true
        );
    ...

If it is supposed to work, how does Spark send each worker's portion of testArray back to the master node?

No. It's not supposed to work in a distributed environment.

Variables captured in a closure will be serialized and sent to the workers. The data initially set at the driver will be available to the workers, but any updates made at the worker level will only be accessible in the worker's local scope.

In local mode, the variable lives in the same memory space, so you see the updates; that will not scale to a cluster.
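To make this concrete, here is a hypothetical reduction of the pattern in the question (the class name SharedStateDemo and the counters array are illustrative, not from the original post). A static field is never serialized with the closure: in local mode all tasks run in the driver's JVM and share the one static array, while on a cluster each executor JVM loads its own copy of the class, so the driver's array never changes.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SharedStateDemo {
    // One copy of this array exists per JVM: one on the driver and one in
    // every executor that loads this class.
    private static double[] counters = new double[4];

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SharedStateDemo");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Each task increments the counters array of whichever JVM it runs in.
            sc.parallelize(Arrays.asList(0, 1, 2, 3), 4)
              .foreach(i -> counters[i] += 1.0);

            // local mode: prints 1.0 (tasks ran in the driver's JVM, same array).
            // cluster:    prints 0.0 (the executors updated their own copies).
            System.out.println(counters[0]);
        }
    }
}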

You need to express the computation in terms of RDD operations in order to collect the results.
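For example, one way to restructure the code in the question, assuming each partition computes one row of testArray (the row computation itself is left as a placeholder): have mapPartitionsWithIndex emit (index, row) pairs and rebuild the array on the driver with collect().

import java.util.Collections;
import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;

import scala.Tuple2;

...
JavaRDD<Tuple2<Integer, double[]>> rows = sc.textFile("testfile", 4)
    .mapPartitionsWithIndex(
        new Function2<Integer, Iterator<String>, Iterator<Tuple2<Integer, double[]>>>() {
            @Override
            public Iterator<Tuple2<Integer, double[]>> call(Integer ind, Iterator<String> lines) {
                double[] row = new double[10];
                while (lines.hasNext()) {
                    String line = lines.next();
                    // fill row from this partition's lines (placeholder)
                }
                // Emit the finished row together with its partition index.
                return Collections.singletonList(new Tuple2<>(ind, row)).iterator();
            }
        }, true);

// Back on the driver: collect the per-partition rows and rebuild the array.
double[][] testArray = new double[4][];
for (Tuple2<Integer, double[]> t : rows.collect()) {
    testArray[t._1()] = t._2();
}

Note that collect() ships the rows to the driver, so this pattern assumes the assembled array fits comfortably in driver memory.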
