
Spark: Why is my UDF not called in rdd map?

I have the following code:

import pickle  # needed for pickle.load / pickle.dump below

def get_general_popularity_count():
    def test(t):
        a = 1  # this is just a random variable for testing
        print "a"
        pickle.dump(a, open("a.p", "wb"))
    count_dict = pickle.load(open("list.p", "rb"))
    # session is an existing SparkSession
    rdd = session.sparkContext.parallelize(count_dict)
    rdd.map(lambda x: test(x))

However, nothing is printed, and pickle didn't save a file either. In fact, I know the UDF was never called, because I once had a syntax error in test(x) and the program never caught it.
So why is my UDF never called? Any help is appreciated.

It is not called because map is a transformation. Unless it is followed by an action, Spark has no reason to execute it at all.

Furthermore, your code is not a good fit for Apache Spark:

  • print outputs data to the standard output of the worker, not the driver.
  • pickle.dump will write to the local file system of the worker and, when executed like this inside map, will overwrite the output over and over again.

You could try RDD.foreach or RDD.saveAsPickleFile instead.
