
Spark: Why is my UDF not called in rdd map?

I have the following code:

import pickle  # needed for pickle.load / pickle.dump below

def get_general_popularity_count():
    def test(t):
        a = 1  # this is just a random variable for testing
        print "a"
        pickle.dump(a, open("a.p", "wb"))
    count_dict = pickle.load(open("list.p", "rb"))
    # session is an existing SparkSession
    rdd = session.sparkContext.parallelize(count_dict)
    rdd.map(lambda x: test(x))

However, nothing is printed, and pickle didn't save a file either. In fact, I know the UDF was never called, because I once had a syntax error in test(x) and the program never caught it.
So why is my UDF never called? Any help is appreciated.

It is not called because map is a transformation. Unless it is followed by an action, Spark has no reason to execute it at all.

Furthermore, your code is not a good fit for Apache Spark:

  • print outputs data to the standard output of the worker, not the driver.
  • pickle.dump will write to the local file system of the worker and, when executed like this inside map, will overwrite the output over and over again.

You could try RDD.foreach or RDD.saveAsPickleFile instead.
