
RDD, PySpark: why does rdd.flatMap seem to do no work on the CPU?

Here is my code:

In [10]: rdd = sc.mongoPairRDD("mongodb://localhost/stackoverflow.stack")

# ...... A lot of INFO ......

In [11]: newrdd = rdd.flatMap(f)

# No INFO

In [12]: newrdd.collect()
# A lot of INFO

When a method of the RDD is called, say flatMap, the system does not seem to run the function's code. But when I call, say, collect(), the system runs it and collects all the data from memory or disk?

Am I right?

Yes, you are! This is the expected behavior for Spark. There are transformations (e.g. map, flatMap, filter) and actions (e.g. count, collect, reduce, saveAsTextFile) that you can apply to an RDD.

As you noted, when you call a transformation, no computation happens; Spark just adds the operation to the RDD's lineage, a kind of recipe for producing it. But as soon as you call an action then boom, the RDD is actually evaluated. This is what happens when you call collect.
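The "recipe" idea can be sketched with a toy class in plain Python. This is only an illustration of lazy evaluation, not how Spark is implemented: LazyRDD and split_words are hypothetical names made up for this example, and the input list stands in for the MongoDB collection.

```python
class LazyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self, source, steps=()):
        self._source = source        # the input data
        self._steps = tuple(steps)   # recorded transformations: the "recipe"

    def flatMap(self, f):
        # Transformation: only remember f. Nothing is computed here.
        return LazyRDD(self._source, self._steps + (f,))

    def collect(self):
        # Action: run every recorded step over the data now.
        items = list(self._source)
        for f in self._steps:
            items = [out for item in items for out in f(item)]
        return items


calls = []  # records every input that split_words actually receives


def split_words(line):
    # Hypothetical function playing the role of f in the question.
    calls.append(line)
    return line.split()


rdd = LazyRDD(["a b", "c"])
newrdd = rdd.flatMap(split_words)
calls_before_collect = list(calls)  # still empty: split_words has not run

result = newrdd.collect()
calls_after_collect = list(calls)   # split_words ran once per input item

print(calls_before_collect)
print(result)
print(calls_after_collect)
```

In this sketch, flatMap returns immediately because it only stores the function, which mirrors why no INFO lines appear at In [11]; the work (and the logging) happens only when collect() is called.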


 