RDD, PySpark: Why does rdd.flatMap seem not to do any work on the CPU?

Question

Here is my code:

In [10]: rdd = sc.mongoPairRDD("mongodb://localhost/stackoverflow.stack")

# A lot of INFO

In [11]: newrdd = rdd.flatMap(f)

# No INFO

In [12]: newrdd.collect()
# A lot of INFO

When I call a method of the RDD, say flatMap, the system does not seem to run the function's code. But when I call, say, collect(), the system runs it and collects all the data from memory or disk?

Am I right?

Answer 1

Yes, you are! This is the expected behavior for Spark. There are transformations (e.g. map, flatMap, filter) and actions (e.g. count, collect, reduce, saveAsTextFile) that you can apply to an RDD.

As you noted, when you call a transformation, no computation happens; Spark just records the operation on the RDD, building a kind of recipe for producing it. But as soon as you call an action, boom, the RDD is actually evaluated. This is what happens when you call collect.
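To make the "recipe" idea concrete, here is a minimal toy model in plain Python. It is not Spark and not how Spark is implemented; the LazyDataset class, its flat_map method, and the calls list are hypothetical names invented for this sketch. It only shows that a transformation can record a step without running f, and that the action is what triggers the work.

```python
# Toy model of lazy evaluation; this is NOT Spark's implementation.
from itertools import chain


class LazyDataset:
    """Hypothetical stand-in for an RDD: holds data plus a list of pending steps."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # the "recipe": applied only when an action runs

    def flat_map(self, f):
        # Transformation: record the step and return a new dataset.
        # f is not called here.
        def step(items):
            return chain.from_iterable(f(x) for x in items)

        return LazyDataset(self._data, self._steps + (step,))

    def collect(self):
        # Action: run every recorded step now and materialize the result.
        items = iter(self._data)
        for step in self._steps:
            items = step(items)
        return list(items)


calls = []  # records each invocation of f so we can see when it runs


def f(line):
    calls.append(line)
    return line.split()


ds = LazyDataset(["a b", "c"])
new_ds = ds.flat_map(f)
calls_before_collect = len(calls)  # f has not run yet
result = new_ds.collect()
calls_after_collect = len(calls)   # f ran once per input element
```

In the session from the question, rdd.flatMap(f) plays the role of flat_map here (nothing is logged), and newrdd.collect() plays the role of collect (the job runs and the INFO lines appear).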

solution1
1 2016-06-19 12:46:58
