
How to use mapPartitions in pyspark

After following the Apache Spark documentation, I tried to experiment with the mapPartitions method. In the following code, I expected to get the initial RDD back, since in the function myfunc I just return the iterator after printing a value. But when I collect the RDD, it is empty.

from pyspark import SparkConf
from pyspark import SparkContext

def myfunc(it):
    print(it.next())
    return it

def fun1(sc):
    n = 5
    rdd = sc.parallelize([x for x in range(n+1)], n)
    print(rdd.mapPartitions(myfunc).collect())


if __name__ == "__main__":
    conf = SparkConf().setMaster("local[*]")
    conf = conf.setAppName("TEST2")
    sc = SparkContext(conf = conf)
    fun1(sc)

mapPartitions is not relevant here. Iterators (here itertools.chain) are stateful and can be traversed only once. When you call it.next(), you read and discard the first element, and what you return is the tail of the sequence.
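
You can see the same effect with a plain Python iterator, outside Spark entirely (a minimal sketch, nothing Spark-specific):

it = iter([1, 2, 3])
print(next(it))    # 1 -- reading consumes the element
print(list(it))    # [2, 3] -- only the tail is left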

When a partition has only one item (which should be the case for all but one partition here), you effectively discard the whole partition.
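
If you want to check how the elements are actually distributed, glom() (a standard RDD method that turns each partition into a list) makes the layout visible. The split shown in the comment is what a local parallelize typically produces, but treat it as illustrative:

rdd = sc.parallelize(range(6), 5)
print(rdd.glom().collect())   # e.g. [[0], [1], [2], [3], [4, 5]]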

A few notes:

  • Printing to stdout inside a task is typically useless; on a real cluster the output ends up in the executor logs, not on the driver.
  • The way you call next (it.next()) is Python 2 only and fails in Python 3, where the builtin next(it) should be used instead (see the sketch after this list).
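
If the goal is to look at the first element of each partition without losing it, one option (a sketch, not the only way) is to read it with the builtin next and reattach it with itertools.chain:

from itertools import chain

def myfunc(it):
    # Builtin next() is portable across Python 2 and 3; guard against
    # empty partitions.
    try:
        first = next(it)
    except StopIteration:
        return iter([])
    print(first)   # kept only to mirror the original code; see the note above
    return chain([first], it)   # put the consumed element back in front

With this version nothing is discarded, so rdd.mapPartitions(myfunc).collect() returns all of the original elements.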
