
How to use mapPartitions in pyspark

After following the Apache Spark documentation, I tried to experiment with the mapPartitions method. In the following code, I expected to get the initial RDD back, since in the function myfunc I just return the iterator after printing a value. But when I collect the RDD, it is empty.

from pyspark import SparkConf
from pyspark import SparkContext

def myfunc(it):
    print(it.next())
    return it

def fun1(sc):
    n = 5
    rdd = sc.parallelize([x for x in range(n+1)], n)
    print(rdd.mapPartitions(myfunc).collect())


if __name__ == "__main__":
    conf = SparkConf().setMaster("local[*]")
    conf = conf.setAppName("TEST2")
    sc = SparkContext(conf = conf)
    fun1(sc)

mapPartitions is not relevant here. Iterators (here itertools.chain) are stateful and can be traversed only once. When you call it.next(), you read and discard the first element, and what you return is the tail of the sequence.
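
You can see the same effect with a plain Python iterator, outside Spark entirely (a minimal sketch, nothing Spark-specific):

it = iter([1, 2, 3])
print(next(it))    # 1 -- reading consumes the element
print(list(it))    # [2, 3] -- only the tail is left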

When a partition has only one item (which should be the case for all but one partition here), you effectively discard the whole partition.
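
If you want to check how the elements are actually distributed, glom() (a standard RDD method that turns each partition into a list) makes the layout visible. The split shown in the comment is what a local parallelize typically produces, but treat it as illustrative:

rdd = sc.parallelize(range(6), 5)
print(rdd.glom().collect())   # e.g. [[0], [1], [2], [3], [4, 5]]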

A few notes:

  • Printing to stdout inside a task is typically useless; on a real cluster the output ends up in the executor logs, not on the driver.
  • The way you call next (it.next()) is Python 2 only and fails in Python 3, where the builtin next(it) should be used instead (see the sketch after this list).
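
If the goal is to look at the first element of each partition without losing it, one option (a sketch, not the only way) is to read it with the builtin next and reattach it with itertools.chain:

from itertools import chain

def myfunc(it):
    # Builtin next() is portable across Python 2 and 3; guard against
    # empty partitions.
    try:
        first = next(it)
    except StopIteration:
        return iter([])
    print(first)   # kept only to mirror the original code; see the note above
    return chain([first], it)   # put the consumed element back in front

With this version nothing is discarded, so rdd.mapPartitions(myfunc).collect() returns all of the original elements.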
