Working with Spark Partitions

I am new to Spark and have some questions related to Spark RDD operations and creation:

val rdd1 = sc.parallelize(List("yellow", "red", "blue", "cyan", "black"), 3)
val mapped = rdd1.mapPartitionsWithIndex { (index, iterator) =>
  println("Called in Partition -> " + index)
  val myList = iterator.toList
  myList.map(x => x + " -> " + index).iterator
}

What is the use of the .iterator at the end of the above code? Does it convert the list to an iterator? Isn't a list itself an iterator? Why do we need this operation at the end?
Also, why is this faster than a normal map() function? Isn't it just another way of working element by element, since each element in a partition is again processed by the map(x => x + " -> " + index) function?

Another thing, I want to create an RDD by reading a file 4 lines at a time. I have the following code in Scala:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat

val hconf = new org.apache.hadoop.conf.Configuration
hconf.set("mapreduce.input.lineinputformat.linespermap", "4")
val line = sc.newAPIHadoopFile(inputFile, classOf[NLineInputFormat], classOf[LongWritable], classOf[Text], hconf)
  .map(_._2.toString)
line.take(1).foreach(println)

But the output still prints only one line. Since I have set hconf to read 4 lines at a time, shouldn't each element of the RDD contain 4 lines of inputFile? So shouldn't it output four lines?

Why use .iterator?

The function argument to mapPartitions is:

f: Iterator[T] => Iterator[U]

The code you pasted turns each iterator into a list for processing, and needs to turn it back into an iterator at the end of the closure so that the closure typechecks. Spark operations generally prefer to stream through data rather than hold all of it in memory at once, and requiring that partitions be processed as Iterators is part of that model.

Regarding your "a list is an iterator" assertion: it's not quite true. While a List is an Iterable, it's not an Iterator. Iterators are special in that they can be consumed only once, so they don't support many standard Scala collection operations. The key difference between Iterator and Iterable is this "one shot" model: an Iterable[T] can produce a fresh Iterator[T] as many times as you need, but if you only have an Iterator[T] you can only traverse it once.
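You can see the "one shot" behavior in plain Scala, no Spark needed. A quick sketch:

```scala
// An Iterable (here, a List) can hand out a fresh iterator every time.
val iterable: Iterable[Int] = List(1, 2, 3)
val sum1 = iterable.iterator.sum   // a new iterator, fully consumed
val sum2 = iterable.iterator.sum   // another new iterator, same result

// An Iterator is one-shot: once consumed, it is exhausted.
val it: Iterator[Int] = List(1, 2, 3).iterator
val first  = it.sum                // consumes the iterator
val second = it.sum                // iterator already spent: sums nothing

println(s"$sum1 $sum2 $first $second")
```

Both sums over the Iterable give 6, but the second pass over the exhausted Iterator yields 0, because there is nothing left to traverse.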

A more efficient List-free implementation

The code you pasted is quite inefficient: it copies all the data into a list, then produces an iterator from that list. You can just map the iterator directly:

val rdd1 = sc.parallelize(List("yellow", "red", "blue", "cyan", "black"), 3)
val mapped = rdd1.mapPartitionsWithIndex { (index, iterator) =>
  println("Called in Partition -> " + index)
  iterator.map(x => x + " -> " + index)
}
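The streaming point is visible outside Spark too: Iterator.map is lazy, so elements are transformed one at a time as they are pulled, and no intermediate List is ever built. A small sketch (plain Scala, the "-> 0" suffix just mimics the partition-index tagging above):

```scala
// Count how many elements have actually been transformed.
var touched = 0
val source = Iterator("yellow", "red", "blue")
val mapped = source.map { x => touched += 1; x + " -> 0" }

val before = touched       // map has not run yet: still 0
val result = mapped.toList // consuming drives the computation
val after  = touched       // now all 3 elements were processed

println(s"before=$before after=$after result=$result")
```

Nothing runs at the point of the map call; the work happens only when the downstream consumer (here, toList; in Spark, whatever consumes the partition) pulls elements through.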

Lines per map

I think you might be setting the wrong config parameter here. See this question for a possible solution.
