
Spark: Using mapPartitions with Scala

Let's say I have the following DataFrame:

val randomData = Seq(("a",8),("h",5),("f",3),("a",2),("b",8),("c",3))
val df = sc.parallelize(randomData,2).toDF()

and this function, which will be passed to mapPartitions:

def trialIterator(row:Iterator[(String,Int)]): Iterator[(String,Int)] =
    row.toArray.tail.toIterator

When I call mapPartitions:

df.mapPartitions(trialIterator)

I get the following error message:

Type mismatch, expected: (Iterator[Row]) => Iterator[NotInferedR], actual: (Iterator[(String, Int)]) => Iterator[(String, Int)]

I understand that this happens because of the input and output types of my function, but how do I solve it?

If you want strongly typed input, don't use Dataset[Row] (DataFrame) but Dataset[T], where T in this particular scenario is (String, Int). Also, don't convert to an Array, and don't blindly call tail without knowing whether the partition is empty:

def trialIterator(iter: Iterator[(String, Int)]) = iter.drop(1)

randomData
  .toDS // org.apache.spark.sql.Dataset[(String, Int)]
  .mapPartitions(trialIterator _)

or

randomData.toDF // org.apache.spark.sql.Dataset[Row] 
  .as[(String, Int)] // org.apache.spark.sql.Dataset[(String, Int)]
  .mapPartitions(trialIterator _)
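
For completeness, here is a minimal self-contained sketch of the first approach, assuming a local SparkSession; the object name, session settings, and result variable are illustrative, not from the question:

import org.apache.spark.sql.SparkSession

object MapPartitionsDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session for illustration only
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("mapPartitions-demo")
      .getOrCreate()
    import spark.implicits._

    val randomData = Seq(("a", 8), ("h", 5), ("f", 3), ("a", 2), ("b", 8), ("c", 3))

    // drop(1) removes the first element of each partition and is safe on empty iterators
    def trialIterator(iter: Iterator[(String, Int)]): Iterator[(String, Int)] = iter.drop(1)

    val result = spark.sparkContext
      .parallelize(randomData, 2)   // two partitions, as in the question
      .toDS()                       // Dataset[(String, Int)]
      .mapPartitions(trialIterator _)

    result.show()                   // one element dropped per partition, so 4 rows remain
    spark.stop()
  }
}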

You are expecting type Iterator[(String, Int)] while you should expect Iterator[Row]:

import org.apache.spark.sql.Row

def trialIterator(rows: Iterator[Row]): Iterator[(String, Int)] = {
    // skip the first Row of each partition (drop(1) is safe on an empty partition),
    // then convert the remaining Rows to tuples so the declared return type matches
    rows.drop(1).map(row => (row.getString(0), row.getInt(1)))
}
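
A hedged usage sketch for this version, assuming a SparkSession named spark is in scope and that df and trialIterator are defined as above (the result name trimmed is illustrative):

import spark.implicits._  // supplies the Encoder[(String, Int)] that mapPartitions needs

// apply the Row-based function to the DataFrame from the question;
// each partition loses its first row
val trimmed = df.mapPartitions(trialIterator)
trimmed.show()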
