
Spark: Using mapPartitions with Scala

Let's say I have the following DataFrame:

val randomData = Seq(("a",8),("h",5),("f",3),("a",2),("b",8),("c",3))
val df = sc.parallelize(randomData,2).toDF()

and this function, which will be passed to mapPartitions:

def trialIterator(row:Iterator[(String,Int)]): Iterator[(String,Int)] =
    row.toArray.tail.toIterator

When I call mapPartitions:

df.mapPartitions(trialIterator)

I get the following error message:

Type mismatch, expected: (Iterator[Row]) => Iterator[NotInferedR], actual: (Iterator[(String, Int)]) => Iterator[(String, Int)]

I understand that this happens because of the input and output types of my function, but how do I solve it?

If you want strongly typed input, don't use Dataset[Row] (DataFrame) but Dataset[T], where T in this particular scenario is (String, Int). Also, don't convert to an Array, and don't blindly call tail without knowing whether the partition is empty:

def trialIterator(iter: Iterator[(String, Int)]) = iter.drop(1)

randomData
  .toDS // org.apache.spark.sql.Dataset[(String, Int)]
  .mapPartitions(trialIterator _)

or

randomData.toDF // org.apache.spark.sql.Dataset[Row] 
  .as[(String, Int)] // org.apache.spark.sql.Dataset[(String, Int)]
  .mapPartitions(trialIterator _)
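
For completeness, here is a minimal self-contained sketch of the first approach, assuming a local SparkSession; the object name, session settings, and result variable are illustrative, not from the question:

import org.apache.spark.sql.SparkSession

object MapPartitionsDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session for illustration only
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("mapPartitions-demo")
      .getOrCreate()
    import spark.implicits._

    val randomData = Seq(("a", 8), ("h", 5), ("f", 3), ("a", 2), ("b", 8), ("c", 3))

    // drop(1) removes the first element of each partition and is safe on empty iterators
    def trialIterator(iter: Iterator[(String, Int)]): Iterator[(String, Int)] = iter.drop(1)

    val result = spark.sparkContext
      .parallelize(randomData, 2)   // two partitions, as in the question
      .toDS()                       // Dataset[(String, Int)]
      .mapPartitions(trialIterator _)

    result.show()                   // one element dropped per partition, so 4 rows remain
    spark.stop()
  }
}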

You are expecting type Iterator[(String, Int)] while you should expect Iterator[Row]:

import org.apache.spark.sql.Row

def trialIterator(rows: Iterator[Row]): Iterator[(String, Int)] = {
    // skip the first Row of each partition (drop(1) is safe on an empty partition),
    // then convert the remaining Rows to tuples so the declared return type matches
    rows.drop(1).map(row => (row.getString(0), row.getInt(1)))
}
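
A hedged usage sketch for this version, assuming a SparkSession named spark is in scope and that df and trialIterator are defined as above (the result name trimmed is illustrative):

import spark.implicits._  // supplies the Encoder[(String, Int)] that mapPartitions needs

// apply the Row-based function to the DataFrame from the question;
// each partition loses its first row
val trimmed = df.mapPartitions(trialIterator)
trimmed.show()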
