
reduceByKey after returning from a function with return type Either

I need to aggregate 3 different RDDs (in 3 different iterations), on each of which I use map to call a function createKeyValuePair that has return type Either[((Long, Int, Int), A), ((Int, Int, Int, Int), A)]:

sparkSession.sqlContext.read.parquet(inputPathForCut: _*).rdd
      .map(row => createKeyValuePair(row, data, keyElements))
      .reduceByKey((r1,r2) => Utils.aggregateRecords(r1,r2))
      .toDF()
      .coalesce(1)
      .write.format("parquet").option("compression", "gzip")
      .save(OUTPUT_PATH)

But then reduceByKey is not available; the compiler says it cannot resolve reduceByKey.

def createKeyValuePair(row: Row, data: String, elementsInKey: Int) : Either[((Long, Int, Int), A),
                                                                    ((Int, Int, Int, Int), A)] = {
    var keySet = Array[Long]()
    for (i <- 0 to elementsInKey) {
        keySet = keySet :+ row(i).asInstanceOf[Long]
    }
    val record = row(elementsInKey).asInstanceOf[A]
    data match {
        case "string1" => return ... //(first key)
        case "string2" => return ... //(second key)
    }
    null
}

Question 1. How can I use reduceByKey on an RDD returned by a function call that has return type Either?

If I change the function createKeyValuePair to the following,

def createKeyValuePair(row: Row, data: String, elementsInKey: Int) : ((Long*, Int*), A) = {
    var keySet = Array[Long]()
    for (i <- 0 to elementsInKey) {
        keySet = keySet :+ row(i).asInstanceOf[Long]
    }
    val record = row(elementsInKey).asInstanceOf[A]

    ((keySet: _*),record)
}

then reduceByKey works, but the inferred return type is just ((Long, Int), A), and the function itself shows an error as well: the expected return type is (Long, Int), but the actual type is Seq[A].

Question 2. Is it not possible to have a varargs return type in Scala?

Note: The return type and the data on which reduceByKey will be applied have the same schema. I am not trying to apply reduceByKey to data with different schemas. I will first read file 1 and file 2 and aggregate them, which will have the key (Long, Int, Int); then in the 2nd iteration I will read the second file, which will have the key (Int, Int, Int, Int), and aggregate that.

reduceByKey is only available for RDDs of (key, value) pairs, which you don't actually have (because they're wrapped in the Either).

One option is to change from RDD[Either[((Long, Int, Int), A), ((Int, Int, Int, Int), A)]] to RDD[(Either[(Long, Int, Int), (Int, Int, Int, Int)], A)].
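
For illustration, a minimal sketch of that shape; the key tuples mirror the question, while EitherKeyExample, aggregateByEitherKey and the aggregate parameter are hypothetical names standing in for your own code and Utils.aggregateRecords:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object EitherKeyExample {
  type KeyA = (Long, Int, Int)
  type KeyB = (Int, Int, Int, Int)

  // Because the element type is now a Tuple2, Spark's implicit conversion to
  // PairRDDFunctions applies and reduceByKey resolves.
  def aggregateByEitherKey[A: ClassTag](
      rows: RDD[(Either[KeyA, KeyB], A)],
      aggregate: (A, A) => A): RDD[(Either[KeyA, KeyB], A)] =
    rows.reduceByKey(aggregate)
}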

However, I'm not convinced that you should have a single createKeyValuePair function. The only code you actually share between the two cases is building the array of keys. Imagine instead something like

def getKeyElements(row: Row, recordIndex: Int): List[Long] = {
  (0 until recordIndex).map(row.getLong).toList
}

def createKeyValuePairFirstCase(row: Row): ((Long, Int, Int), A) = {
  val first :: second :: third :: _ = getKeyElements(row, 3)
  // Long does not narrow to Int implicitly, so convert explicitly.
  ((first, second.toInt, third.toInt), row.get(3).asInstanceOf[A])
}

and similarly in the second case; a sketch is included below. Note that Scala will not implicitly narrow a Long to an Int, so the conversion into the Int slots of the key has to be explicit (hence the .toInt calls above).
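
For completeness, a sketch of that second case under the same assumptions (four leading key columns as stated in the question, the record in the column after the key, and A still standing in for your record type):

def createKeyValuePairSecondCase(row: Row): ((Int, Int, Int, Int), A) = {
  val first :: second :: third :: fourth :: _ = getKeyElements(row, 4)
  ((first.toInt, second.toInt, third.toInt, fourth.toInt),
    row.get(4).asInstanceOf[A])
}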

A few random notes:

  • We need getKeyElements to return a List to pull out first, second and third like that. You could also just return a Seq and build the tuple using indices (see the sketch after this list).
  • Note the existence of getLong. There's also getInt and getString and so on.
  • It does feel like there ought to be a way to parameterize the type of the key and write a single function, but I don't know what it is.
  • Have you thought about using the new Dataset API? You might be able to make this easier to read and more robust.
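
For the first note above, a sketch of the Seq-based variant, under the same assumptions as the functions earlier (getKeyElementsSeq and createKeyValuePairFirstCaseWithSeq are illustrative names):

def getKeyElementsSeq(row: Row, recordIndex: Int): Seq[Long] =
  (0 until recordIndex).map(row.getLong)

def createKeyValuePairFirstCaseWithSeq(row: Row): ((Long, Int, Int), A) = {
  val keys = getKeyElementsSeq(row, 3)
  // Index into the Seq instead of destructuring a List; narrow explicitly.
  ((keys(0), keys(1).toInt, keys(2).toInt), row.get(3).asInstanceOf[A])
}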
