I need to aggregate 3 different RDDs (in 3 different iterations), on each of which I use map to call a function createKeyValuePair, which has return type Either[((Long, Int, Int), A), ((Int, Int, Int, Int), A)]:
sparkSession.sqlContext.read.parquet(inputPathForCut: _*).rdd
  .map(row => createKeyValuePair(row, data, keyElements))
  .reduceByKey((r1, r2) => Utils.aggregateRecords(r1, r2))
  .toDF()
  .coalesce(1)
  .write.format("parquet").option("compression", "gzip")
  .save(OUTPUT_PATH)
But then reduceByKey is not available; the compiler says it cannot resolve reduceByKey.
def createKeyValuePair(row: Row, data: String, elementsInKey: Int): Either[((Long, Int, Int), A), ((Int, Int, Int, Int), A)] = {
  var keySet = Array[Long]()
  for (i <- 0 until elementsInKey) {
    keySet = keySet :+ row(i).asInstanceOf[Long]
  }
  val record = row(elementsInKey).asInstanceOf[A]
  data match {
    case "string1" => ... // (first key)
    case "string2" => ... // (second key)
  }
}
How can I use reduceByKey on an RDD returned by a function call whose return type is Either? If I change the function createKeyValuePair to the following,
def createKeyValuePair(row: Row, data: String, elementsInKey: Int): ((Long*, Int*), A) = {
  var keySet = Array[Long]()
  for (i <- 0 until elementsInKey) {
    keySet = keySet :+ row(i).asInstanceOf[Long]
  }
  val record = row(elementsInKey).asInstanceOf[A]
  ((keySet: _*), record)
}
then reduceByKey works, but the compiler infers the return type as just ((Long, Int), A), and the function itself shows an error too: the expected return type is (Long, Int), but the actual type is Seq[A].
The data on which reduceByKey is applied will have the same schema; I am not trying to apply reduceByKey to data with different schemas. I will first read file 1 and file 2 and aggregate those, where the key is (Long, Int, Int); then, in the second iteration, I will read the second file, where the key is (Int, Int, Int, Int), and aggregate that.

reduceByKey is only available for RDDs of (key, value) pairs, which you don't actually have here (because they're wrapped in the Either).
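For reference, here is a minimal sketch of why the call fails to resolve (the sample values are made up): reduceByKey lives in PairRDDFunctions, which Spark only attaches, via an implicit conversion, when the RDD's element type is a two-element tuple (K, V).

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ??? // however you obtain your context

// Element type is a pair, so the implicit conversion to
// PairRDDFunctions applies and reduceByKey resolves.
val pairs: RDD[((Long, Int, Int), Int)] =
  sc.parallelize(Seq(((1L, 2, 3), 10), ((1L, 2, 3), 20)))
pairs.reduceByKey(_ + _) // fine

// Element type is an Either, not a pair, so the conversion does
// not apply and reduceByKey simply doesn't exist on this RDD.
val wrapped: RDD[Either[((Long, Int, Int), Int), ((Int, Int, Int, Int), Int)]] =
  sc.parallelize(Seq(Left(((1L, 2, 3), 10))))
// wrapped.reduceByKey(_ + _) // does not compile: cannot resolve reduceByKey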
One option is to change from RDD[Either[((Long, Int, Int), A), ((Int, Int, Int, Int), A)]] to RDD[(Either[(Long, Int, Int), (Int, Int, Int, Int)], A)].
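Here is a minimal sketch of that layout, reusing the sc from the sketch above and substituting Long for the question's A: once the Either itself is the key, the element type is a plain pair and reduceByKey resolves.

import org.apache.spark.rdd.RDD

type Key = Either[(Long, Int, Int), (Int, Int, Int, Int)]

// Stand-in data; in the real job these pairs would come from a
// createKeyValuePair that returns (Either[...], A) instead of
// Either[(..., A), (..., A)].
val keyed: RDD[(Key, Long)] = sc.parallelize(Seq(
  (Left((1L, 2, 3)): Key, 10L),
  (Left((1L, 2, 3)): Key, 20L),
  (Right((1, 2, 3, 4)): Key, 5L)
))

keyed.reduceByKey(_ + _) // resolves: the element type is (Key, Long)

Records with Left keys and records with Right keys can never collide, since a Left never equals a Right, so each schema is still aggregated only with itself.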
However, I'm not convinced that you should have a single createKeyValuePair
function. The only code you actually share between the two cases is building the array of keys. Imagine instead something like
def getKeyElements(row: Row, recordIndex: Int): List[Long] = {
  (0 until recordIndex).map(row.getLong).toList
}
def createKeyValuePairFirstCase(row: Row): ((Long, Int, Int), A) = {
  val first :: second :: third :: _ = getKeyElements(row, 3)
  // Scala does not narrow Long to Int implicitly, so convert explicitly.
  ((first, second.toInt, third.toInt), row.get(3).asInstanceOf[A])
}
and similarly in the second case. Note that Scala widens Int to Long implicitly, but it will not narrow Long to Int, which is why the Int positions need explicit .toInt calls (or could be read with row.getInt in the first place).
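For completeness, a hedged sketch of the second case under the same assumptions (the name createKeyValuePairSecondCase is mine; the four-Int key and the record position follow the question's (Int, Int, Int, Int) key, and A is still the question's placeholder type):

def createKeyValuePairSecondCase(row: Row): ((Int, Int, Int, Int), A) = {
  val first :: second :: third :: fourth :: _ = getKeyElements(row, 4)
  // Same narrowing caveat as above: convert each Long key element explicitly.
  ((first.toInt, second.toInt, third.toInt, fourth.toInt), row.get(4).asInstanceOf[A])
}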
A few random notes:

- getKeyElements returns a List so that we can use pattern matching to pull out first, second and third like that. You could also just return a Seq and build the tuple using indices (see the sketch after this list).
- Note the use of getLong. There's also getInt and getString and so on, so each column can be read with its proper type instead of being cast.
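To illustrate both notes at once, a sketch of that variant (the function name is mine): skip the intermediate List entirely and build the tuple straight from Row's typed accessors.

def createKeyValuePairFirstCaseViaAccessors(row: Row): ((Long, Int, Int), A) = {
  // getLong / getInt read each column with its declared type,
  // so no casting or Long-to-Int narrowing is needed.
  ((row.getLong(0), row.getInt(1), row.getInt(2)), row.get(3).asInstanceOf[A])
}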