What is the best practice for iterating through an RDD in Spark while accessing both the previous and the current element? It should work like the reduce function, but return an RDD instead of a single value.
For instance, given:
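val rdd = spark.sparkContext.textFile("date_values.txt")
  // textFile yields one String per line, so split it before matching on Array;
  // the comma delimiter is an assumption about the file format
  .map(_.split(","))
  .map {
    case Array(val1, val2, val3) =>
      Element(DateTime.parse(val1), val2.toDouble)
  }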
The output should be a new RDD with the differences in val2 attributes:
Diff(date, current.val2 - previous.val2)
With the map function I can only get the current element, and with the reduce function I can only return one element, not an RDD. I could use the foreach function and save the previous value in a temporary variable, but I don't think this would follow the Scala/Spark guidelines.
What do you think is the most appropriate way to handle this?
The answer given by Dominic Egger in the thread "Spark find previous value on each iteration of RDD" is what I was looking for:
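// Brings sliding into scope on RDDs through an implicit conversion
import org.apache.spark.mllib.rdd.RDDFunctions._
// Each window is an Array(previous, current)
sortedRdd.sliding(2)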
or, without relying on the Developer API (RDDFunctions is marked as one):
val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))
val r = sortedRdd.zipWithIndex.map(kv => (kv._2-1, kv._1))
val sliding = l.join(r)
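
To get from either snippet to the Diff RDD asked for in the question, one more map step is needed. Here is a minimal sketch of both variants, under a few assumptions: the Element and Diff case classes are hypothetical stand-ins (using java.time.LocalDate rather than the DateTime from the question), the input is sorted by date first, and the spark-mllib artifact is on the classpath for the sliding variant.

```scala
import java.time.LocalDate

import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical record types, assumed for illustration only
case class Element(date: LocalDate, val2: Double)
case class Diff(date: LocalDate, delta: Double)

object ConsecutiveDiffs {

  // Variant 1: sliding(2) yields each pair of neighbours as Array(previous, current)
  def withSliding(sorted: RDD[Element]): RDD[Diff] =
    sorted.sliding(2).map {
      case Array(prev, cur) => Diff(cur.date, cur.val2 - prev.val2)
    }

  // Variant 2: key one copy by index i and another by i - 1, then join.
  // Key i then holds (element i, element i + 1), i.e. (previous, current).
  def withJoin(sorted: RDD[Element]): RDD[Diff] = {
    val l = sorted.zipWithIndex.map { case (e, i) => (i, e) }
    val r = sorted.zipWithIndex.map { case (e, i) => (i - 1, e) }
    l.join(r)
      .sortByKey() // the join does not preserve order, so restore it
      .values
      .map { case (prev, cur) => Diff(cur.date, cur.val2 - prev.val2) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("consecutive-diffs")
      .getOrCreate()

    // Made-up sample data, deliberately out of order
    val sorted = spark.sparkContext
      .parallelize(Seq(
        Element(LocalDate.parse("2020-01-03"), 7.0),
        Element(LocalDate.parse("2020-01-01"), 5.0),
        Element(LocalDate.parse("2020-01-02"), 8.0)))
      .sortBy(_.date.toEpochDay)

    withSliding(sorted).collect().foreach(println)
    withJoin(sorted).collect().foreach(println)

    spark.stop()
  }
}
```

Both variants produce one Diff fewer than there are input elements, because the first element has no predecessor. The join variant shuffles the data and needs the extra sortByKey if the output order matters, so sliding is probably the simpler choice when depending on MLlib is acceptable.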