
Scala Spark map by pairs of RDD elements

What is the best practice for iterating through an RDD in Spark when you need both the previous and the current element? Something like the reduce function, but returning an RDD instead of a single value.

For instance, given:

val rdd = spark.sparkContext.textFile("date_values.txt").
          map(_.split(",")).  // split each line into fields (comma delimiter assumed)
          map {
             case Array(val1, val2, val3) =>
                Element(DateTime.parse(val1), val2.toDouble)
          }

The output should be a new RDD containing the differences between consecutive val2 values:

Diff(date, current.val2 - previous.val2)

With the map function I can only access the current element, and with the reduce function I can only return a single value, not an RDD. I could use the foreach function and save the previous value in a temporary variable, but I don't think that would follow the Scala/Spark guidelines.

What do you think is the most appropriate way to handle this?

The answer given by Dominic Egger in this thread is what I was looking for:

Spark find previous value on each iteration of RDD

import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRdd.sliding(2)  // each window is an Array of 2 consecutive elements
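
Assuming Element and Diff are case classes along the lines of Element(date: DateTime, val2: Double) and Diff(date: DateTime, delta: Double) (the field names here are my guess, since the question doesn't show the definitions), each window can then be mapped into a difference:

// Each window produced by sliding(2) is Array(previous, current)
val diffs = sortedRdd.sliding(2).map {
  case Array(prev, curr) => Diff(curr.date, curr.val2 - prev.val2)
}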

or, using only the core RDD API:

// Key each element by its index, and by its index shifted back by one
val l = sortedRdd.zipWithIndex.map { case (elem, idx) => (idx, elem) }
val r = sortedRdd.zipWithIndex.map { case (elem, idx) => (idx - 1, elem) }
// Joining on the index pairs each element with its successor
val sliding = l.join(r)
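
The joined values are (previous, current) pairs keyed by the previous element's index, so the same mapping applies (again assuming the field names sketched above):

// Drop the index key and compute the difference for each consecutive pair
val diffs = sliding.map {
  case (_, (prev, curr)) => Diff(curr.date, curr.val2 - prev.val2)
}

The sliding version is shorter but depends on spark-mllib; the zipWithIndex/join version stays within the core RDD API at the cost of an extra join shuffle.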
