
Scala Spark execution of RDD contiguous subsets

I've been struggling with this problem for almost four days and I cannot find an efficient solution.

I have an RDD in Spark of the form RDD[(Int, (Date, Double))] (the first value is just an index).

What do you think is the most efficient way in Spark to produce an output RDD where each element is some function applied to every subset of n contiguous elements of the input RDD?

For instance, given the average as the function and n = 5, the result should be:

input:  [1.0, 2.0, 3.0, 2.0, 6.0, 4.0, 3.0, 4.0, 3.0, 2.0]
output: [                    2.8, 3.4, 3.6, 3.8, 4.0, 3.2]

Because:

1.0 + 2.0 + 3.0 + 2.0 + 6.0 = 14.0 / 5 = 2.8
2.0 + 3.0 + 2.0 + 6.0 + 4.0 = 17.0 / 5 = 3.4
3.0 + 2.0 + 6.0 + 4.0 + 3.0 = 18.0 / 5 = 3.6
2.0 + 6.0 + 4.0 + 3.0 + 4.0 = 19.0 / 5 = 3.8
6.0 + 4.0 + 3.0 + 4.0 + 3.0 = 20.0 / 5 = 4.0
4.0 + 3.0 + 4.0 + 3.0 + 2.0 = 16.0 / 5 = 3.2

This would be a very easy problem to solve, but I'm very new to Scala and Spark and I don't know what the best practice would be in this case.

I tried a lot of solutions, including a kind of nested map() approach, but of course Spark does not allow that behaviour. Some of them work but are not very efficient.

What do you think is the best algorithmic way to solve this problem in Scala Spark?

You can use mllib's sliding function:

import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(Seq(1.0, 2.0, 3.0, 2.0, 6.0, 4.0, 3.0, 4.0, 3.0, 2.0))
def average(x: Array[Double]) = x.sum / x.length
rdd.sliding(5).map(average).collect.mkString(", ") // 2.8, 3.4, 3.6, 3.8, 4.0, 3.2
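
If your real RDD is keyed as RDD[(Int, (Date, Double))], a minimal sketch of the same idea (assuming the Int index defines the order, and using plain strings in place of Date purely for illustration) is to sort by the index, project out the Double, and then apply sliding:

import org.apache.spark.mllib.rdd.RDDFunctions._

// Hypothetical keyed input in the question's shape: (index, (date, value)).
// The index is assumed to encode the desired ordering of the elements.
val keyed = sc.parallelize(Seq(
  (0, ("2020-01-01", 1.0)), (1, ("2020-01-02", 2.0)), (2, ("2020-01-03", 3.0)),
  (3, ("2020-01-04", 2.0)), (4, ("2020-01-05", 6.0)), (5, ("2020-01-06", 4.0)),
  (6, ("2020-01-07", 3.0)), (7, ("2020-01-08", 4.0)), (8, ("2020-01-09", 3.0)),
  (9, ("2020-01-10", 2.0))
))

val windowSize = 5

// Sort by the index so the windows are contiguous, keep only the Double,
// then compute the mean of each sliding window of length windowSize.
val averages = keyed
  .sortByKey()
  .map { case (_, (_, value)) => value }
  .sliding(windowSize)
  .map(w => w.sum / w.length)

averages.collect().mkString(", ") // 2.8, 3.4, 3.6, 3.8, 4.0, 3.2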

