I am writing an application in Spark Streaming in which I need to calculate the exponential moving average of a double value and add this average to the row. This average is calculated like this:
EMA(t) = EMA(t-1)*0.75 + Value(t)*0.25
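For concreteness, a minimal Spark-free sketch of this recurrence (the sample values are made up for illustration):

```scala
// EMA(t) = EMA(t-1) * 0.75 + Value(t) * 0.25
def ema(prev: Double, value: Double): Double =
  prev * 0.75 + value * 0.25

// Seed with the first observation, then fold the rest through the recurrence
val values = Seq(132.45, 130.0, 128.3)
val smoothed = values.tail.foldLeft(values.head)(ema)
println(smoothed)  // ~ 130.953
```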
Every interval I have one row of data coming in per name:
(name1-24/04/2015 15:31; Observation(name1; 24/04/2015 15:31; 132.45))
(name2-24/04/2015 15:31; Observation(name2; 24/04/2015 15:31; 20.5))
My unique key consists of a name and a timestamp concatenated together. I also have the name and the timestamp as separate fields, followed by my double value. I want to keep track of the exponential moving average for each distinct name.
I am doing this using updateStateByKey(), which works perfectly (the name will be the key during this operation because I need the average per name):
case class Observation(name: String, time: Timestamp, outcome: Double)
val outcomeDstream: DStream[(String, Double)] =
  parsedstream.map { case (k: String, obs: Observation) => (obs.name, obs.outcome) }
def updateEMA(newValues: Seq[Double], oldCount: Option[Double]): Option[Double] = {
  if (oldCount.isEmpty) Some(newValues(0))
  else Some(newValues(0) * 0.25 + oldCount.get * 0.75)
}
val ema = outcomeDstream.updateStateByKey[Double](updateEMA _)
The problem I have is this: if I use this function to keep track of my exponential moving average, it returns (name, expMovAvg), but I will have lost my unique key and the timestamp. This means I can't join this EMA DStream with my original stream, because the key is now just the name, which is not unique.
Is there a possibility to keep the unique key or the timestamp during my updateStateByKey transformation?
If I understand your question properly, instead of keeping an Option[Double] as the state in updateStateByKey, you can use an Option[Observation] as the state with the name as the key, which will contain all the unique data you need:
val outcomeDstream: DStream[(String, Observation)] =
parsedstream.map { case (k: String, obs: Observation) => (obs.name, obs) }
def updateEMA(newValues: Seq[Observation],
              oldCount: Option[Observation]): Option[Observation] = {
  if (oldCount.isEmpty) Some(newValues(0))
  else Some(newValues(0).copy(
    outcome = newValues(0).outcome * 0.25 + oldCount.get.outcome * 0.75))
}
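To see that this state function really does preserve the name and the latest timestamp alongside the smoothed value, here is a Spark-free sketch of the same update logic, driven by hand (the timestamps are invented for the example):

```scala
import java.sql.Timestamp

case class Observation(name: String, time: Timestamp, outcome: Double)

// Same state transition as updateEMA above, checked locally without Spark
def updateEMA(newValues: Seq[Observation],
              oldState: Option[Observation]): Option[Observation] = {
  val latest = newValues.head
  oldState match {
    case None       => Some(latest)  // first batch: seed the EMA with the raw value
    case Some(prev) =>               // later batches: blend with the previous EMA
      Some(latest.copy(outcome = latest.outcome * 0.25 + prev.outcome * 0.75))
  }
}

val t1 = Timestamp.valueOf("2015-04-24 15:31:00")
val t2 = Timestamp.valueOf("2015-04-24 15:32:00")

val first  = updateEMA(Seq(Observation("name1", t1, 132.45)), None)
val second = updateEMA(Seq(Observation("name1", t2, 140.0)), first)
println(second)  // name and newest timestamp survive; outcome = 140.0*0.25 + 132.45*0.75
```

Because the state carries the whole Observation, you can rebuild the unique "name-timestamp" key from it at any point downstream.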
As a side note, if you're using Spark 1.6.0, consider looking into PairDStreamFunctions.mapWithState. Although it has slightly different semantics (it won't process a key which hasn't received a new value in the current batch) and is still experimental, it is superior in performance.
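The semantic difference can be illustrated without Spark at all; this toy sketch (keys and values invented) contrasts which keys each approach would visit in a batch:

```scala
// Per-name EMA state tracked so far
val state = Map("name1" -> 131.8, "name2" -> 20.5)
// Incoming batch containing data for name1 only
val batch = Map("name1" -> 132.45)

// updateStateByKey-style: the update function runs for every tracked key
// each interval, whether or not new data arrived for it
val touchedByUpdateStateByKey = state.keySet ++ batch.keySet

// mapWithState-style: only keys present in the current batch are visited
val touchedByMapWithState = batch.keySet

println(touchedByUpdateStateByKey)
println(touchedByMapWithState)
```

With many tracked names and sparse updates, visiting only the keys that actually received data is where the performance gain comes from.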