
Spark Streaming: join back to original stream after UpdateStateByKey

I am writing an application in Spark Streaming in which I need to calculate the exponential moving average of a double value and add this average to the row. This average is calculated like this:

EMA(t) = EMA(t-1)*0.75 + Value(t)*0.25
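For example, if the previous average EMA(t-1) is 130.0 and the new value is 132.45, the new average is 130.0 * 0.75 + 132.45 * 0.25 = 130.6125.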

Every interval I have one row of data coming in per name:

(name1-24/04/2015 15:31; Observation(name1; 24/04/2015 15:31; 132.45))

(name2-24/04/2015 15:31; Observation(name2; 24/04/2015 15:31; 20.5))

My unique key consists of a name and a timestamp pasted together. Then I have the name and the timestamp separately, followed by my double value. I want to keep track of the exponential moving average for each name.

I am doing this using updateStateByKey(), which works well (the name is the key during this operation because I need the average per name):

import java.sql.Timestamp

case class Observation(name: String, time: Timestamp, outcome: Double)

val outcomeDstream: DStream[(String, Double)] =
    parsedstream.map { case (k: String, obs: Observation) => (obs.name, obs.outcome) }

def updateEMA(newValues: Seq[Double], oldCount: Option[Double]): Option[Double] = {
  // First observation for this name: seed the average with the raw value.
  if (oldCount.isEmpty) Some(newValues(0))
  // Otherwise blend the new value into the running average.
  else Some(newValues(0) * 0.25 + oldCount.get * 0.75)
}

val ema = outcomeDstream.updateStateByKey[Double](updateEMA _)

The problem is this: if I use this function to keep track of my exponential moving average, it returns (name, expMovAvg), so I have lost my unique key and the timestamp. That means I can't join this ema DStream back to my original stream, because the key is now just the name, which is not unique. To make the mismatch concrete (a sketch using the sample rows above):
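// ema is keyed by name only:
//   ("name1", 132.45)
// parsedstream is keyed by the pasted-together "name-timestamp" string:
//   ("name1-24/04/2015 15:31", Observation("name1", ..., 132.45))
// so ema.join(parsedstream) finds no matching keys.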

Is there a possibility to keep the unique key or the timestamp during my updateStateByKey transformation?

If I understand your question properly, instead of keeping an Option[Double] in updateStateByKey as state, you can use an Option[Observation] as the state with the name as the key, which will contain all the unique data you need:

val outcomeDstream: DStream[(String, Observation)] = 
    parsedstream.map { case (k: String, obs: Observation) => (obs.name, obs) }

def updateEMA(newValues: Seq[Observation],
              oldCount: Option[Observation]): Option[Observation] = {
  // First observation for this name: use it as the initial state.
  if (oldCount.isEmpty) Some(newValues(0))
  // Otherwise carry over the latest name/timestamp and blend the outcome.
  else Some(newValues(0).copy(
    outcome = newValues(0).outcome * 0.25 + oldCount.get.outcome * 0.75))
}
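With the Observation kept as state, the unique key can be rebuilt after the update, so no join is needed (a minimal sketch; the exact formatting of the pasted-together key depends on how it was originally built):

val emaStream: DStream[(String, Observation)] =
  outcomeDstream.updateStateByKey[Observation](updateEMA _)

// The state still carries the latest timestamp, so the unique
// "name-timestamp" key can be reconstructed directly:
val keyedEma = emaStream.map {
  case (name, obs) => (s"$name-${obs.time}", obs)
}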

As a side note, if you're using Spark 1.6.0, consider looking into PairDStreamFunctions.mapWithState. Although it has slightly different semantics (it won't invoke the update for a key that hasn't received a new value) and is still experimental, it offers significantly better performance.
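For reference, the same EMA logic could look roughly like this with mapWithState (a sketch, assuming Spark 1.6 and the Observation-keyed stream from above):

import org.apache.spark.streaming.{State, StateSpec}

// The state holds the previous smoothed Observation; each batch emits
// the updated Observation, so name, time, and outcome are all preserved.
def emaMapping(name: String,
               newObs: Option[Observation],
               state: State[Observation]): Option[Observation] = {
  newObs.map { obs =>
    val updated = state.getOption() match {
      case Some(prev) =>
        obs.copy(outcome = obs.outcome * 0.25 + prev.outcome * 0.75)
      case None => obs
    }
    state.update(updated)
    updated
  }
}

val emaWithState = outcomeDstream.mapWithState(StateSpec.function(emaMapping _))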
