
Spark Streaming: join back to original stream after updateStateByKey

I am writing an application in Spark Streaming in which I need to calculate the exponential moving average of a double value and add this average to the row. This average is calculated like this:

EMA(t) = EMA(t-1)*0.75 + Value(t)*0.25
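To make the recurrence concrete, here is a plain-Scala sketch (not part of the streaming job) that unrolls it over a sequence of values, seeding the average with the first value as the question's code does:

```scala
object EmaDemo extends App {
  // EMA(t) = EMA(t-1) * 0.75 + Value(t) * 0.25; the first value seeds the average
  def ema(values: Seq[Double]): Double =
    values.tail.foldLeft(values.head)((prev, v) => prev * 0.75 + v * 0.25)

  println(ema(Seq(100.0)))          // 100.0 (seed value)
  println(ema(Seq(100.0, 132.45)))  // 100.0*0.75 + 132.45*0.25 = 108.1125
}
```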

In every interval, one row of data comes in per name:

(name1-24/04/2015 15:31; Observation(name1; 24/04/2015 15:31; 132.45))

(name2-24/04/2015 15:31; Observation(name2; 24/04/2015 15:31; 20.5))

My unique key consists of a name and a timestamp pasted together; the row itself holds the name, the timestamp, and the double value separately. I will keep track of the exponential moving average for each name.

I am doing this using updateStateByKey(), which works perfectly (the name is the key during this operation because I need the average per name):

case class Observation(name: String, time: Timestamp, outcome: Double)

val outcomeDstream: DStream[(String, Double)] = 
    parsedstream.map { case (k: String, obs: Observation) => (obs.name, obs.outcome) }

def updateEMA(newValues: Seq[Double], oldCount: Option[Double]): Option[Double] = {
  if (oldCount.isEmpty) Some(newValues(0))   // first value for this name seeds the EMA
  else Some(newValues(0) * 0.25 + oldCount.get * 0.75)
}

val ema = outcomeDstream.updateStateByKey[Double](updateEMA _)

The problem I have is this: if I use this function to keep track of my exponential moving average, it returns (name, expMovAvg), so I have lost my unique key and the timestamp. This means I can't join the EMA DStream back to my original stream, because my key is now just the name, which is not unique.

Is there a way to keep the unique key or the timestamp during the updateStateByKey transformation?

If I understand your question properly, instead of keeping an Option[Double] as state in updateStateByKey, you can use an Option[Observation] as the state, with the name as the key; it contains all the unique data you need:

val outcomeDstream: DStream[(String, Observation)] = 
    parsedstream.map { case (k: String, obs: Observation) => (obs.name, obs) }

def updateEMA(newValues: Seq[Observation],
              oldState: Option[Observation]): Option[Observation] =
  newValues.headOption match {
    case None      => oldState   // no new value this batch: carry the state forward
    case Some(obs) =>
      // Smooth against the previous EMA (or seed with the raw outcome),
      // keeping the latest timestamp so the unique key survives
      val prev = oldState.map(_.outcome).getOrElse(obs.outcome)
      Some(obs.copy(outcome = obs.outcome * 0.25 + prev * 0.75))
  }
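With the whole Observation kept as state, the unique key can be rebuilt from the result itself, so no join is needed. A sketch (assuming the "name-timestamp" key is reassembled the same way it was originally built):

```scala
val emaStream: DStream[(String, Observation)] =
  outcomeDstream.updateStateByKey[Observation](updateEMA _)

// Rebuild the original "name-timestamp" unique key from the state itself
val keyed: DStream[(String, Observation)] =
  emaStream.map { case (name, obs) => (s"$name-${obs.time}", obs) }
```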

As a side note, if you're using Spark 1.6.0, consider looking into PairDStreamFunctions.mapWithState. Although it has slightly different semantics (it won't process a key which hasn't received a new value) and is still experimental, it is superior in performance.
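For reference, a mapWithState version of the same idea might look roughly like this (a sketch against the Spark 1.6 API, keeping only the running Double as state and re-emitting the full Observation):

```scala
import org.apache.spark.streaming.{State, StateSpec}

// The mapping function receives the key, the new value (if any) and the
// mutable state, and emits the Observation with its smoothed outcome
def emaMapping(name: String, value: Option[Observation],
               state: State[Double]): Option[(String, Observation)] =
  value.map { obs =>
    val updated =
      if (state.exists) obs.outcome * 0.25 + state.get * 0.75
      else obs.outcome                     // first value seeds the EMA
    state.update(updated)
    (name, obs.copy(outcome = updated))
  }

val emaWithState = outcomeDstream.mapWithState(StateSpec.function(emaMapping _))
```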
