Spark Streaming groupByKey and updateStateByKey implementation

I am trying to run stateful Spark Streaming computations over (fake) Apache web server logs read from Kafka. The goal is to "sessionize" the web traffic, similar to this blog post.

The only difference is that I want to "sessionize" each page the IP hits, instead of the entire session. I was able to do this reading from a file of fake web traffic using Spark in batch mode, but now I want to do it in a streaming context.

Log files are read from Kafka and parsed into K/V pairs of (String, (String, Long, Long)), or

(IP, (requestPage, time, time)).
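For context, such pairs might be produced from the Kafka stream like this; a minimal sketch assuming Spark 1.x's KafkaUtils receiver API, an existing StreamingContext named ssc, and a hypothetical parseLogLine helper (none of which are spelled out in the original post):

    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.KafkaUtils

    // zkQuorum, consumer group, and topic map are placeholder values
    val lines = KafkaUtils
      .createStream(ssc, "localhost:2181", "log-consumer", Map("weblogs" -> 1))
      .map(_._2) // keep only the message (the raw log line), drop the Kafka key

    // parseLogLine is a hypothetical (String) => (String, String, Long) helper
    // that extracts the client IP, requested page, and request time in ms
    val ipTimeStamp: DStream[(String, (String, Long, Long))] =
      lines.map(parseLogLine)
           .map { case (ip, page, time) => (ip, (page, time, time)) }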

I then call groupByKey() on this K/V pair. In batch mode, this would produce a:

(String, CollectionBuffer((String, Long, Long), ...)) or

(IP, CollectionBuffer((requestPage, time, time), ...))
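For reference, the batch-mode equivalent might look roughly like this; a sketch reusing the hypothetical parseLogLine helper from above, where sc is a SparkContext and the input path is a placeholder:

    // read the fake web traffic file, key each hit by IP, then group per IP
    val batchGrouped = sc.textFile("fake_weblogs.txt")
      .map(parseLogLine)
      .map { case (ip, page, time) => (ip, (page, time, time)) }
      .groupByKey()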

In a StreamingContext, it produces a:

(String, ArrayBuffer((String, Long, Long), ...)), like so:

(183.196.254.131,ArrayBuffer((/test.php,1418849762000,1418849762000)))

However, as the next microbatch (DStream) arrives, this information is discarded.

Ultimately what I want is for that ArrayBuffer to fill up over time as a given IP continues to interact, and to run some computations on its data to "sessionize" the page time.

I believe the operator to make that happen is "updateStateByKey". I'm having some trouble with this operator (I'm new to both Spark & Scala); any help is appreciated.

Thus far:

val grouped = ipTimeStamp.groupByKey().updateStateByKey(updateGroupByKey)

def updateGroupByKey(
    a: Seq[(String, ArrayBuffer[(String, Long, Long)])],
    b: Option[(String, ArrayBuffer[(String, Long, Long)])]
): Option[(String, ArrayBuffer[(String, Long, Long)])] = {

}

I think you are looking for something like this:

import scala.collection.mutable.ArrayBuffer

def updateGroupByKey(
    newValues: Seq[(String, ArrayBuffer[(String, Long, Long)])],
    currentValue: Option[(String, ArrayBuffer[(String, Long, Long)])]
): Option[(String, ArrayBuffer[(String, Long, Long)])] = {
  // Collect the buffers arriving in this batch
  val buffs: Seq[ArrayBuffer[(String, Long, Long)]] = for (v <- newValues) yield v._2
  // Prepend the buffer held in the existing state, if any
  // (+: works on any Seq; :: is defined only on List)
  val buffs2 = if (currentValue.isEmpty) buffs else currentValue.get._2 +: buffs
  // No data at all: drop the state; otherwise merge every buffer into one
  if (buffs2.isEmpty) None else {
    val key = if (currentValue.isEmpty) newValues.head._1 else currentValue.get._1
    Some((key, buffs2.foldLeft(new ArrayBuffer[(String, Long, Long)])((v, a) => v ++ a)))
  }
}
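One caveat not shown above: updateStateByKey requires checkpointing to be enabled on the StreamingContext before the stateful DStream runs, otherwise Spark Streaming fails at startup. A minimal sketch, with a placeholder directory:

    // stateful transformations need a checkpoint directory for fault tolerance
    ssc.checkpoint("/tmp/spark-checkpoint") // placeholder path
    val grouped = ipTimeStamp.groupByKey().updateStateByKey(updateGroupByKey)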

Gabor's answer got me started down the right path, but here is an answer that produces the expected output.

First, for the output I want:

(100.40.49.235,List((/,1418934075000,1418934075000), (/,1418934105000,1418934105000), (/contactus.html,1418934174000,1418934174000)))

I don't need groupByKey(). updateStateByKey already accumulates the values into a Seq, so adding groupByKey on top is unnecessary (and expensive); Spark users strongly suggest not using groupByKey anyway.

Here is the code that worked:

def updateValues(
    newValues: Seq[(String, Long, Long)],
    currentValue: Option[Seq[(String, Long, Long)]]
): Option[Seq[(String, Long, Long)]] = {
  // Append this batch's hits to the state accumulated so far
  // (empty the first time a key is seen)
  Some(currentValue.getOrElse(Seq.empty) ++ newValues)
}


val grouped = ipTimeStamp.updateStateByKey(updateValues)

Here updateStateByKey is passed a function (updateValues) that receives the values arriving in the current microbatch (newValues) along with an Option holding the state accumulated so far (currentValue), and returns the combination of the two. getOrElse is required because currentValue is empty the first time a key appears. Credit to https://twitter.com/granturing for the correct code.
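To then "sessionize" the page time from the accumulated state, one possible follow-up (my own sketch, not part of the original answer) is to reduce each IP's hits to a first-seen/last-seen window per page:

    // For each IP, group its accumulated hits by page and take the earliest
    // and latest timestamps; their difference approximates the time on page
    val sessionized = grouped.mapValues { hits =>
      hits.groupBy(_._1).map { case (page, visits) =>
        val firstSeen = visits.map(_._2).min
        val lastSeen  = visits.map(_._3).max
        (page, firstSeen, lastSeen, lastSeen - firstSeen)
      }.toSeq
    }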
