Spark Streaming: implementing groupByKey and updateStateByKey

Question

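I am trying to run a stateful Spark Streaming computation over (fake) Apache web server logs read from Kafka. The goal is to "sessionize" the web traffic, similar to this blog post.

The only difference is that I want to "sessionize" each page an IP hits, rather than the whole session. I was able to do this in batch mode with Spark, reading from a file of fake web traffic, but now I want to do it in a streaming context.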

The log lines are read from Kafka and parsed into K/V pairs of (String, (String, Long, Long)) or

(IP, (requestPage, time, time)).

I then call groupByKey() on these K/V pairs. In batch mode, this produces:

(String, CollectionBuffer((String, Long, Long), ...)) or

(IP, CollectionBuffer((requestPage, time, time), ...))

In a StreamingContext, it produces:

(String, ArrayBuffer((String, Long, Long), ...)) like so:

(183.196.254.131,ArrayBuffer((/test.php,1418849762000,1418849762000)))

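However, when the next microbatch (DStream) arrives, this information is discarded.

Ultimately, what I want is for that ArrayBuffer to fill up over time as a given IP continues to interact, and to run some computations on its data to "sessionize" the page time.

I believe the operator that makes this happen is updateStateByKey. I am having some trouble with this operator (I am new to both Spark and Scala);

any help is appreciated.

Thus far: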

val grouped = ipTimeStamp.groupByKey().updateStateByKey(updateGroupByKey)

def updateGroupByKey(
                      a: Seq[(String, ArrayBuffer[(String, Long, Long)])],
                      b: Option[(String, ArrayBuffer[(String, Long, Long)])]
                      ): Option[(String, ArrayBuffer[(String, Long, Long)])] = {

}

Answer 1

I think you are looking for something like this:

 import scala.collection.mutable.ArrayBuffer

 def updateGroupByKey(
                          newValues: Seq[(String, ArrayBuffer[(String, Long, Long)])],
                          currentValue: Option[(String, ArrayBuffer[(String, Long, Long)])]
                          ): Option[(String, ArrayBuffer[(String, Long, Long)])] = {
     // Collect the buffers that arrived in this batch
     val buffs: Seq[ArrayBuffer[(String, Long, Long)]] = (for (v <- newValues) yield v._2)
     // Prepend the existing state, if any (+: works on any Seq, whereas :: is List-only)
     val buffs2 = if (currentValue.isEmpty) buffs else currentValue.get._2 +: buffs
     // Merge everything into a single buffer; returning None keeps no state for the key
     if (buffs2.isEmpty) None else {
        val key = if (currentValue.isEmpty) newValues(0)._1 else currentValue.get._1
        Some((key, buffs2.foldLeft(new ArrayBuffer[(String, Long, Long)])((v, a) => v ++ a)))
     }
  }
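
To see how an update function of this shape carries state from one batch to the next, here is a minimal plain-Scala sketch that runs without Spark. It follows the same idea as the function above but is written independently; the object name Answer1Sketch, the type aliases, and the second batch (a hit on /index.html) are assumptions made up for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object Answer1Sketch {
  type Hit = (String, Long, Long)         // (requestPage, time, time)
  type State = (String, ArrayBuffer[Hit]) // (IP, accumulated hits)

  // Merge the previous state (if any) with the buffers from the current batch.
  def updateGroupByKey(newValues: Seq[State], currentValue: Option[State]): Option[State] = {
    val buffs = currentValue.map(_._2).toList ++ newValues.map(_._2)
    if (buffs.isEmpty) None
    else {
      val key = currentValue.map(_._1).getOrElse(newValues.head._1)
      Some((key, buffs.foldLeft(new ArrayBuffer[Hit])((acc, b) => acc ++ b)))
    }
  }

  val ip = "183.196.254.131"

  // Two hypothetical microbatches for the same IP.
  val batch1: Seq[State] = Seq((ip, ArrayBuffer(("/test.php", 1418849762000L, 1418849762000L))))
  val batch2: Seq[State] = Seq((ip, ArrayBuffer(("/index.html", 1418849792000L, 1418849792000L))))

  // The result of one call is fed back in as the state for the next call.
  val afterBatch1: Option[State] = updateGroupByKey(batch1, None)
  val afterBatch2: Option[State] = updateGroupByKey(batch2, afterBatch1)
}
```

After the second call the buffer for the IP holds both hits, in arrival order, which is the accumulation the question asks for.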

Answer 2

Gabor's answer got me started down the right path, but here is an answer that produces the expected output.

First, for the output I want:

(100.40.49.235,List((/,1418934075000,1418934075000), (/,1418934105000,1418934105000), (/contactus.html,1418934174000,1418934174000)))

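I don't need groupByKey(). updateStateByKey already accumulates the values into a Seq, so adding groupByKey is unnecessary (and expensive). Spark users strongly recommend against using groupByKey.

Here is the code that worked: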

def updateValues( newValues: Seq[(String, Long, Long)],
                      currentValue: Option[Seq[ (String, Long, Long)]]
                      ): Option[Seq[(String, Long, Long)]] = {

  Some(currentValue.getOrElse(Seq.empty) ++ newValues)

  }


val grouped = ipTimeStamp.updateStateByKey(updateValues)

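Here updateStateByKey is passed a function (updateValues) that receives the values that arrived for a key in the current batch (newValues) together with an Option holding the state accumulated so far for that key (currentValue). It returns the combination of the two. getOrElse is required because currentValue may be empty, for example the first time a key is seen. Credit to https://twitter.com/granturing for the correct code.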
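
The per-key behaviour described above can be sketched in plain Scala, without Spark. This is only an approximation of what updateStateByKey does: the helper step is a hypothetical stand-in that groups a batch by key and calls the update function once per key, and the object name and the split of the sample hits into two batches are assumptions for illustration.

```scala
object UpdateStateSketch {
  type Hit = (String, Long, Long) // (requestPage, time, time)

  // The update function from the answer: append this batch's values to the state.
  def updateValues(newValues: Seq[Hit], currentValue: Option[Seq[Hit]]): Option[Seq[Hit]] =
    Some(currentValue.getOrElse(Seq.empty) ++ newValues)

  // Hypothetical stand-in for updateStateByKey: calls updateValues once for every
  // key present in either the previous state or the incoming batch.
  def step(state: Map[String, Seq[Hit]], batch: Seq[(String, Hit)]): Map[String, Seq[Hit]] = {
    val incoming: Map[String, Seq[Hit]] =
      batch.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }
    (state.keySet ++ incoming.keySet).flatMap { key =>
      updateValues(incoming.getOrElse(key, Seq.empty), state.get(key)).map(v => key -> v)
    }.toMap
  }

  val ip = "100.40.49.235"

  // Two hypothetical microbatches built from the sample output above.
  val batch1: Seq[(String, Hit)] = Seq(
    (ip, ("/", 1418934075000L, 1418934075000L)),
    (ip, ("/", 1418934105000L, 1418934105000L)))
  val batch2: Seq[(String, Hit)] = Seq(
    (ip, ("/contactus.html", 1418934174000L, 1418934174000L)))

  val afterBatch1: Map[String, Seq[Hit]] = step(Map.empty, batch1)
  val afterBatch2: Map[String, Seq[Hit]] = step(afterBatch1, batch2)
}
```

Because the values are already grouped per key before the update function is called, no separate groupByKey is needed. In a real job, updateStateByKey also requires checkpointing to be enabled on the StreamingContext, and since this update function never returns None, the state for each IP keeps growing.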


Answer 1
2 2014-12-17 22:34:18

Answer 2
2 Accepted 2014-12-18 20:45:36
