Spark Streaming Accumulated Word Count

Question

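This is a Spark Streaming program written in Scala. It counts the words arriving on a socket every 1 second. The result is a per-batch word count: for example, the count from time 0 to 1, then the count from time 1 to 2. Is there a way to change this program so that it accumulates the word count, that is, reports the count from time 0 up to now?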

val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))

// Create a socket stream on target ip:port and count the
// words in input stream of \n delimited text (eg. generated by 'nc')
// Note that no duplication in storage level only for running locally.
// Replication necessary in distributed scenario for fault tolerance.
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

Answer 1

You can use a StateDStream for this, which is what updateStateByKey gives you. The Spark examples include a stateful word count:

object StatefulNetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)

      val previousCount = state.getOrElse(0)

      Some(currentCount + previousCount)
    }

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(".")

    // Create a NetworkInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using updateStateByKey
    // This will give a Dstream made of state (which is the cumulative count of the words)
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

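The way it works is that for each batch you get a Seq[T] of the new values for a key, and you use it to update an Option[T], which acts like an accumulator. The state is an Option because in the first batch it is None, and it stays that way unless it is updated. In this example the count is an Int; if you are processing a lot of data you may even want a Long or a BigInt.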
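To make the update step concrete, here is a small plain-Python simulation of what happens to the state from batch to batch. It does not use Spark at all: `update_count` mirrors `updateFunc` above, while `apply_batch` and the sample batches are made up for illustration and only approximate what updateStateByKey does internally.

```python
from typing import Dict, List, Optional


def update_count(new_values: List[int], state: Optional[int]) -> Optional[int]:
    """Mirror of updateFunc: add this batch's values to the previous total."""
    previous = state if state is not None else 0  # None means no state yet
    return sum(new_values) + previous


def apply_batch(totals: Dict[str, int], batch: List[str]) -> Dict[str, int]:
    """Simulate one batch interval (hypothetical helper, not a Spark API)."""
    # Group the batch by word, like words.map(x => (x, 1)) followed by grouping.
    grouped: Dict[str, List[int]] = {}
    for word in batch:
        grouped.setdefault(word, []).append(1)

    updated = dict(totals)
    # Keys that already have state are revisited even when the batch has no
    # new values for them; they then receive an empty list.
    for word in set(totals) | set(grouped):
        result = update_count(grouped.get(word, []), totals.get(word))
        if result is not None:
            updated[word] = result
    return updated


totals: Dict[str, int] = {}
for batch in (["a", "b", "a"], ["b", "c"], []):
    totals = apply_batch(totals, batch)
    print(sorted(totals.items()))
```

The third, empty batch leaves the totals unchanged, which is the accumulating behaviour the question asks for. Python integers do not overflow, so the Int versus Long concern applies only to the Scala version.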

Answer 2

I have a very simple answer that takes only a few lines of code. You will find it in most Spark books. Note that I used localhost and port 9999.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PythonStreamingNetworkWordCount")
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" "))\
                     .map(lambda word: (word, 1))\
                     .reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start()
ssc.awaitTermination()

To stop it, you can simply use

ssc.stop()

This is very basic code, but it helps with a basic understanding of Spark Streaming, and more specifically of DStreams.

To feed input to localhost, type the following in your terminal (Mac Terminal):

nc -l 9999

It will then listen to everything you type afterwards, and the words will be counted.
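Note that reduceByKey in the snippet above counts words per batch only. A minimal sketch of a cumulative variant follows, assuming the same localhost:9999 source and swapping reduceByKey for updateStateByKey. The function names and the "checkpoint" directory are my own choices for illustration, and I have not run this against every Spark version.

```python
def add_to_running_total(new_values, running_total):
    """Add this batch's counts for a word to its running total.

    running_total is None the first time a word is seen.
    """
    return sum(new_values) + (running_total or 0)


def run_cumulative_word_count(host="localhost", port=9999,
                              checkpoint_dir="checkpoint"):
    # Imported here so add_to_running_total can be tried without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="PythonStreamingCumulativeWordCount")
    ssc = StreamingContext(sc, 1)   # 1-second batches
    ssc.checkpoint(checkpoint_dir)  # stateful operations need a checkpoint dir

    lines = ssc.socketTextStream(host, port)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .updateStateByKey(add_to_running_total))
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

Call `run_cumulative_word_count()` with `nc -l 9999` running in another terminal; each printed batch should then show totals since the start rather than per-second counts.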

Hope this helps.

Answer 1
Score 11, accepted, 2014-07-16 03:48:24

Answer 2
Score -1, 2020-08-09 16:23:03
