Spark Streaming Accumulated Word Count

Question

This is a Spark Streaming program written in Scala. It counts the words read from a socket in every 1-second batch. The result is a per-batch word count: for example, the count from time 0 to 1, then the count from time 1 to 2. Is there some way to alter this program to get the accumulated word count instead, that is, the word count from time 0 up until now?

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))

// Create a socket stream on target ip:port and count the
// words in the input stream of \n delimited text (e.g. generated by 'nc').
// Note: this storage level has no replication, which is fine only when running locally.
// In a distributed setup, replication is needed for fault tolerance.
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

Answer 1

You can use a stateful DStream for this, via updateStateByKey. Spark's examples include a stateful word count:

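import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulNetworkWordCount {
  def main(args: Array[String]): Unit = {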
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // Helper from the Spark examples package; drop this line when running outside it
    StreamingExamples.setStreamingLogLevels()

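    // values: this key's counts from the current batch
    // state: the running total so far (None the first time a key is seen)
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }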

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // updateStateByKey requires a checkpoint directory
    ssc.checkpoint(".")

    // Create a socket input stream on target ip:port and count the
    // words in the input stream of \n delimited text (e.g. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using updateStateByKey.
    // This gives a DStream of state (the cumulative count of each word).
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

The way it works: for each key, every batch hands you a Seq[T] of that key's new values, and you use it to update an Option[T] that acts like an accumulator. It is an Option because it is None on the first batch and stays that way unless it is updated. In this example the count is an Int; if you are dealing with a lot of data, you may want a Long or even a BigInt.
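To make the Seq/Option mechanics concrete, here is a small sketch in plain Python (no Spark needed) that mimics how the update function is applied per key, batch after batch. The names update_count and apply_batch are made up for illustration, and Python's None stands in for Scala's None. It only models the semantics described above, not Spark's actual implementation.

```python
from collections import defaultdict


def update_count(new_values, running_count):
    """Mirror of updateFunc: add this batch's values to the previous total."""
    return sum(new_values) + (running_count or 0)


def apply_batch(state, batch_words):
    """Apply update_count once per key, roughly as updateStateByKey does per batch."""
    grouped = defaultdict(list)
    for word in batch_words:
        grouped[word].append(1)  # same idea as words.map(x => (x, 1))
    # Keys already in the state are updated too, even when they are absent
    # from this batch (they simply receive an empty list of new values).
    keys = set(state) | set(grouped)
    return {k: update_count(grouped.get(k, []), state.get(k)) for k in keys}


state = {}
for batch in (["a", "b", "a"], ["b"], ["a", "c"]):
    state = apply_batch(state, batch)
    print(sorted(state.items()))
# [('a', 2), ('b', 1)]
# [('a', 2), ('b', 2)]
# [('a', 3), ('b', 2), ('c', 1)]
```

Each printed line is the running total after one batch, which is what stateDstream.print() shows in the Scala example.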

Answer 2

I have a very simple answer, and it's just a few lines of code. You can find this in most Spark books. Note that I have used localhost and port 9999.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PythonStreamingNetworkWordCount")
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" "))\
                     .map(lambda word: (word, 1))\
                     .reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start()
ssc.awaitTermination()

To stop it, you can simply call

ssc.stop()

This is very basic code, but it is helpful for a basic understanding of Spark Streaming, and DStreams in particular.

To give input on localhost, type this in your terminal (Mac terminal):

nc -l 9999

Everything you type after that is sent to the stream, and the words are counted.
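Note that the snippet above still prints per-batch counts. For the running total the question asks about, the PySpark DStream API also offers updateStateByKey. Below is a minimal sketch, assuming the same localhost:9999 source; the function names update_count and run_stream and the "checkpoint" directory name are placeholders chosen for illustration.

```python
def update_count(new_values, running_count):
    # new_values: this batch's 1s for a word
    # running_count: total so far, or None the first time a word is seen
    return sum(new_values) + (running_count or 0)


def run_stream():
    # Imported here so update_count can be tried without a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="PythonStatefulNetworkWordCount")
    ssc = StreamingContext(sc, 1)
    # Placeholder directory name; stateful operations need checkpointing enabled.
    ssc.checkpoint("checkpoint")

    lines = ssc.socketTextStream("localhost", 9999)
    running_counts = lines.flatMap(lambda line: line.split(" ")) \
                          .map(lambda word: (word, 1)) \
                          .updateStateByKey(update_count)
    running_counts.pprint()
    ssc.start()
    ssc.awaitTermination()
```

Call run_stream() with nc listening on port 9999 to start it. Compared with the per-batch version, the only changes are the checkpoint call and swapping reduceByKey for updateStateByKey.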

Hope this is helpful.

solution1
11 ACCEPTED 2014-07-16 03:48:24

solution2
-1 2020-08-09 16:23:03
