Spark Streaming Accumulated Word Count
This is a Spark Streaming program written in Scala. It counts the number of words received from a socket in every 1-second batch. The result is a per-batch word count: for example, the word count from time 0 to 1, then the word count from time 1 to 2. But is there some way we could alter this program so that we get an accumulated word count, that is, the word count from time 0 up until now?
val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))
// Create a socket stream on target ip:port and count the
// words in input stream of \n delimited text (eg. generated by 'nc')
// Note: a storage level without replication is fine only when running locally.
// Replication is necessary in a distributed scenario for fault tolerance.
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
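To see why this yields only per-batch counts, here is a plain-Python simulation (the batch contents are made-up examples; no Spark needed): each batch is reduced independently, so earlier results are discarded.

```python
from collections import Counter

# Hypothetical 1-second micro-batches of words received on the socket
batches = [["spark", "streaming", "spark"],
           ["spark", "word"]]

# reduceByKey operates within a single batch: each batch is counted
# on its own, and the previous batch's result is not carried over
per_batch = [dict(Counter(batch)) for batch in batches]
for counts in per_batch:
    print(counts)
# prints {'spark': 2, 'streaming': 1}
# then   {'spark': 1, 'word': 1}
```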
You can use a StateDStream for this. There is an example of stateful word count among Spark's examples:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulNetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Update function: add this batch's counts for a key to the running total
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(".")

    // Create a NetworkInputDStream on target ip:port and count the
    // words in the input stream of \n delimited text (e.g. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using updateStateByKey
    // This will give a DStream made of state (the cumulative count of the words)
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
The way it works is that you get a Seq[T] for each batch, and then you update an Option[T], which acts like an accumulator. The reason it is an Option is that on the first batch it will be None and will stay that way unless it is updated. In this example the count is an Int; if you are dealing with a lot of data you may even want a Long or a BigInt.
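This accumulator behavior can be simulated outside Spark. The following plain-Python sketch mimics what updateStateByKey does per key across batches (the batch data and helper names are illustrative assumptions, not Spark API):

```python
def update_func(values, state):
    # values: this batch's counts for one key (Spark's Seq[Int])
    # state: the running total, or None on the first batch (Spark's Option[Int])
    current_count = sum(values)
    previous_count = state if state is not None else 0
    return current_count + previous_count

# Hypothetical micro-batches of words
batches = [["spark", "streaming", "spark"],
           ["spark", "word"]]

state = {}  # per-key accumulated counts, like the state DStream
for batch in batches:
    # per-batch (word, count) pairs, as produced by map + reduceByKey
    per_batch = {}
    for word in batch:
        per_batch[word] = per_batch.get(word, 0) + 1
    for word, count in per_batch.items():
        state[word] = update_func([count], state.get(word))
    print(state)
# after the last batch: {'spark': 3, 'streaming': 1, 'word': 1}
```

Each key's state survives from batch to batch, which is exactly how the cumulative count is built up.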
I have a very simple answer, and it is just a few lines of code; you can find it in most Spark books. Note that I have used localhost and port 9999.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PythonStreamingNetworkWordCount")
ssc = StreamingContext(sc, 1)

lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
To stop it, you can use a simple
ssc.stop()
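Note that this PySpark version also prints only per-batch counts. For the accumulated count the question asks about, PySpark's DStream API likewise provides updateStateByKey; here is a minimal sketch of the update function (the checkpoint directory and the wiring in the comment are assumptions based on the snippet above, not run here):

```python
# Sketch of the update function one would pass to updateStateByKey.
# Assumed PySpark wiring (requires the `ssc` and `lines` from above):
#
#   ssc.checkpoint("checkpoint")  # updateStateByKey requires checkpointing
#   running = lines.flatMap(lambda line: line.split(" ")) \
#       .map(lambda word: (word, 1)) \
#       .updateStateByKey(update_count)
#   running.pprint()

def update_count(new_values, last_sum):
    # last_sum is None the first time a key is seen
    return sum(new_values) + (last_sum or 0)

print(update_count([1, 1], None))  # first batch for a key -> 2
print(update_count([1], 2))        # adds to the running total -> 3
```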
This is very basic code, but it is helpful for a basic understanding of Spark Streaming, and of DStreams more specifically.
To provide input to localhost, in your terminal (Mac terminal) type
nc -l 9999
It will then listen to everything you type afterwards, and the words will be counted.
Hope this is helpful.