
Spark Kafka WordCount Python


I've just started playing with Apache Spark and am trying to get the Kafka word count example to work in Python. I've decided to use Python because it's a language I'll be able to use for other big data tech, and Databricks are also offering their courses through Spark.

My question: I'm running the basic wordcount example from here: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py It seems to kick off and connect to the Kafka logs, but I can't see it actually produce a word count. I then added the lines below to write to a text file, and it just produces a bunch of empty text files. It is connecting to the Kafka topic and there is data in the topic, so how can I see what it's actually doing with the data, if anything? Could it be a timing thing? Cheers.

Code for processing Kafka data

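                # Split each record on "|", pair every token with 1, then sum per token
                counts = lines.flatMap(lambda line: line.split("|")) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)
                # saveAsTextFiles returns None, so call it separately; it writes one output directory per batch
                counts.saveAsTextFiles("sparkfiles")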

Data in Kafka topic

                    16|16|Mr|Joe|T|Bloggs
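For reference, here is a rough plain-Python sketch of what the flatMap / map / reduceByKey chain above should yield for that one sample record. It does not use Spark at all; `word_count` and `sample_batch` are hypothetical names made up for illustration, and the real job would of course run this per batch on a DStream.

```python
from collections import Counter

def word_count(lines, sep="|"):
    """Plain-Python stand-in for flatMap -> map -> reduceByKey on one batch."""
    counts = Counter()
    for line in lines:
        # flatMap step: split each record into tokens
        for word in line.split(sep):
            # map + reduceByKey steps: emit (word, 1) and sum per key
            counts[word] += 1
    return dict(counts)

# One batch containing only the sample record from the topic
sample_batch = ["16|16|Mr|Joe|T|Bloggs"]
result = word_count(sample_batch)
print(result)
```

If the job were actually receiving this record, you would expect "16" to be counted twice and every other token once, so empty output suggests the record never reached the batch rather than a problem in the counting logic.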

Sorry, I was being an idiot. When I produced data to the topic while the Spark app was running, I could see the following in the output:

                (u'a', 29)
                (u'count', 29)
                (u'This', 29)
                (u'is', 29)
                (u'so', 29)
                (u'words', 29)
                (u'spark', 29)
                (u'the', 29)
                (u'can', 29)
                (u'sentence', 29)

This shows how many times each word appeared in the batch that Spark had just processed.
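To make the per-batch behaviour concrete, here is a small plain-Python simulation (no Spark involved). It assumes the stream is cut into micro-batches and that each batch is counted on its own, so an interval in which no new messages arrive gives an empty result, which would be consistent with the empty text files in the question. The `sentence` string and the `batches` list are hypothetical; the sentence is only guessed from the words in the output above.

```python
from collections import Counter

def count_batch(lines):
    """Count words in a single micro-batch, independently of other batches."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(" "))
    return sorted(counts.items())

# Hypothetical message, guessed from the words shown in the output
sentence = "This is a sentence so spark can count the words"

# Hypothetical micro-batches: nothing arrives, then 29 copies, then nothing
batches = [
    [],
    [sentence] * 29,
    [],
]

per_batch = [count_batch(batch) for batch in batches]
for index, result in enumerate(per_batch):
    print(index, result)
```

Only the middle batch has anything to report, and every word in it has a count of 29 because the same sentence arrived 29 times in that interval. The counts do not carry over between batches in this sketch.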


 