Spark Kafka WordCount Python

I've just started playing with Apache Spark and am trying to get the Kafka word count example to work in Python. I chose Python because it's a language I'll be able to use for other big data tech, and Databricks also offers its courses through Spark.

My question: I'm running the basic word count example from here: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py It seems to start and connect to the Kafka logs, but I can't see it actually produce a word count. I then added the lines below to write to a text file, and it just produces a bunch of empty text files. It is connecting to the Kafka topic and there is data in the topic, so how can I see what it's actually doing with the data, if anything? Could it be a timing issue? Cheers.

Code for processing Kafka data

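    # Split each record on "|", pair every token with 1, then sum per token
    counts = lines.flatMap(lambda line: line.split("|")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    # saveAsTextFiles returns None, so call it separately to keep counts usable
    counts.saveAsTextFiles("sparkfiles")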

Data in Kafka topic

    16|16|Mr|Joe|T|Bloggs
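
As a sanity check that does not depend on Spark, the same split-and-count logic can be tried on this record in plain Python. This is only a sketch: the helper name `count_tokens` is hypothetical and is not part of the Spark example; it simply mimics what the flatMap / map / reduceByKey chain would compute for a single line.

```python
from collections import Counter


def count_tokens(line, sep="|"):
    """Hypothetical helper: split one line on sep and count each token,
    mirroring flatMap(split) -> map((word, 1)) -> reduceByKey(add)."""
    return Counter(line.split(sep))


record = "16|16|Mr|Joe|T|Bloggs"
counts = count_tokens(record)

# Print (token, count) pairs in a stable order
for token, n in sorted(counts.items()):
    print((token, n))
```

If this produces the pairs you expect, the transformation itself is fine and the empty output is more likely about which records reach each batch.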

Sorry, I was being an idiot. When I produced data to the topic while the Spark app was running, I could see the following in the output:

    (u'a', 29)
    (u'count', 29)
    (u'This', 29)
    (u'is', 29)
    (u'so', 29)
    (u'words', 29)
    (u'spark', 29)
    (u'the', 29)
    (u'can', 29)
    (u'sentence', 29)

Each pair shows how many times the word appeared in the batch that Spark had just processed.
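
The per-batch behaviour can be illustrated without Spark. The sketch below is a plain-Python simulation, not the Spark API: `count_batches` is a hypothetical helper, and the batch contents are made up for illustration. It assumes each batch is counted independently, with no state carried over, which is consistent with the output above: intervals in which nothing new arrives yield empty results, which would explain the empty files.

```python
from collections import Counter


def count_batches(batches, sep=" "):
    """Hypothetical helper: count words separately for each micro-batch.

    Each batch is a list of lines. No state is shared between batches,
    so a batch with no lines produces an empty result.
    """
    results = []
    for lines in batches:
        counts = Counter()
        for line in lines:
            # Skip empty tokens produced by repeated separators
            counts.update(w for w in line.split(sep) if w)
        results.append(dict(counts))
    return results


# Illustrative input: data only arrives during the second interval
batches = [
    [],
    ["spark can count words"] * 2,
    [],
]

per_batch = count_batches(batches)
for i, counts in enumerate(per_batch):
    print("batch", i, counts)
```

To keep a running total across batches in Spark itself, a stateful operation would be needed instead of a plain reduceByKey on each batch.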
