Not able to connect to kafka topic using spark streaming (python, jupyter)

Question

I tried to connect to a Kafka topic using Spark Streaming. It neither reads any data into its DStream nor gives any error. Here is my Jupyter code:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'
from pprint import pprint
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)

kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {'topic_name':1})
kafkaStream.pprint()

Nothing gets printed. I also tried createDirectStream but didn't get any output. I followed "Spark Streaming not reading from Kafka topics" and added PYTHONPATH, but that didn't help either.

Any help would be deeply appreciated. Thanks!

Answer 1

It's not clear whether you are sending any data, but you're not actually starting consumption.

You'll need this at the end:

ssc.start()
ssc.awaitTermination()
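
In a Jupyter notebook, ssc.awaitTermination() blocks the cell until the stream is stopped. If you only want to watch a few batches, a bounded run may be more convenient. The following is a minimal sketch, assuming ssc is the StreamingContext from the question; run_for is a hypothetical helper name, while start, awaitTerminationOrTimeout and stop are StreamingContext methods.

```python
def run_for(ssc, seconds):
    """Start the streaming context, let it run for `seconds`, then stop it.

    `ssc` is expected to be a pyspark.streaming.StreamingContext.
    run_for is a hypothetical helper, not part of PySpark.
    """
    ssc.start()
    try:
        # Returns when the timeout elapses or the context stops, whichever comes first.
        ssc.awaitTerminationOrTimeout(seconds)
    finally:
        # Keep the SparkContext alive so the notebook can build a new StreamingContext.
        ssc.stop(stopSparkContext=False, stopGraceFully=True)
```

With the 60-second batch interval from the question, run_for(ssc, 180) should leave room for a few batches. A stopped StreamingContext cannot be restarted, so create a new one before calling it again.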

You also need to add "auto.offset.reset": "smallest" to the Kafka parameters to read data that already exists in the topic:

from pyspark.streaming.kafka import KafkaUtils

# topic is the topic name; brokers is a "host:port" list such as "localhost:9092"
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers, "auto.offset.reset": "smallest"})
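
The offset setting matters because the 0.8 consumer defaults to the latest offset, so a stream started without it only sees messages produced afterwards. A small sketch of building the parameter dictionary, assuming the 0.8 direct API; kafka_params and its from_beginning flag are hypothetical names used for illustration only.

```python
def kafka_params(brokers, from_beginning=True):
    """Build the kafkaParams dict for KafkaUtils.createDirectStream (0.8 API).

    kafka_params is a hypothetical helper, not part of PySpark.
    """
    params = {"metadata.broker.list": brokers}
    if from_beginning:
        # "smallest" is the 0.8 name for "start at the earliest available offset".
        params["auto.offset.reset"] = "smallest"
    return params
```

It would be passed as the third argument, for example KafkaUtils.createDirectStream(ssc, [topic], kafka_params("localhost:9092")).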

Answer 2

As cricket_007 mentioned, Structured Streaming is generally preferred. If you still want to use the direct stream method, here is a sample.

Note: This reads messages from the topic 'topicname' and rewrites them into another topic called 'compacttopic'.

import sys

from kafka import KafkaProducer
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Created on the driver; handler() also runs on the driver because it collect()s each batch.
producer = KafkaProducer(bootstrap_servers='localhost:9092')

def handler(message):
    # message is the RDD for one batch; each record is a (key, value) pair
    records = message.collect()
    for record in records:
        value_all = record[1]
        # The text before the first '|' becomes the key in the target topic
        value_key = value_all.split('|')[0]
        print(value_key)
        # Encode to bytes because no serializers are configured on the producer
        producer.send('compacttopic', key=value_key.encode('utf-8'), value=value_all.encode('utf-8'))
    # Flush once per batch rather than once per record
    producer.flush()

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 10)

    # Brokers and source topic come from the command line
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    kvs.foreachRDD(handler)

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
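
The handler keys each record by the text before the first '|'. That step can be checked without Spark or Kafka by isolating it as a pure function. The sketch below assumes Python 3 and string values; to_key_value is a hypothetical name. It encodes both parts because kafka-python's KafkaProducer expects bytes for key and value unless serializers are configured.

```python
def to_key_value(value, sep="|"):
    """Split a message value into (key, value) byte strings for KafkaProducer.send.

    The key is the text before the first separator; the value is sent unchanged.
    to_key_value is a hypothetical helper used for illustration.
    """
    key = value.split(sep)[0]
    return key.encode("utf-8"), value.encode("utf-8")
```

Inside the handler it could be used as key, value = to_key_value(record[1]) followed by producer.send('compacttopic', key=key, value=value).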

spark-submit command:

./bin/spark-submit --jars /Users/KarthikeyanDurairaj/jarfiles/spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar topictotopic.py localhost:9092 topicname

Note: Adjust the jar version to match your installed Spark version.
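
In a notebook the same version matching applies to the --packages coordinate set through PYSPARK_SUBMIT_ARGS. A minimal sketch that derives the coordinate from the Spark and Scala versions, following the pattern used in the question; kafka_package and submit_args are hypothetical helper names, and the default Scala suffix of 2.11 is an assumption that must match your Spark build.

```python
import os

def kafka_package(spark_version, scala_version="2.11"):
    """Maven coordinate of the Kafka 0.8 streaming integration for a Spark version."""
    return "org.apache.spark:spark-streaming-kafka-0-8_{}:{}".format(
        scala_version, spark_version
    )

def submit_args(spark_version, scala_version="2.11"):
    """Value for PYSPARK_SUBMIT_ARGS; it must end with 'pyspark-shell'."""
    return "--packages {} pyspark-shell".format(
        kafka_package(spark_version, scala_version)
    )

# Must be set before the SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args("2.0.2")
```

The installed version is available as pyspark.__version__. The 0-8 integration was published only for Spark 2.x; it was removed in Spark 3.0, where Structured Streaming is the way to read from Kafka in Python.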

Structured Streaming approach:

For a PySpark-based Structured Streaming example, refer to the following Stack Overflow question:

Failed to find leader for topics; java.lang.NullPointerException NullPointerException at org.apache.kafka.common.utils.Utils.formatAddress
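
For orientation, a Structured Streaming reader for the same topic might look like the sketch below. It assumes a SparkSession started with the spark-sql-kafka-0-10 package matching your Spark version; kafka_source_options and start_console_query are hypothetical helper names, while the option keys are those of Spark's kafka source.

```python
def kafka_source_options(brokers, topic, from_beginning=True):
    """Options for Spark's 'kafka' Structured Streaming source."""
    return {
        "kafka.bootstrap.servers": brokers,
        "subscribe": topic,
        # "earliest" also reads messages that were in the topic before the query started.
        "startingOffsets": "earliest" if from_beginning else "latest",
    }

def start_console_query(spark, brokers, topic):
    """Print each message's key and value to the console sink.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    df = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(brokers, topic))
        .load()
    )
    # Kafka keys and values arrive as binary columns, so cast them to strings.
    messages = df.selectExpr(
        "CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"
    )
    return messages.writeStream.format("console").outputMode("append").start()
```

Calling start_console_query(spark, "localhost:9092", "topic_name") returns a StreamingQuery; its stop() method ends the stream.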
