HDP 2.6.5, no Kerberos.
I am running Kafka and Spark in a cluster.
I write data to a particular topic in Kafka and then try to read and display that data with a PySpark script.
However, the read freezes without throwing an error.
Starting pyspark:
pyspark --master yarn --num-executors 1 --executor-cores 4 --executor-memory 16G --driver-cores 4 --driver-memory 8G --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1
In pyspark shell:
from pyspark.sql import SparkSession, SQLContext, HiveContext
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
sqlcontext = SQLContext(spark.sparkContext)
hivecontext = HiveContext(spark.sparkContext)
hivecontext.setConf("hive.exec.dynamic.partition", "true")
hivecontext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
ds = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "server-1:6667,server-2:6667")
      .option("subscribe", "testtopic")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load())
ds.show()
When I read the data directly on the Kafka server with:
./kafka-run-class.sh kafka.tools.SimpleConsumerShell --broker-list server-1:6667,server-2:6667 --topic testtopic --partition 0
the data is there, so the topic is not empty.
I checked the availability of servers and ports from the server where spark is running using:
nc -zv server-1 2181
nc -zv server-1 6667
Both checks succeed.
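The same reachability test can be scripted from the Spark host in plain Python. This is a minimal sketch using only the standard library; the hostnames and ports are the ones from the question, and port_open is a hypothetical helper, not part of any Kafka or Spark API:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False

# Check ZooKeeper and the Kafka broker port from the Spark host
# (server names as used in the question).
for host, port in [("server-1", 2181), ("server-1", 6667)]:
    print(host, port, "reachable" if port_open(host, port) else "unreachable")
```

A TCP connect only proves the port is open; it does not verify that the broker's advertised hostname is resolvable from the client, which is a common cause of a consumer that hangs without an error.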
Writing to the topic is done from one server and reading from another; all servers are in the same cluster.
UPD. By trial and error I found the following. On the Kafka server, the command
kafka-console-consumer.sh --zookeeper server-1:2181 --topic testtopic --from-beginning
returns data.
The command
kafka-console-consumer.sh --bootstrap-server server-1:6667 --topic testtopic --from-beginning --partition 0
also returns data.
But when I run the consumer on another server, it does not show up in Kafka's consumer list. (Note: when --partition is specified, the console consumer assigns the partition manually instead of subscribing as part of a consumer group, so it may not appear in the consumer-group list at all.)
Once you have defined the final result DataFrame/Dataset, all that is left is to start the streaming computation. To do that, use the DataStreamWriter (Scala/Java/Python docs) returned by Dataset.writeStream(), and specify one or more options on it. Note that a DataFrame itself has no start() method; if you read with spark.readStream instead of spark.read, start the query through writeStream, for example:
query = ds.writeStream.format("console").start()
query.awaitTermination()
(With the batch spark.read shown in the question, no start() is needed; ds.show() already triggers the read.)