
Normal write to apache-kafka but not able to read topic data in spark job

HDP 2.6.5, no Kerberos.

I am running Kafka and Spark in a cluster.

I am writing data to a particular topic in Kafka and trying to run Python code that reads and shows the data from Kafka.

However, the read freezes and does not throw an error.

Starting pyspark:

pyspark --master yarn --num-executors 1 --executor-cores 4 --executor-memory 16G --driver-cores 4 --driver-memory 8G --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1

In the pyspark shell:

from pyspark.sql import SparkSession, SQLContext, HiveContext

# Reuse the shell's Spark session, with Hive support enabled
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Legacy contexts, kept only for the dynamic-partition Hive settings
sqlcontext = SQLContext(spark.sparkContext)
hivecontext = HiveContext(spark.sparkContext)
hivecontext.setConf("hive.exec.dynamic.partition", "true")
hivecontext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

# Batch read of the whole topic, from the earliest to the latest offset
ds = (spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "server-1:6667,server-2:6667")
    .option("subscribe", "testtopic")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load())
ds.show()
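
Note that the Kafka source delivers the key and value columns as binary, so even a successful show() prints byte arrays. A minimal sketch for displaying them as text, assuming the producer wrote UTF-8 strings to testtopic:

# key/value are binary by default; cast them for readable output
ds.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(truncate=False)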

When I read the data directly on the Kafka server:

./kafka-run-class.sh kafka.tools.SimpleConsumerShell --broker-list server-1:6667,server-2:6667 --topic testtopic --partition 0

the data is in the topic.

I checked the availability of the servers and ports from the server where Spark is running, using:

nc -zv server-1 2181
nc -zv server-1 6667

Both connections succeed.

Writing to the topic is done from one server, reading from another. All servers are in the same cluster.

UPD. Through trial and error I found the following: on the Kafka server, the command

kafka-console-consumer.sh --zookeeper server-1:2181 --topic testtopic --from-beginning

returns data.

The command

kafka-console-consumer.sh --bootstrap-server server-1:6667 --topic testtopic --from-beginning --partition 0

also returns data.

But when I run the consumer on another server, it does not show up in Kafka's consumer list.
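
That symptom often means the bootstrap connection succeeds but the host:port the broker advertises to clients is not reachable from the remote machine. A hedged diagnostic sketch, assuming the third-party kafka-python package (not part of the original setup) is installed on the remote host:

# Hypothetical diagnostic, assuming `pip install kafka-python` on the remote host.
# If this hangs or raises, the broker's advertised listeners are probably
# not resolvable or reachable from this machine.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="server-1:6667")
print("partitions for testtopic:", consumer.partitions_for_topic("testtopic"))
consumer.close()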

Once you have defined the final result DataFrame/Dataset, all that is left is to start the streaming computation. To do that, you have to use the DataStreamWriter (see the Scala/Java/Python docs) returned through Dataset.writeStream() and specify the output sink in that interface. Just try that:

# writeStream (and hence start()) only exists on a streaming DataFrame,
# i.e. one created with spark.readStream instead of spark.read
ds.writeStream.format("console").start()
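
For completeness, a minimal end-to-end sketch of the streaming variant, assuming the same brokers and topic as in the question (the console sink prints each micro-batch to stdout):

# Streaming read; note there is no endingOffsets for a streaming query
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "server-1:6667,server-2:6667")
    .option("subscribe", "testtopic")
    .option("startingOffsets", "earliest")
    .load())

# Cast the binary columns and start the query against the console sink
query = (stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")
    .option("truncate", "false")
    .start())

query.awaitTermination()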

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#starting-streaming-queries
