Writes to Apache Kafka succeed, but a Spark job cannot read data from the topic

Question

HDP 2.6.5, no Kerberos.

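I run Kafka and Spark in a cluster.

I write data to a specific Kafka topic and then try to run Python code that reads the data from Kafka and displays it.

However, the read hangs and no error is raised.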

Starting pyspark:

pyspark --master yarn --num-executors 1 --executor-cores 4 --executor-memory 16G --driver-cores 4 --driver-memory 8G --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1

In the pyspark shell:

from pyspark.sql import SparkSession, SQLContext, HiveContext
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
sqlcontext = SQLContext(spark.sparkContext)
hivecontext = HiveContext(spark.sparkContext)
hivecontext.setConf("hive.exec.dynamic.partition", "true")
hivecontext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

ds = spark.read.format("kafka").option("kafka.bootstrap.servers", "server-1:6667,server-2:6667").option("subscribe", "testtopic").option("startingOffsets", "earliest").option("endingOffsets", "latest").load()
ds.show()

When I read the data directly on the server:

./kafka-run-class.sh kafka.tools.SimpleConsumerShell --broker-list server-1:6667,server-2:6667 --topic testtopic --partition 0

the data is in the topic.

From the server that runs Spark, I checked that the Kafka server and its ports are reachable:

nc -zv server-1 2181
nc -zv server-1 6667

Both checks succeed.
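The `nc` checks above only cover server-1, while the Spark job lists two brokers. A Kafka client uses the bootstrap servers to discover the cluster and then connects to each broker at the address that broker advertises, so every broker should be reachable from the driver and executor hosts. Below is a minimal sketch of the same check in Python using only the standard library. The helper names `parse_bootstrap`, `port_is_open`, and `check_brokers` are hypothetical and not part of any Kafka or Spark API.

```python
import socket


def parse_bootstrap(servers):
    """Split 'host1:port1,host2:port2' into a list of (host, port) tuples."""
    pairs = []
    for item in servers.split(","):
        host, _, port = item.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_brokers(servers, timeout=3.0):
    """Map every 'host:port' in a bootstrap string to its TCP reachability."""
    return {
        "{}:{}".format(host, port): port_is_open(host, port, timeout)
        for host, port in parse_bootstrap(servers)
    }
```

Calling `check_brokers("server-1:6667,server-2:6667")` on the host that runs the Spark driver, and on each host that can run an executor, returns a dictionary of broker addresses and booleans. A successful TCP connection only shows that the port is open; it does not prove that the broker advertises a hostname the client can resolve.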

Data is written to the topic from one server and read from another. All servers are in the same cluster.

UPD. By trial and error I found that running this command on the Kafka server

kafka-console-consumer.sh --zookeeper server-1:2181 --topic testtopic --from-beginning

returns data.

The command

kafka-console-consumer.sh --bootstrap-server server-1:6667 --topic testtopic --from-beginning --partition 0

also returns data.

But when I run the consumer on another server, it does not appear in the Kafka consumer list.

Answer 1

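From the Structured Streaming programming guide: once you have defined the final result DataFrame/Dataset, all that is left is to start the streaming computation. To do that, you have to use the DataStreamWriter (Scala/Java/Python docs) returned through Dataset.writeStream(). You will have to specify one or more of the following in this interface: details of the output sink, output mode, query name, trigger interval, checkpoint location. Try: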

ds.start()

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#starting-streaming-queries
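Note that `ds` in the question is created with `spark.read`, which is a batch read, and `start()` is a method of DataStreamWriter rather than of a DataFrame. A streaming version of the read would go through `spark.readStream` and `writeStream` instead. Here is a minimal sketch, assuming the SparkSession from the pyspark shell and the broker and topic names from the question. The helper names `kafka_options` and `start_console_query` are hypothetical, and nothing in this thread confirms that switching to streaming removes the hang.

```python
def kafka_options(bootstrap_servers, topic, starting_offsets="earliest"):
    """Build the options for Spark's Kafka source in a streaming query.

    endingOffsets is left out on purpose: it applies to batch queries only.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def start_console_query(spark, options):
    """Start a streaming query that prints Kafka records to the console.

    `spark` is an existing SparkSession, for example the one the pyspark
    shell provides. Returns the StreamingQuery handle.
    """
    stream_df = spark.readStream.format("kafka").options(**options).load()
    # Kafka keys and values arrive as binary, so cast them to strings.
    decoded = stream_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    return decoded.writeStream.format("console").outputMode("append").start()
```

In the pyspark shell this could be used as `query = start_console_query(spark, kafka_options("server-1:6667,server-2:6667", "testtopic"))`, followed by `query.awaitTermination()` to keep the query running and `query.stop()` to end it.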
