Not able to connect to a Kafka topic using Spark Streaming (Python, Jupyter)
Reading a Kafka topic using the streaming API in PySpark - not able to write to console or send to any other sink
I am a bit stuck. I read a Kafka topic in PySpark using the Spark Streaming API, and then I try to push it to a console sink or to another Kafka topic, but the whole process just hangs and does nothing. I have checked that there are messages on the topic; if I use a Java-based consumer I can read them and move on, but somehow PySpark cannot consume and output the messages. I also put the code in a Zeppelin notebook; it is shown below. If anyone could take a quick look and suggest what I am doing wrong, it would be much appreciated.
%pyspark
from pyspark.sql.types import StructType, StructField, TimestampType, StringType, IntegerType, DoubleType
from pyspark.sql.functions import *

def foreach_function(df, epoch_id):
    print("I am here")
    # pass

schema = StructType([
    StructField("orderId", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("order_VALUE", DoubleType(), True),
    StructField("sku", StringType(), True),
    StructField("sales_DATE", StringType(), True)
])

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "kafka.topic.orders") \
    .option("startingOffsets", "latest") \
    .load()
df.printSchema()

dataDF = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
dataDF.printSchema()

orderDF = dataDF.select(from_json(col("value"), schema)).alias("data").select("data.*")
orderDF.printSchema()

orderDF.writeStream.outputMode("append").format("console").option("checkpointLocation", "/test/chkpt").start().awaitTermination()
Error
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
root
|-- from_json(value): struct (nullable = true)
| |-- orderId: string (nullable = true)
| |-- quantity: integer (nullable = true)
| |-- order_VALUE: double (nullable = true)
| |-- sku: string (nullable = true)
| |-- sales_DATE: string (nullable = true)
Fail to execute line 31: orderDF.writeStream.outputMode("append").format("console").option("checkpointLocation", "/test/chkpt").start().awaitTermination()
Traceback (most recent call last):
File "/tmp/1625186594615-0/zeppelin_python.py", line 158, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 31, in <module>
File "/Users/test/software/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/streaming.py", line 103, in awaitTermination
return self._jsq.awaitTermination()
File "/Users/test/software/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/test/software/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 137, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.StreamingQueryException: Writing job aborted.
=== Streaming Query ===
Identifier: [id = 85d72b5f-f1f5-4ad3-a8b4-cb986576ced2, runId = 229fed09-0c60-4eae-a296-7fbebb46f4d6]
Current Committed Offsets: {}
Current Available Offsets: {KafkaV2[Subscribe[kafka.topic.orders]]: {"kafka.topic.orders":{"2":34,"1":31,"0":35}}}
Maybe first take a look at whether this helps: "pySpark Structured Streaming from Kafka does not output to console for debugging".
I would try this:
writeStream
.format("console")
.start().awaitTermination()
Also double-check that messages are actually being produced after you start the consumer (the PySpark job above), because you have the "latest" flag set.
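With `startingOffsets` set to `latest`, the query only sees records produced after it starts. A configuration sketch of the same read using `earliest` for debugging (same broker address and topic name as in the question; this requires a live Kafka broker, so it is shown as a fragment only):

```python
# Read from the beginning of the topic so already-produced messages are
# consumed (useful while debugging; switch back to "latest" afterwards).
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "kafka.topic.orders") \
    .option("startingOffsets", "earliest") \
    .load()
```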
I was able to make it work. I tried the script outside Zeppelin using spark-submit, and it worked after I added the commons-pool2 jar. With plain spark-submit I was also able to see the full stack trace. Thanks, guys. The dependency: org.apache.commons:commons-pool2:2.10.0
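For reference, rather than copying jars around manually, the Kafka connector and the commons-pool2 dependency can be resolved from Maven with `--packages`. A sketch of the spark-submit invocation, assuming Spark 3.0.0 built against Scala 2.12 and a hypothetical script name (adjust the versions and filename to your setup):

```shell
# Kafka source + commons-pool2 (the missing runtime dependency) pulled from Maven.
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,org.apache.commons:commons-pool2:2.10.0 \
  your_streaming_job.py
```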