Reading a Kafka topic using the streaming API in PySpark - unable to write to console or send to any other sink

I'm a bit stuck. I'm reading a Kafka topic in PySpark using the Spark streaming API and then trying to push it to a sink - the console or another Kafka topic. I'm not sure what I'm doing wrong, but the whole process just hangs and does nothing. I've checked that there are messages on the topic, etc. If I use a Java-based consumer I can read them and move on, but somehow PySpark cannot consume and output the messages. I've also put the code in a Zeppelin notebook; the code is below. If someone could take a quick look and suggest what I'm doing wrong, it would be much appreciated.

%pyspark

# (Unused) foreachBatch handler left over from debugging
def foreach_function(df, epoch_id):
    print("I am here")


from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import from_json, col

# Expected schema of the JSON order messages
schema = StructType([
    StructField("orderId", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("order_VALUE", DoubleType(), True),
    StructField("sku", StringType(), True),
    StructField("sales_DATE", StringType(), True)
])
# Subscribe to the Kafka topic, starting from the latest offsets
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "kafka.topic.orders") \
  .option("startingOffsets", "latest") \
  .load()

df.printSchema()

# Kafka delivers key/value as binary, so cast both to strings
dataDF = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
dataDF.printSchema()
# Parse the JSON payload into typed columns (the alias belongs on the
# from_json column, inside select, so that "data.*" expands the struct)
orderDF = dataDF.select(from_json(col("value"), schema).alias("data")).select("data.*")
orderDF.printSchema()

orderDF.writeStream.outputMode("append").format("console").option("checkpointLocation", "/test/chkpt").start().awaitTermination()



Output and error:
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

root
 |-- from_json(value): struct (nullable = true)
 |    |-- orderId: string (nullable = true)
 |    |-- quantity: integer (nullable = true)
 |    |-- order_VALUE: double (nullable = true)
 |    |-- sku: string (nullable = true)
 |    |-- sales_DATE: string (nullable = true)

Fail to execute line 31: orderDF.writeStream.outputMode("append").format("console").option("checkpointLocation", "/test/chkpt").start().awaitTermination()
Traceback (most recent call last):
  File "/tmp/1625186594615-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 31, in <module>
  File "/Users/test/software/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/streaming.py", line 103, in awaitTermination
    return self._jsq.awaitTermination()
  File "/Users/test/software/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/test/software/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 137, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.StreamingQueryException: Writing job aborted.
=== Streaming Query ===
Identifier: [id = 85d72b5f-f1f5-4ad3-a8b4-cb986576ced2, runId = 229fed09-0c60-4eae-a296-7fbebb46f4d6]
Current Committed Offsets: {}
Current Available Offsets: {KafkaV2[Subscribe[kafka.topic.orders]]: {"kafka.topic.orders":{"2":34,"1":31,"0":35}}}

Maybe first see whether this helps: "pySpark structured streaming from Kafka does not output to console for debugging".

I would try this:

orderDF.writeStream \
    .format("console") \
    .start().awaitTermination()

Also double-check that messages are being produced after you start the consumer (the pyspark job above), since you are reading with the "latest" offsets flag.
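
A minimal debugging sketch along those lines, assuming the same broker and topic as in the question: switching startingOffsets to "earliest" replays messages already on the topic, so the console sink prints something even if nothing new is produced after the query starts.

# Debugging sketch (assumes the broker/topic from the question):
# "earliest" replays existing messages, so the console sink shows
# output even when no new messages arrive after the query starts.
debugDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "kafka.topic.orders") \
    .option("startingOffsets", "earliest") \
    .load() \
    .selectExpr("CAST(value AS STRING)")

debugDF.writeStream \
    .format("console") \
    .outputMode("append") \
    .start() \
    .awaitTermination()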

I was able to make it work. I tried the script outside Zeppelin using spark-submit, and it worked after I added the commons-pool2 jar (org.apache.commons:commons-pool2:2.10.0). With plain spark-submit I could also see the full stack trace. Thank you all.
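
For reference, a sketch of one way to wire those dependencies in via spark.jars.packages. The commons-pool2 coordinate comes from the comment above; the spark-sql-kafka coordinate is an assumption based on the Spark 3.0.0 / Scala 2.12 build visible in the traceback, and the app name is made up.

from pyspark.sql import SparkSession

# Sketch: resolve the Kafka source and commons-pool2 at startup.
# The sql-kafka coordinate is an assumption based on the Spark 3.0.0
# build shown in the traceback; the equivalent CLI form would be:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,org.apache.commons:commons-pool2:2.10.0 script.py
spark = SparkSession.builder \
    .appName("kafka-orders-stream") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,"
            "org.apache.commons:commons-pool2:2.10.0") \
    .getOrCreate()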
