java.lang.ClassNotFoundException: org.apache.spark.sql.Dataset
java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaSourceRDDPartition
I am using Spark 2.3.0 and Kafka 1.0.0.3. I created a Spark read stream:
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost.cluster.com:6667") \
    .option("subscribe", "test_topic") \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
It runs successfully. Then:
df_write = df \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)") \
    .writeStream \
    .format("csv") \
    .option("path", "/test_streaming_data") \
    .option("checkpointLocation", "test_streaming_data/checkpoint") \
    .start()
But when I run this:
df_write.awaitTermination()
it throws an error:
Py4JJavaError: An error occurred while calling o264.awaitTermination.
: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: [id = c140e21c-f827-4b1d-9182-b3f68a405fad, runId = 47d4b5cb-f223-4235-bef1-84871a2f85c8]
Current Committed Offsets: {}
Current Available Offsets: {KafkaSource[Subscribe[test_topic]]: {"test_topic":{"0":31300}}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
Project [cast(key#21 as string) AS key#124, cast(value#22 as string) AS value#125, cast(timestamp#23 as timestamp) AS timestamp#126]
+- Project [cast(key#7 as string) AS key#21, cast(value#8 as string) AS value#22, cast(timestamp#12 as timestamp) AS timestamp#23]
+- StreamingExecutionRelation KafkaSource[Subscribe[test_topic]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:131)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:475)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, localhost.cluster.com, executor 2): java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaSourceRDDPartition
Can anyone help me solve this problem? I tried replacing the JAR libraries with newer ones, but the problem persists.
> I tried replacing the JAR libraries with newer ones

It is not clear exactly what you are doing, but you should not modify any JAR files directly. Instead, pull in the Kafka connector with the --packages option when you run the application. For the latest Spark 2.3.x, you need this package:
spark-submit --master=local \
--packages='org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.4' \
app.py
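The coordinate has three parts that all have to line up: the fixed group/artifact name, the Scala version your Spark build was compiled against (2.11 for the stock Spark 2.3.x distribution), and the exact Spark version. As a rough sketch (the helper function name here is mine, not part of Spark), the coordinate is assembled like this:

```python
def kafka_sql_package(spark_version, scala_version="2.11"):
    """Build the Maven coordinate for Spark's Kafka 0.10+ SQL connector.

    The group and artifact name are fixed by the Spark project; the
    suffix must match the Scala version Spark was built with, and the
    final component must match the Spark version exactly.
    """
    return "org.apache.spark:spark-sql-kafka-0-10_{}:{}".format(
        scala_version, spark_version)

print(kafka_sql_package("2.3.4"))
# org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.4
```

A mismatched Scala suffix or Spark version in this coordinate is a common cause of ClassNotFoundException errors like the one above, because the executors end up loading connector classes compiled against a different Spark build.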
我這里有一個 Jupyter 示例 - https://github.com/OneCricketeer/docker-stacks/blob/master/hadoop-spark/spark-notebooks/kafka-sql.ipynb