
Kafka with Spark 3.0.1 Structured Streaming: InvalidClassException: org.apache.kafka.common.TopicPartition; class invalid for deserialization

I am trying to read Kafka messages in Google Dataproc using PySpark Structured Streaming.

Version details are:

  1. The Dataproc image version is 2.0.0-RC22-debian10 (chosen to get PySpark 3.0.1 with Delta Lake 0.7.0, since I ultimately have to write this data to a Delta table hosted on Google Cloud Storage).
  2. PySpark version is 3.0.1, and the Python version used by PySpark is 3.7.3.
  3. The packages I am using are (see the launch sketch after this list):
    org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1
    io.delta:delta-core_2.12:0.7.0
    org.apache.spark:spark-avro_2.12:3.0.1
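For context, a minimal sketch (not from the original post) of how these three packages can be wired into a PySpark session; the Delta SQL extension and catalog settings below are what Delta Lake 0.7.0 documents for Spark 3.0, and the app name is a placeholder:

from pyspark.sql import SparkSession

# Launch sketch: pull the three packages listed above and enable Delta Lake.
spark = (
    SparkSession.builder
    .appName("kafka-to-delta")  # hypothetical app name
    .config("spark.jars.packages", ",".join([
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1",
        "io.delta:delta-core_2.12:0.7.0",
        "org.apache.spark:spark-avro_2.12:3.0.1",
    ]))
    # Delta Lake 0.7.0 on Spark 3.0 expects these two settings.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)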

A snippet of the code:

import os

from pyspark.sql.avro.functions import from_avro

# Read the Avro schema used to decode the Kafka message value.
__my_dir = os.path.dirname("<directory_path>")
jsonFormatSchema = open(os.path.join(__my_dir, 'avro_schema_file.avsc'), "r").read()

# Stream from Kafka and decode the binary "value" column with the Avro schema.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<kafka_broker>")
    .option("subscribe", "<kafka_topic>")
    .option("startingOffsets", "latest")
    .load()
    .select(from_avro("value", jsonFormatSchema).alias("element"))
)

df.printSchema()
 
df_output = df.select(
    "element.after.id",
    "element.after.name",
    "element.after.attribute",
    "element.after.quantity",
)

StreamQuery = (
    df_output.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "<check_point_location>")
    .trigger(once=True)
    .start("<target_delta_table>")
)
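With trigger(once=True) the query drains whatever is currently available on the topic and then stops on its own. A one-line usage sketch (assuming the StreamQuery handle above; not part of the original snippet) to block until that single run finishes:

# Block the driver until the single triggered run completes.
StreamQuery.awaitTermination()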

The error I am getting is:

java.io.InvalidClassException: org.apache.kafka.common.TopicPartition;
class invalid for deserialization

Why does Spark fail to deserialize TopicPartition, and how can I solve it?

The following post helped to resolve this issue: How to use Confluent Schema Registry with from_avro standard function?

In addition, we started pointing to the following jar for kafka-clients:

kafka-clients-2.4.1.jar

The error disappears when you set the master to local[*]. Anyhow, the problem seems to be related to a jar conflict between the driver and the executors: they use different versions of the kafka-clients library.
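One way to confirm the mismatch (a diagnostic sketch, not from the original answer): kafka-clients ships a utility class, org.apache.kafka.common.utils.AppInfoParser, that reports its own version, and you can call it on the driver through PySpark's py4j gateway, then compare the result against the jar you ship to the executors:

# Diagnostic sketch: which kafka-clients version did the driver JVM load?
driver_kafka_version = (
    spark.sparkContext._jvm
    .org.apache.kafka.common.utils.AppInfoParser.getVersion()
)
print("driver kafka-clients version:", driver_kafka_version)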

To solve the problem you may want to launch jobs with:

gcloud dataproc jobs submit spark \
 --class <YOUR_CLASS> \
 --jars target/scala-2.12/<COMPILED_JAR_FILE>.jar,kafka-clients-2.4.1.jar \
 --cluster <CLUSTER_NAME> \
 --region=<YOUR_REGION> \
 --properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,spark.executor.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar,spark.driver.extraClassPath=org.apache.kafka_kafka-clients-2.4.1.jar

This works in my case. Please check https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10_2.12/3.0.1 for further details on versions.
