
org.apache.spark.SparkException: Task not serializable error in UDF

We are trying to fetch data from Kafka and deserialize messages in the Avro data format. The code works fine up to kafkaDataFrame, where data is fetched from the Kafka topic, but when we try to extract the value from kafkaDataFrame using the deserialize() UDF, it throws "Task not serializable" together with java.io.NotSerializableException: io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient. Could anyone help us resolve this issue?

The code is copied from [medium link] https://github.com/xebia-france/spark-structured-streaming-blog/blob/master/src/main/scala/AvroConsumer.scala

import com.databricks.spark.avro.SchemaConverters
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.SparkSession
import org.apache.kafka.common.serialization.ByteArrayDeserializer

val topic = "topic"
val kafkaUrl = "kafkaUrl"
val schemaRegistryUrl = "schemaRegistryUrl"
val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)


class AvroDeserializer extends AbstractKafkaAvroDeserializer with Serializable {
  def this(client: SchemaRegistryClient) {
    this()
    this.schemaRegistry = client
  }

  override def deserialize(bytes: Array[Byte]): String = {
    // Delegate to the Confluent base class (super, not this, which would
    // recurse into this override forever), then render the record as a string.
    val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
    genericRecord.toString
  }
}

val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)

val avroSchema = schemaRegistryClient.getLatestSchemaMetadata(topic + "-value").getSchema

val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

object DeserializerWrapper extends Serializable {
  val deserializer = kafkaAvroDeserializer
}

spark.udf.register("deserialize", (bytes: Array[Byte]) => DeserializerWrapper.deserializer.deserialize(bytes))

val kafkaDataFrame = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaUrl)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()

kafkaDataFrame.show() // This works fine in the console

val valueDataFrame = kafkaDataFrame.selectExpr("deserialize(value) AS message")

valueDataFrame.show() // Fails with "org.apache.spark.SparkException: Task not serializable" (full trace below)

Below is the full exception trace for your reference.

scala> valueDataFrame.show()
org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:337)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
  ... 49 elided
Caused by: java.io.NotSerializableException: io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
Serialization stack:
    - object not serializable (class: io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient, value: io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient@40199d5e)
    - field (class: $iw, name: schemaRegistryClient, type: class io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient)
    - object (class $iw, $iw@2d579733)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@539c833d)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@7b217a33)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@6b21a869)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@5f849a79)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@73372652)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@4ecd395f)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@15dbd88e)
    - field (class: $line25.$read, name: $iw, type: class $iw)
    - object (class $line25.$read, $line25.$read@234a21e9)
    - field (class: $iw, name: $line25$read, type: class $line25.$read)
    - object (class $iw, $iw@31d635ba)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@1172a648)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@69f7da24)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@53e0c50)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@5c5761f)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@67306a84)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@615a30bd)
    - field (class: $line38.$read, name: $iw, type: class $iw)
    - object (class $line38.$read, $line38.$read@6cc45cf2)
    - field (class: $iw, name: $line38$read, type: class $line38.$read)
    - object (class $iw, $iw@1b4bfdb)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@29a1aca8)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@70ebf6d8)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@9febb7c)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@2f2f6aaa)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@5b7bccc3)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@68fa9450)
    - field (class: $line41.$read, name: $iw, type: class $iw)
    - object (class $line41.$read, $line41.$read@170f2883)
    - field (class: $iw, name: $line41$read, type: class $line41.$read)
    - object (class $iw, $iw@3fa0f38a)
    - field (class: $iw, name: $outer, type: class $iw)
    - object (class $iw, $iw@48754a85)
    - field (class: $anonfun$1, name: $outer, type: class $iw)
    - object (class $anonfun$1, <function1>)
    - element of array (index: 3)
    - array (class [Ljava.lang.Object;, size 4)
    - field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10, name: references$1, type: class [Ljava.lang.Object;)
    - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10, <function2>)
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
  ... 80 more

The error means you cannot share a single CachedSchemaRegistryClient instance across all Spark executors, because it cannot be serialized.

Since you cannot make it serializable (you don't own the class, and serializing it wouldn't make sense anyway, as it probably holds network/IO resources), you have to somehow create one instance on each executor. One common pattern is sketched right after this paragraph.
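A minimal sketch of that pattern (my suggestion, untested against your setup): keep your DeserializerWrapper, but mark its contents @transient lazy so that nothing non-serializable travels with the task closure, and each executor JVM builds its own client on first access. The schemaRegistryUrl value here is the same placeholder as in your question.

object DeserializerWrapper extends Serializable {
  val schemaRegistryUrl = "schemaRegistryUrl" // placeholder, as in the question

  // @transient: excluded from closure serialization, so the client never
  // travels from the driver; lazy: built on first use, once per executor JVM.
  @transient lazy val client: SchemaRegistryClient =
    new CachedSchemaRegistryClient(schemaRegistryUrl, 128)

  @transient lazy val deserializer: AvroDeserializer =
    new AvroDeserializer(client)
}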

How to wire that into your case exactly? To be honest I'm not sure, because you're using the client in a UDF and I don't know its lifecycle; you should look for the way to make a UDF use a non-serializable class. Registering the UDF against the lazy wrapper, as below, is one possibility.
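With that wrapper, registering the UDF the same way you already do should leave nothing non-serializable in the closure, since the lambda only references the object statically:

spark.udf.register("deserialize",
  (bytes: Array[Byte]) => DeserializerWrapper.deserializer.deserialize(bytes))

One caveat: this works cleanly in a compiled application. In spark-shell, every REPL-defined value becomes a field of an interpreter wrapper object, which is exactly the $line25.$read / $iw chain visible in your serialization stack, so the shell may still capture those outer wrappers.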
