Task not serializable error while calling udf to spark dataframe

I have a Scala function for encryption. I created a udf out of it and am passing it to one of the columns in my als_embeddings dataframe to get a new column added to my dataframe.

import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

val Algorithm = "AES/CBC/PKCS5Padding"
val Key = new SecretKeySpec(Base64.getDecoder.decode("BiwHeIqzQa8X6MXtdg/hhQ=="), "AES")
val IvSpec = new IvParameterSpec(new Array[Byte](16))

def encrypt(text: String): String = {
  val cipher = Cipher.getInstance(Algorithm)
  cipher.init(Cipher.ENCRYPT_MODE, Key, IvSpec)

  new String(Base64.getEncoder.encode(cipher.doFinal(text.getBytes("utf-8"))), "utf-8")
}


val encryptUDF = udf((uid : String) => encrypt(uid))

Passing the encryptUDF above to my Spark dataframe to create a new column with the encrypted uid:

val als_encrypt_embeddings = als_embeddings.withColumn("encrypt_uid",encryptUDF(col("uid")))
als_encrypt_embeddings.show()

But when I do this it gives me the following error:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable

What am I missing here?

The error message Task not serializable is correct but not very clear. Further down in the stacktrace there is a more detailed explanation of what went wrong:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
[...]
Caused by: java.io.NotSerializableException: javax.crypto.spec.IvParameterSpec
Serialization stack:
    - object not serializable (class: javax.crypto.spec.IvParameterSpec, value: javax.crypto.spec.IvParameterSpec@7d4d65f5)
    - field (class: Starter$$anonfun$1, name: IvSpec$1, type: class javax.crypto.spec.IvParameterSpec)
    - object (class Starter$$anonfun$1, <function1>)
    - element of array (index: 2)
    - array (class [Ljava.lang.Object;, size 3)
    - field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13, name: references$1, type: class [Ljava.lang.Object;)
    - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13, <function2>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
    ... 48 more

In the Caused by part of the stacktrace, Spark reports that it was not able to serialize an instance of javax.crypto.spec.IvParameterSpec.

The IvParameterSpec has been created within the driver JVM, while the udf is executed on one of the executors. Therefore the object has to be serialized in order to move it to the executor's JVM. As the object is not serializable, the attempt to move it fails.

The easiest way to fix the problem is to create the objects needed for the encryption directly within the executor's JVM by moving the code block into the udf's closure:

val encryptUDF = udf((uid : String) => {
  val Algorithm = "AES/CBC/PKCS5Padding"
  val Key = new SecretKeySpec(Base64.getDecoder.decode("BiwHeIqzQa8X6MXtdg/hhQ=="), "AES")
  val IvSpec = new IvParameterSpec(new Array[Byte](16))

  def encrypt(text: String): String = {
    val cipher = Cipher.getInstance(Algorithm)
    cipher.init(Cipher.ENCRYPT_MODE, Key, IvSpec)

    new String(Base64.getEncoder.encode(cipher.doFinal(text.getBytes("utf-8"))), "utf-8")
  }
  encrypt(uid)
})

This way all objects are created directly within the executor's JVM.

The downside of this approach is that one set of encryption objects is created per invocation of the udf. This might cause performance problems if the instantiation of these objects is expensive. One option would be to use mapPartitions instead of a udf. In this answer, mapPartitions is used to avoid creating too many expensive database connections while iterating over a dataframe. The same approach could be used here, as sketched below.
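
As an illustration (not part of the original answer), here is a sketch of what the mapPartitions variant could look like. It assumes spark is the active SparkSession and that als_embeddings has a string column uid, and it reuses the key, IV, and algorithm from the question. The expensive objects are created once per partition on the executor; since doFinal resets the cipher to its initialized state, the same instance can be reused for every row:

import spark.implicits._

val als_encrypt_embeddings = als_embeddings
  .select($"uid".as[String])
  .mapPartitions { uids =>
    // created once per partition, directly on the executor
    val key = new SecretKeySpec(Base64.getDecoder.decode("BiwHeIqzQa8X6MXtdg/hhQ=="), "AES")
    val ivSpec = new IvParameterSpec(new Array[Byte](16))
    val cipher = Cipher.getInstance("AES/CBC/PKCS5Padding")
    cipher.init(Cipher.ENCRYPT_MODE, key, ivSpec)
    uids.map { uid =>
      // doFinal resets the cipher to its initialized state,
      // so the same instance can encrypt every row in the partition
      (uid, new String(Base64.getEncoder.encode(cipher.doFinal(uid.getBytes("utf-8"))), "utf-8"))
    }
  }
  .toDF("uid", "encrypt_uid")

Note that unlike withColumn, this keeps only the uid column; the result would have to be joined back onto als_embeddings if the other columns are needed.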

We can also define the function as part of a standalone object. The udf's closure then contains only a call into that object, so it no longer captures any unserializable values; the object itself is initialized independently inside each executor JVM.

object EncryptUtils extends Serializable {
  val Algorithm = "AES/CBC/PKCS5Padding"
  val Key = new SecretKeySpec(Base64.getDecoder.decode("BiwHeIqzQa8X6MXtdg/hhQ=="), "AES")
  val IvSpec = new IvParameterSpec(new Array[Byte](16))
  def encrypt(text: String): String = {
    val cipher = Cipher.getInstance(Algorithm)
    cipher.init(Cipher.ENCRYPT_MODE, Key, IvSpec)
    new String(Base64.getEncoder.encode(cipher.doFinal(text.getBytes("utf-8"))), "utf-8")
  }
}
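
With that in place, the udf can be used as before (a hedged sketch mirroring the question's code; the lambda only makes a static call into EncryptUtils, so its closure captures nothing non-serializable):

// the closure captures no fields; EncryptUtils is (re)initialized on each executor
val encryptUDF = udf((uid: String) => EncryptUtils.encrypt(uid))
val als_encrypt_embeddings = als_embeddings.withColumn("encrypt_uid", encryptUDF(col("uid")))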
