
Spark: Task not Serializable for UDF on DataFrame

I get org.apache.spark.SparkException: Task not serializable when I try to execute the following on Spark 1.4.1:

import java.sql.{Date, Timestamp}
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf

object ConversionUtils {
  val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")

  def tsUTC(s: String): Timestamp = new Timestamp(iso8601.parse(s).getTime)

  val castTS = udf[Timestamp, String](tsUTC _)
}

val df = frame.withColumn("ts", ConversionUtils.castTS(frame("ts_str")))
df.first

Here, frame is a DataFrame that lives within a HiveContext. That data frame does not have any issues.

I have similar UDFs for integers and they work without any problem. However, the one with timestamps seems to cause problems. According to the documentation, java.sql.Timestamp implements Serializable, so that's not the problem. The same is true for SimpleDateFormat, as can be seen here.

This leads me to believe it's the UDF that's causing problems. However, I'm not sure what is wrong or how to fix it.
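The root cause is visible in the trace below: eta-expanding tsUTC _ produces a Function1 whose $outer field references the enclosing ConversionUtils object, so Spark's Java serialization has to serialize the whole object, and the object's class is not Serializable. A minimal pure-Scala sketch of the difference (no Spark needed; PlainUtils and FixedUtils are hypothetical stand-ins for illustration):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for ConversionUtils: a plain object, not Serializable.
object PlainUtils {
  def greet(s: String): String = "hi " + s
}

// The same shape, but with the Serializable marker added.
object FixedUtils extends Serializable {
  def greet(s: String): String = "hi " + s
}

object SerDemo {
  // Attempt Java serialization, which is what Spark does when shipping a task.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(canSerialize(PlainUtils)) // false: module class is not Serializable
    println(canSerialize(FixedUtils)) // true
  }
}
```

This mirrors the "object not serializable (class: ...$ConversionUtils$...)" entry at the top of the serialization stack.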

The relevant section of the trace:

Caused by: java.io.NotSerializableException: ...
Serialization stack:
        - object not serializable (class: ..., value: ...$ConversionUtils$@63ed11dd)
        - field (class: ...$ConversionUtils$$anonfun$3, name: $outer, type: class ...$ConversionUtils$)
        - object (class ...$ConversionUtils$$anonfun$3, <function1>)
        - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2, name: func$2, type: interface scala.Function1)
        - object (class org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2, <function1>)
        - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUdf, name: f, type: interface scala.Function1)
        - object (class org.apache.spark.sql.catalyst.expressions.ScalaUdf, scalaUDF(ts_str#2683))
        - field (class: org.apache.spark.sql.catalyst.expressions.Alias, name: child, type: class org.apache.spark.sql.catalyst.expressions.Expression)
        - object (class org.apache.spark.sql.catalyst.expressions.Alias, scalaUDF(ts_str#2683) AS ts#7146)
        - element of array (index: 35)
        - array (class [Ljava.lang.Object;, size 36)
        - field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
        - object (class scala.collection.mutable.ArrayBuffer,

Try:

object ConversionUtils extends Serializable {
  ...
}
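Applied to the code in the question, that means adding only the Serializable marker and keeping everything else as-is. A further variation, sketched below, builds the SimpleDateFormat inside the function instead of sharing one instance; note that this per-call construction is an assumption of mine, not something the answer requires (the val castTS = udf[Timestamp, String](tsUTC _) line stays exactly as in the question):

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

object ConversionUtils extends Serializable {
  // Building the formatter per call also sidesteps SimpleDateFormat's lack of
  // thread safety when the UDF runs concurrently on executor threads.
  def tsUTC(s: String): Timestamp = {
    val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
    new Timestamp(iso8601.parse(s).getTime)
  }
}
```

With extends Serializable in place, the closure's $outer reference can be serialized and the task ships to executors without the NotSerializableException.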
