org.apache.spark.SparkException: Task not serializable while writing stream to blob store
I have gone through many similar posts, but I cannot understand the cause here. The whole code was working; I only added the following code later:
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

def toJson(value: Map[Symbol, Any]): String = {
  toJson(value map { case (k, v) => k.name -> v })
}

def toJson(value: Any): String = {
  mapper.writeValueAsString(value)
}

def toMap[V](json: String)(implicit m: Manifest[V]): Map[String, Any] = fromJson[Map[String, Any]](json)

def fromJson[T](json: String)(implicit m: Manifest[T]): T = {
  mapper.readValue[T](json)
}

val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
Now, when I execute the following write-stream cell in my notebook:
data.writeStream
  .option("checkpointLocation", _checkpointLocation)
  .format("avro")
  .partitionBy("Date", "Hour")
  .option("path", _containerPath)
  .start()
I get this error:
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
Caused by: org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
Caused by: java.io.NotSerializableException: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer
Serialization stack:
- object not serializable (class: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer, value: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer@660424b3)
- field (class: com.fasterxml.jackson.module.paranamer.ParanamerAnnotationIntrospector, name: _paranamer, type: interface com.fasterxml.jackson.module.paranamer.shaded.Paranamer)
Can anyone help me understand what might be going wrong here? Thanks!
This is the culprit:
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

def toJson(value: Map[Symbol, Any]): String = {
  toJson(value map { case (k, v) => k.name -> v })
}

def toJson(value: Any): String = {
  mapper.writeValueAsString(value)
}

def toMap[V](json: String)(implicit m: Manifest[V]): Map[String, Any] = fromJson[Map[String, Any]](json)

def fromJson[T](json: String)(implicit m: Manifest[T]): T = {
  mapper.readValue[T](json)
}

val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
This means your JSON parser is not serializable. Try making your JSON parser class/object serializable. Either switch to Gson, or declare the parser as serializable:

class JsonParser extends Serializable

That would be the solution.
Take a look at how the task serialization happens here: org.apache.spark.SparkException: Task not serializable
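Another common workaround, not spelled out in this answer, is to mark the mapper `@transient lazy val` inside a serializable holder, so the mapper is never shipped with the closure and each executor rebuilds its own copy on first use. A minimal sketch, assuming the same Jackson imports as the question (the `JsonUtil` name is hypothetical):

```scala
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// Hypothetical helper object: @transient excludes the mapper from
// serialization, lazy rebuilds it on each JVM that first touches it.
object JsonUtil extends Serializable {
  @transient lazy val mapper: ObjectMapper with ScalaObjectMapper = {
    val m = new ObjectMapper() with ScalaObjectMapper
    m.registerModule(DefaultScalaModule)
    m.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
    m
  }

  def fromJson[T](json: String)(implicit m: Manifest[T]): T =
    mapper.readValue[T](json)
}
```

With this, closures can call `JsonUtil.fromJson[...]` directly; only the (empty) object is serialized, never the non-serializable Jackson internals.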
This answer may be too late for you, but hopefully it helps someone else.

You don't have to give up and switch to Gson. I prefer the Jackson parser, since it is what Spark's spark.read.json() uses under the hood, so it doesn't require us to pull in an external tool. It looks like we are both using (com.fasterxml.jackson.module:jackson-module-jsonSchema:2.9.6), which allows direct deserialization of JSON in Scala.

Anyway, there are actually a few different ways around this particular limitation. I prefer to keep Kryo serialization, so my suggestion is to wrap the mapper in a class. I usually access it as an automatically broadcast variable and instantiate it inside a mapPartitions over the DataFrame's underlying RDD. If I'm doing on-the-fly stream transformations, I use foreachBatch. My answer is given in non-streaming form to make it more useful to others.
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

class FastJson extends Serializable {
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

  def toJson(value: Map[Symbol, Any]): String = {
    toJson(value map { case (k, v) => k.name -> v })
  }

  def toJson(value: Any): String = {
    mapper.writeValueAsString(value)
  }

  def toMap[V](json: String)(implicit m: Manifest[V]) = fromJson[Map[String, V]](json)

  def fromJson[T](json: String)(implicit m: Manifest[T]): T = {
    mapper.readValue[T](json)
  }
}
val sample_records = Array("""{"id":"1c589374-4a1f-11ea-976f-02423e9c355b","offset":"1004468136126","occurred":"2020-02-08T02:59:11.546Z","processed":"2020-02-08T03:00:07.078Z","device":{"that":"57bee599-8fa5-4e98-af88-17367b2dc327"}}""",
"""{"id" : "8d4de2f2-fdf8-11ea-a111-02424cf4aed7","offset" : "1005930726202","occurred":"2020-09-23T23:57:27.211Z","processed":"2020-09-23T23:57:35.390Z","device" : {"this":"123456"}}""")
val sample_df = spark.sparkContext.parallelize(sample_records).toDF
sample_df.printSchema
println(s"count : ${sample_df.count()}")
sample_records: Array[String] = Array({"id":"1c589374-4a1f-11ea-976f-02423e9c355b","offset":"1004468136126","occurred":"2020-02-08T02:59:11.546Z","processed":"2020-02-08T03:00:07.078Z","device":{"that":"57bee599-8fa5-4e98-af88-17367b2dc327"}}, {"id" : "8d4de2f2-fdf8-11ea-a111-02424cf4aed7","offset" : "1005930726202","occurred":"2020-09-23T23:57:27.211Z","processed":"2020-09-23T23:57:35.390Z","device" : {"this":"123456"}})
sample_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: string]
root
 |-- value: string (nullable = true)
count : 2
val json_action_df = sample_df.rdd.mapPartitions(partition => {
  val fastJson = new FastJson
  val partition_results = partition.map(row => {
    val stringified_row = row.getAs[String]("value")
    val swapJsonMap = fastJson.fromJson[Map[String, Any]](stringified_row)
    val id_field = swapJsonMap.get("id").getOrElse("")
    id_field.asInstanceOf[String]
  })
  partition_results.toIterator
}).toDF("id")
json_action_df.show(10, false)
+------------------------------------+
|id                                  |
+------------------------------------+
|1c589374-4a1f-11ea-976f-02423e9c355b|
|8d4de2f2-fdf8-11ea-a111-02424cf4aed7|
+------------------------------------+
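For the streaming case mentioned above, the same per-partition pattern can be moved inside foreachBatch. A sketch only, assuming the FastJson class defined above, the question's streaming DataFrame `data` with a string `value` column, the question's `_checkpointLocation`/`_containerPath` values, and `spark.implicits._` in scope for `toDF`:

```scala
// Sketch: each micro-batch is a plain DataFrame, so the non-streaming
// mapPartitions approach applies unchanged inside foreachBatch.
data.writeStream
  .option("checkpointLocation", _checkpointLocation)
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    val ids = batch.rdd.mapPartitions { partition =>
      val fastJson = new FastJson // built per partition, never serialized from the driver
      partition.map { row =>
        fastJson.fromJson[Map[String, Any]](row.getAs[String]("value"))
          .getOrElse("id", "").asInstanceOf[String]
      }
    }.toDF("id")
    ids.write.mode("append").format("avro").save(_containerPath)
  }
  .start()
```

Because FastJson is instantiated inside the partition closure, only the class definition (which is Serializable) crosses the wire; the Jackson mapper itself is created fresh on each executor.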