KernelDensity serialization error in Spark

Question

I have recently been using the KernelDensity class in Spark, and I tried to serialize it to disk on Windows 10. Here is my code:

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.stat.KernelDensity

// read sample from disk (spark is an existing SparkSession)
val sample = spark.read.option("inferSchema", "true").csv("D:\\sample")
val trainX = sample.select("_c1").rdd.map(r => r.getDouble(0))

val kd = new KernelDensity().setSample(trainX).setBandwidth(1)
// serialization
val oos = new ObjectOutputStream(new FileOutputStream("a.obj"))
oos.writeObject(kd)
oos.close()
// deserialization
val ios = new ObjectInputStream(new FileInputStream("a.obj"))
val kd1 = ios.readObject.asInstanceOf[KernelDensity]
ios.close()
// the error occurs when I call estimate
kd1.estimate(Array(1.0, 2.0, 3.0))

Exception in thread "main" org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases: 
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
    at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:90)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1117)
    at org.apache.spark.mllib.stat.KernelDensity.estimate(KernelDensity.scala:92)
    at KernelDensityConstruction$.main(KernelDensityConstruction.scala:35)
    at KernelDensityConstruction.main(KernelDensityConstruction.scala)
20/05/10 22:05:42 INFO SparkContext: Invoking stop() from shutdown hook

Why doesn't this work? If I skip the serialization step, it works fine.

Answer 1

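This happens because KernelDensity, despite being formally Serializable, is not designed to be fully compatible with standard serialization tools.

Internally it holds a reference to the sample RDD, which in turn depends on the corresponding SparkContext. In other words, it is a distributed tool, designed to be used within the scope of a single active session.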
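The "This RDD lacks a SparkContext" message is consistent with the RDD's context reference not surviving Java serialization. The following is a minimal sketch of that mechanism only, using hypothetical stand-in classes (`FakeContext` and `FakeDataset` are illustrations, not Spark types): a field marked `@transient` is skipped when the object is written, so it comes back as `null`.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: deliberately not Serializable.
class FakeContext(val name: String)

// Hypothetical stand-in for an RDD: keeps its context in a transient field.
class FakeDataset(@transient val ctx: FakeContext, val id: Int) extends Serializable

object TransientDemo {
  // Serialize a value to an in-memory buffer and read it back.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val restored = roundTrip(new FakeDataset(new FakeContext("live"), 1))
    println(restored.id)          // prints 1: ordinary fields survive
    println(restored.ctx == null) // prints true: the transient field is gone
  }
}
```

The restored object looks intact until something needs the missing context, which matches the question: deserialization succeeds and the failure only appears when `estimate` is called.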

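Given that:

It doesn't perform any computation until estimate is called.
It requires the sample RDD to evaluate estimate.

there is little point in serializing it in the first place: you can simply recreate the object in the new context from the desired parameters.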

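However, if you really want to serialize the whole thing, you should create a wrapper that serializes both the parameters and the corresponding RDD (similar to how ML models that use distributed data structures, such as ALS, work) and loads them back in a new session.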
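A rough sketch of such a wrapper, under stated assumptions: `KernelDensityStore` and its `save`/`load` methods are hypothetical names, not part of Spark, and the on-disk layout (two object files under one directory) is just one possible choice. It stores the sample RDD and the bandwidth separately, then rebuilds a fresh KernelDensity against whatever session is active at load time.

```scala
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper (not a Spark API): persists what is needed to rebuild a KernelDensity.
object KernelDensityStore {
  // Write the sample and the bandwidth as object files under `path`.
  // The target directories must not exist yet.
  def save(sample: RDD[Double], bandwidth: Double, path: String): Unit = {
    sample.saveAsObjectFile(s"$path/sample")
    sample.sparkContext.parallelize(Seq(bandwidth), 1).saveAsObjectFile(s"$path/bandwidth")
  }

  // Read both pieces back with the current session and rebuild the estimator.
  def load(spark: SparkSession, path: String): KernelDensity = {
    val sc = spark.sparkContext
    val sample = sc.objectFile[Double](s"$path/sample")
    val bandwidth = sc.objectFile[Double](s"$path/bandwidth").first()
    new KernelDensity().setSample(sample).setBandwidth(bandwidth)
  }
}
```

With the question's variables, usage would look like `KernelDensityStore.save(trainX, 1.0, "kd-model")` in one session and `KernelDensityStore.load(spark, "kd-model").estimate(Array(1.0, 2.0, 3.0))` in another. Because the RDD is re-read through the new SparkContext, the reloaded estimator has a live context to run on.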
