Spark: serializing an object with a non-serializable member

Question

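I'm asking this in the context of Spark because that is where I ran into it, but it may well be a plain Java question.

In our Spark job we have a Resolver that needs to be available on all workers (it is used inside a udf). The problem is that it is not serializable, and we cannot change it to be serializable. Our solution was to make it a member of another class that is serializable.

So we ended up with: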

import java.io.Serializable;

public class Analyzer implements Serializable {
    // Resolver is not Serializable, so it is excluded from serialization.
    transient Resolver resolver;

    public Analyzer() {
        System.out.println("Initializing a Resolver...");
        resolver = new Resolver();
    }

    public int resolve(String key) {
        return resolver.find(key);
    }
}

We then broadcast an instance of this class using the Spark API:

 val analyzer = sparkContext.broadcast(new Analyzer())

(See the Spark documentation on broadcast variables for more information.)

Then, as part of the Spark code, we use analyzer inside a UDF like this:

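import org.apache.spark.sql.functions.{col, udf}

val resolve = udf((key: String) => analyzer.value.resolve(key))
// A UDF is applied to Column arguments, so the column is referenced with col("key").
val result = myDataFrame.select(col("key"), resolve(col("key"))).count()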

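All of this works as expected, but it left us wondering.

Resolver does not implement Serializable and is therefore marked transient, meaning it is not serialized along with its owner object, Analyzer.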
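For comparison, here is a minimal sketch of the behaviour the question expects, using plain java.io serialization instead of Kryo. PlainResolver, PlainAnalyzer and TransientDemo are hypothetical names made up for this illustration; they only stand in for the Resolver and Analyzer above. With Java's built-in mechanism the constructor of a Serializable class is not run during deserialization, so the transient field comes back as null:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for the non-serializable Resolver.
class PlainResolver {
    int find(String key) {
        return key.length();
    }
}

// Same shape as Analyzer: a Serializable owner with a transient member.
class PlainAnalyzer implements Serializable {
    private static final long serialVersionUID = 1L;

    transient PlainResolver resolver;
    String label = "analyzer";

    PlainAnalyzer() {
        resolver = new PlainResolver();
    }
}

class TransientDemo {
    // Serializes and deserializes with java.io, wrapping checked exceptions.
    static PlainAnalyzer roundTrip(PlainAnalyzer original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (PlainAnalyzer) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PlainAnalyzer copy = roundTrip(new PlainAnalyzer());
        System.out.println("label = " + copy.label);
        System.out.println("resolver is null: " + (copy.resolver == null));
    }
}
```

The non-transient label field survives the round trip while resolver does not. If the job went through this path, resolve() would presumably fail with a NullPointerException on the executors.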

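But the resolve() method of Analyzer clearly uses resolver, so it cannot be null. And indeed it isn't. The code works.

So if this field is not passed through serialization, how does the resolver member get instantiated?

My initial thought was that the Analyzer constructor might be called on the receiving side (i.e. on the Spark worker), but then I would expect to see the line "Initializing a Resolver..." printed several times. It was printed only once, which probably indicates that the constructor was called only once, before the object was passed to the broadcast API. So why isn't resolver null?

Am I missing something about JVM serialization or Spark serialization?

How does this code even work?

Spark runs on YARN in cluster mode. spark.serializer is set to org.apache.spark.serializer.KryoSerializer.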

Answer 1

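"So if this field is not passed through serialization, how does the resolver member get instantiated?"

It is instantiated through a constructor call (new Resolver) when kryo.readObject is called. In other words, deserialization invokes the Analyzer constructor, which in turn creates the Resolver: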

kryo.readClassAndObject(input).asInstanceOf[T]
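To make the idea concrete, here is a simplified model of a field-based serializer that creates the target object through its no-arg constructor and then copies only the non-transient fields. It does not use Kryo at all: SketchResolver, SketchAnalyzer and copyViaConstructor are hypothetical names for this sketch, and whether Kryo takes the constructor path for a given class depends on how its instantiator is configured.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical stand-in for the non-serializable Resolver.
class SketchResolver {
    int find(String key) {
        return key.length();
    }
}

class SketchAnalyzer {
    static int constructorCalls = 0;

    transient SketchResolver resolver;
    String label;

    SketchAnalyzer() {
        constructorCalls++;
        resolver = new SketchResolver();
    }

    int resolve(String key) {
        return resolver.find(key);
    }
}

class ConstructorCopyDemo {
    // Models "deserialization": instantiate via the no-arg constructor,
    // then copy every field that is neither transient nor static.
    static <T> T copyViaConstructor(T source) {
        try {
            @SuppressWarnings("unchecked")
            Class<T> type = (Class<T>) source.getClass();
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T target = ctor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                int mods = field.getModifiers();
                if (Modifier.isTransient(mods) || Modifier.isStatic(mods)) {
                    continue;
                }
                field.setAccessible(true);
                field.set(target, field.get(source));
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchAnalyzer original = new SketchAnalyzer();
        original.label = "driver-side";
        SketchAnalyzer copy = copyViaConstructor(original);
        // The transient field was never copied, yet it is populated by the constructor.
        System.out.println("resolver is null: " + (copy.resolver == null));
        System.out.println("constructor calls: " + SketchAnalyzer.constructorCalls);
    }
}
```

The copy ends up with its own resolver because the constructor ran a second time, not because the field travelled with the object.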

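"My initial thought was that the Analyzer constructor might be called on the receiving side (i.e. on the Spark worker), but then I would expect to see the line "Initializing a Resolver..." printed several times. It was printed only once, which probably indicates that it was called only once"

That is not how broadcast variables work. When an executor needs a broadcast variable that is in scope, it first checks whether the object is already in memory in its own BlockManager. If it is not, the executor asks either the driver or neighbouring executors (if there are several) for their cached instance. They serialize it and send it over, and the executor receives the instance and caches it in its own BlockManager. The object is therefore deserialized once per executor, not once per task or per row. A message printed by a constructor that runs during deserialization goes to that executor's standard output, so it would likely show up in the executor logs and not in the driver output.

This is documented in the description of TorrentBroadcast, which is the default broadcast implementation: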

* The driver divides the serialized object into small chunks and
* stores those chunks in the BlockManager of the driver.
*
* On each executor, the executor first attempts to fetch the object from its BlockManager. If
* it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
* other executors if available. Once it gets the chunks, it puts the chunks in its own
* BlockManager, ready for other executors to fetch from.
*
* This prevents the driver from being the bottleneck in sending out multiple copies of the
* broadcast data (one per executor).
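The fetch-once-then-cache behaviour described in that comment can be sketched roughly as follows. This is only a toy model, not Spark code: BroadcastCacheSketch, the executor ids and the Supplier standing in for "fetch the chunks and deserialize" are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Toy model of a broadcast value that is materialized once per executor.
class BroadcastCacheSketch<T> {
    // Stands in for "fetch the chunks and deserialize the object".
    private final Supplier<T> deserialize;
    // Stands in for each executor's local BlockManager.
    private final Map<String, T> perExecutorCache = new ConcurrentHashMap<>();
    private final AtomicInteger deserializations = new AtomicInteger();

    BroadcastCacheSketch(Supplier<T> deserialize) {
        this.deserialize = deserialize;
    }

    // Returns the cached instance for the executor, creating it on first use.
    T value(String executorId) {
        return perExecutorCache.computeIfAbsent(executorId, id -> {
            deserializations.incrementAndGet();
            return deserialize.get();
        });
    }

    int deserializationCount() {
        return deserializations.get();
    }

    public static void main(String[] args) {
        BroadcastCacheSketch<Object> broadcast = new BroadcastCacheSketch<>(Object::new);
        // Six tasks spread over two executors.
        for (int task = 0; task < 6; task++) {
            broadcast.value("executor-" + (task % 2));
        }
        System.out.println("deserializations: " + broadcast.deserializationCount());
    }
}
```

In this model the number of deserializations tracks the number of executors, not the number of tasks, which matches the intent of the TorrentBroadcast comment.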

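"If we remove the transient, it fails, and the stack trace leads to Kryo"

That is because your Resolver class probably contains a field that even Kryo cannot serialize, regardless of whether the class is marked Serializable.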
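If you would rather not depend on the serializer calling the constructor, one common alternative is to keep the field transient and create it lazily on first use. The sketch below is not from the question: LazyResolver, LazyAnalyzer and LazyInitDemo are hypothetical names, and plain java.io serialization is used only because it is the case where the constructor is definitely skipped.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for the non-serializable Resolver.
class LazyResolver {
    int find(String key) {
        return key.length();
    }
}

class LazyAnalyzer implements Serializable {
    private static final long serialVersionUID = 1L;

    private transient LazyResolver resolver;

    // Creates the resolver on first use; synchronized because several
    // task threads on one executor may share the same instance.
    private synchronized LazyResolver resolver() {
        if (resolver == null) {
            resolver = new LazyResolver();
        }
        return resolver;
    }

    int resolve(String key) {
        return resolver().find(key);
    }

    synchronized boolean resolverInitialized() {
        return resolver != null;
    }
}

class LazyInitDemo {
    // Serializes and deserializes with java.io, wrapping checked exceptions.
    static LazyAnalyzer roundTrip(LazyAnalyzer original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (LazyAnalyzer) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LazyAnalyzer copy = roundTrip(new LazyAnalyzer());
        System.out.println("initialized before use: " + copy.resolverInitialized());
        System.out.println("resolve(\"key\") = " + copy.resolve("key"));
        System.out.println("initialized after use: " + copy.resolverInitialized());
    }
}
```

With this shape the resolver is built on whichever JVM first calls resolve(), so the result no longer depends on how the serializer instantiates the owner object.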

Answer 1: 3, 2018-01-22 15:38:18
