Apache Spark object serialization with Java

Question

I'm trying to prepare a library (written in Java) to run on Apache Spark. Since the library has hundreds of classes and is still in active development, I do not want to make each of them serializable one by one. Instead I searched for another method and found this, but again it does not resolve the serialization issue.

Here is the code sample:

    List<Integer> data = Arrays.asList(1,2,3,4,5,6);
    JavaRDD<Integer> distData = sc.parallelize(data);
    JavaRDD<Year4D> years = distData.map(y -> func.call(y));
    List<Year4D> years1 = years.collect();

where func is a Function that generates a 4-digit year using Year4D:

    private static Function<Integer, Year4D> func = new Function<Integer, Year4D>() {
        public Year4D call(Integer arg0) throws Exception {
            return new Year4D(arg0);
        }
    };

and Year4D does not implement Serializable:

    public class Year4D {
        private int year = 0;

        public Year4D(int year) {
            // Two-digit (and three-digit) inputs are expanded: below 70 means 20xx, otherwise 19xx
            if (year < 1000) year += (year < 70) ? 2000 : 1900;
            this.year = year;
        }

        public String toString() {
            return "Year4D [year=" + year + "]";
        }
    }

This produces an "object not serializable" exception for Year4D:

Job aborted due to stage failure: Task 6.0 in stage 0.0 (TID 6) had a not serializable result...

By the way, if I replace the action collect() with foreach(func), it works.

So my question is: why does collect() not work?

And if this approach is not good, what is the best practice for handling a Java library that contains this many complex classes?

PS. @Tzach said that Year4D isn't wrapped correctly, so it is not actually serialized. What is the correct implementation, then?

Answer 1

Solution 1 (which you will not use, since it is easier to modify each of the classes by making them implement Serializable): create wrapper classes that implement Serializable and define their own writeObject and readObject methods:

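    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    public class Year4DWraper implements Serializable {

        // transient: Year4D itself is not Serializable, so it is written by hand below
        private transient Year4D year4D;

        public Year4DWraper(Year4D year4D) {
            this.year4D = year4D;
        }

        public Year4D getYear4D() {
            return year4D;
        }

        // Assumes Year4D exposes a getYear() accessor
        private void writeObject(ObjectOutputStream os) throws IOException {
            os.writeInt(year4D.getYear());
        }

        private void readObject(ObjectInputStream is)
                throws IOException, ClassNotFoundException {
            year4D = new Year4D(is.readInt());
        }
    }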
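To check that a wrapper like this really survives serialization, a small round trip through plain Java serialization can help. The following is a minimal sketch, not part of the original answer: the nested Year4D is a stand-in for the library class, its getYear() accessor is an assumption (the question's Year4D does not show one), and WrapperRoundTrip and roundTrip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class WrapperRoundTrip {

    // Stand-in for the library class; getYear() is an assumed accessor.
    static class Year4D {
        private int year;

        Year4D(int year) {
            if (year < 1000) year += (year < 70) ? 2000 : 1900;
            this.year = year;
        }

        int getYear() {
            return year;
        }
    }

    // Same shape as the wrapper in Solution 1.
    static class Year4DWraper implements Serializable {
        private static final long serialVersionUID = 1L;

        private transient Year4D year4D;

        Year4DWraper(Year4D year4D) {
            this.year4D = year4D;
        }

        Year4D getYear4D() {
            return year4D;
        }

        private void writeObject(ObjectOutputStream os) throws IOException {
            os.writeInt(year4D.getYear());
        }

        private void readObject(ObjectInputStream is)
                throws IOException, ClassNotFoundException {
            year4D = new Year4D(is.readInt());
        }
    }

    // Serializes the wrapper to a byte array and reads it back.
    static Year4DWraper roundTrip(Year4DWraper wrapper) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(wrapper);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Year4DWraper) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Year4DWraper copy = roundTrip(new Year4DWraper(new Year4D(5)));
        System.out.println(copy.getYear4D().getYear()); // prints 2005
    }
}
```

In the Spark job, the same idea would presumably be to wrap inside the map, e.g. `distData.map(y -> new Year4DWraper(func.call(y)))`, and to call getYear4D() on each element after collect() on the driver.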

Solution 2: Use Kryo to do the serialization/deserialization for you:

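    SparkConf conf = new SparkConf();
    // Switch from the default Java serializer to Kryo; the registrator setting alone does not do this
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    conf.set("spark.kryo.registrator", "org.apache.spark.examples.MyRegistrator");
    ...

    import com.esotericsoftware.kryo.Kryo;
    import org.apache.spark.serializer.KryoRegistrator;

    public class MyRegistrator implements KryoRegistrator {
        @Override
        public void registerClasses(Kryo kryo) {
            kryo.register(Year4D.class);
        }
    }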

It is advised that the classes contain a no-arg constructor.

By default, most classes will end up using FieldSerializer. It essentially does what hand-written serialization would, but does it automatically. FieldSerializer does direct assignment to the object's fields. If the fields are public, protected, or default access (package private) and not marked as final, bytecode generation is used for maximum speed (see ReflectASM). For private fields, setAccessible and cached reflection are used, which is still quite fast.

If you are unhappy with the serializers Kryo provides by default, or you have complex classes, you can always define your own.
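To make the description above a bit more concrete, here is a rough sketch of field-by-field copying with setAccessible in plain Java. It is only an illustration of the idea, not Kryo's actual implementation: FieldCopySketch, capture and restore are hypothetical names, the Map stands in for the byte stream, and the private no-arg constructor on the stand-in Year4D is an assumption that mirrors the no-arg constructor advice.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

class FieldCopySketch {

    // Stand-in class with a private field and an assumed private no-arg constructor.
    static class Year4D {
        private int year;

        private Year4D() {
        }

        Year4D(int year) {
            if (year < 1000) year += (year < 70) ? 2000 : 1900;
            this.year = year;
        }

        int getYear() {
            return year;
        }
    }

    // "Write" side: read every instance field reflectively, including private ones.
    static Map<String, Object> capture(Object source) {
        Map<String, Object> values = new LinkedHashMap<>();
        try {
            for (Field field : source.getClass().getDeclaredFields()) {
                int mods = field.getModifiers();
                if (Modifier.isStatic(mods) || Modifier.isTransient(mods)) {
                    continue;
                }
                field.setAccessible(true);
                values.put(field.getName(), field.get(source));
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return values;
    }

    // "Read" side: create an empty instance via the no-arg constructor,
    // then assign the captured values directly to its fields.
    static <T> T restore(Class<T> type, Map<String, Object> values) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T target = ctor.newInstance();
            for (Map.Entry<String, Object> entry : values.entrySet()) {
                Field field = type.getDeclaredField(entry.getKey());
                field.setAccessible(true);
                field.set(target, entry.getValue());
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Year4D copy = restore(Year4D.class, capture(new Year4D(5)));
        System.out.println(copy.getYear()); // prints 2005
    }
}
```

Note that the class itself never has to implement Serializable here, which is the property that makes this style of serializer attractive for a large library you do not want to touch.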

Answer 2

First, foreach() works because it iterates over each partition locally, so it doesn't have to send the data from one node to another or to the driver, and no Year4D has to be serialized.

If you follow the map transformation (which creates the Year4D objects) with any action or transformation that requires a shuffle (e.g. groupByKey), or that requires sending the data back to the driver (like collect), then the data must be serialized (how else would it be shared across separate Java processes?).

Now, since there's very little you can do without shuffles or collecting the data, most likely you don't really have a choice: your data must be serializable.
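The failure can be reproduced without a cluster. Spark's default serializer is plain Java serialization, so the sketch below simply asks ObjectOutputStream whether it accepts an object. The names SerializabilityCheck, canSerialize and SerializableYear4D are made up for illustration, and the nested Year4D only mirrors the question's class.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializabilityCheck {

    // Mirrors the question's class: no Serializable marker.
    static class Year4D {
        private final int year;

        Year4D(int year) {
            if (year < 1000) year += (year < 70) ? 2000 : 1900;
            this.year = year;
        }
    }

    // Hypothetical variant whose only difference is the marker interface.
    static class SerializableYear4D implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int year;

        SerializableYear4D(int year) {
            if (year < 1000) year += (year < 70) ? 2000 : 1900;
            this.year = year;
        }
    }

    // Returns true if Java serialization accepts the object,
    // false if it rejects it with NotSerializableException.
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new Year4D(5)));             // prints false
        System.out.println(canSerialize(new SerializableYear4D(5))); // prints true
    }
}
```

A check like this could be run over the library's result types before a job is submitted, to find out which classes would trip collect() or a shuffle.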
