
What is the difference between Spark Serialization and Java Serialization?

I'm using Spark + Yarn and I have a service that I want to call on distributed nodes.

When I serialize this service object "by hand" in a JUnit test using Java serialization, all inner collections of the service are serialized and deserialized correctly:

  @Test
  public void testSerialization() {  

    try (
        ConfigurableApplicationContext contextBusiness = new ClassPathXmlApplicationContext("spring-context.xml");
        FileOutputStream fileOutputStream = new FileOutputStream("myService.ser");
        ObjectOutputStream objectOutputStream = new ObjectOutputStream(fileOutputStream);
        ) {

      final MyService service = (MyService) contextBusiness.getBean("myServiceImpl");

      objectOutputStream.writeObject(service);
      objectOutputStream.flush();

    } catch (final java.io.IOException e) {
      logger.error(e.getMessage(), e);
    }
  }

  @Test
  public void testDeSerialization() throws ClassNotFoundException {  

    try (
        FileInputStream fileInputStream = new FileInputStream("myService.ser");
        ObjectInputStream objectInputStream = new ObjectInputStream(fileInputStream);
        ) {

      final MyService myService = (MyService) objectInputStream.readObject();

      // HERE a functional test that proves the service has been fully serialized and deserialized.

    } catch (final java.io.IOException e) {
      logger.error(e.getMessage(), e);
    }
  }  

But when I try to call this service via my Spark launcher, whether I broadcast the service object or not, some inner collection (a HashMap) disappears (is not serialized), as if it were tagged as "transient" (but it is neither transient nor static):

JavaRDD<InputOjbect> listeInputsRDD = sprkCtx.parallelize(listeInputs, 10);
JavaRDD<OutputObject> listeOutputsRDD = listeInputsRDD.map(new Function<InputOjbect, OutputObject>() {
  private static final long serialVersionUID = 1L;

  public OutputObject call(InputOjbect input) throws TarificationXmlException { // Exception

    MyOutput output = service.evaluate(input);
    return (new OutputObject(output));
  }
});

Same result if I broadcast the service:

final Broadcast<MyService> broadcastedService = sprkCtx.broadcast(service);
JavaRDD<InputOjbect> listeInputsRDD = sprkCtx.parallelize(listeInputs, 10);
JavaRDD<OutputObject> listeOutputsRDD = listeInputsRDD.map(new Function<InputOjbect, OutputObject>() {
  private static final long serialVersionUID = 1L;

  public OutputObject call(InputOjbect input) throws TarificationXmlException { // Exception

    MyOutput output = broadcastedService.getValue().evaluate(input);
    return (new OutputObject(output));
  }
});

If I launch this same Spark code in local mode instead of yarn cluster mode, it works perfectly.

So my question is: What is the difference between Spark Serialization and Java Serialization? (I'm not using Kryo or any customized serialization.)

EDIT: when I try with the Kryo serializer (without explicitly registering any class), I have the same problem.
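
For reference, a minimal sketch of how the Kryo attempt was configured. This assumes the standard SparkConf switch (the original launcher code is not shown); the app name is illustrative, and registering the service classes explicitly would be an optional extra step:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("my-service-job") // illustrative name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// Optional: conf.registerKryoClasses(new Class<?>[] { MyService.class });
JavaSparkContext sprkCtx = new JavaSparkContext(conf);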

OK, I've figured it out, thanks to one of our experienced data analysts.

So, what was this mystery about?

  • It was NOT about serialization (Java or Kryo)
  • It was NOT about some pre-processing or post-processing Spark would do before/after serialization
  • It was NOT about the HashMap field, which is fully serializable (this one is obvious if you read the first example I gave, but not for everyone ;)

So...

The whole problem was this:

"if I launch this same Spark code in local mode instead of yarn cluster mode, it works perfectly." “如果我在本地模式而不是纱线簇模式下启动相同的Spark代码,则效果很好。”

In "yarn cluster" mode the collection was unable to be initialized, cause it was launched on a random node and couldn't access to, the initial reference datas on disk. 在“纱线群集”模式下,无法初始化集合,因为它是在随机节点上启动的,并且无法访问磁盘上的初始参考数据。 In local mode, there was a clear exception when the initial datas where not found on disk, but in cluster mode it was fully silent and it looked like the problem was about serialization. 在本地模式下,当没有在磁盘上找到初始数据时有一个明显的例外,但是在群集模式下,它是完全静默的,看起来问题出在序列化上。

Using "yarn client" mode solved this for us. 使用“纱线客户端”模式为我们解决了这一问题。
