
Use Serializable lambda in Spark JavaRDD transformation

I am trying to understand the following code.

// File: LambdaTest.java

package test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class LambdaTest implements Ops {

  public static void main(String[] args) {
    new LambdaTest().job();
  }

  public void job() {
    SparkConf conf = new SparkConf()
      .setAppName(LambdaTest.class.getName())
      .setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    List<Integer>              lst  = Arrays.asList(1, 2, 3, 4, 5, 6);
    JavaRDD<Integer>           rdd  = jsc.parallelize(lst);
    Function<Integer, Integer> func1 = (Function<Integer, Integer> & Serializable) x -> x * x;
    Function<Integer, Integer> func2 = x -> x * x;

    System.out.println(func1.getClass());  //test.LambdaTest$$Lambda$8/390374517
    System.out.println(func2.getClass());  //test.LambdaTest$$Lambda$9/208350681

    this.doSomething(rdd, func1);  // works
    this.doSomething(rdd, func2);  // org.apache.spark.SparkException: Task not serializable
  }
}

// File: Ops.java

package test;

import org.apache.spark.api.java.JavaRDD;
import java.util.function.Function;    

public interface Ops {

  default void doSomething(JavaRDD<Integer> rdd, Function<Integer, Integer> func) {
    rdd.map(x -> x + func.apply(x))
       .collect()
       .forEach(System.out::println);
  }

}

The difference is that func1 is cast with a Serializable bound, while func2 is not.

Looking at the runtime classes of the two functions, both are anonymous classes under the LambdaTest class.

They are both used in an RDD transformation defined in an interface, so the two functions and LambdaTest should need to be serializable.

As you can see, LambdaTest does not implement the Serializable interface, so I would expect neither function to work. But surprisingly, func1 works.

The stack trace for func2 is the following:

Serialization stack:
    - object not serializable (class: test.LambdaTest$$Lambda$9/208350681, value: test.LambdaTest$$Lambda$9/208350681@61d84e08)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 1)
    - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
    - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=interface fr.leboncoin.etl.jobs.test.Ops, functionalInterfaceMethod=org/apache/spark/api/java/function/Function.call:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic fr/leboncoin/etl/jobs/test/Ops.lambda$doSomething$1024e30a$1:(Ljava/util/function/Function;Ljava/lang/Integer;)Ljava/lang/Integer;, instantiatedMethodType=(Ljava/lang/Integer;)Ljava/lang/Integer;, numCaptured=1])
    - writeReplace data (class: java.lang.invoke.SerializedLambda)
    - object (class fr.leboncoin.etl.jobs.test.Ops$$Lambda$10/1470295349, fr.leboncoin.etl.jobs.test.Ops$$Lambda$10/1470295349@4e1459ea)
    - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
    - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
    ... 19 more

It seems that if a function is bound with Serializable, the object containing it does not need to be serialized, which confuses me.

Any explanation of this is highly appreciated.

------------------------------ Updates ------------------------------

I have tried using an abstract class instead of an interface:

// File: AbstractTest.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class AbstractTest {

  public static void main(String[] args) {
    new AbstractTest().job();
  }

  public void job() {
    SparkConf conf = new SparkConf()
      .setAppName(AbstractTest.class.getName())
      .setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    List<Integer>    lst = Arrays.asList(1, 2, 3, 4, 5, 6);
    JavaRDD<Integer> rdd = jsc.parallelize(lst);

    Ops ops = new Ops() {

      @Override
      public Integer apply(Integer x) {
        return x + 1;
      }
    };

    System.out.println(ops.getClass()); // class fr.leboncoin.etl.jobs.test.AbstractTest$1
    ops.doSomething(rdd);
  }
}

// File: Ops.java

import org.apache.spark.api.java.JavaRDD;

import java.io.Serializable;

public abstract class Ops implements Serializable {

  public abstract Integer apply(Integer x);

  public void doSomething(JavaRDD<Integer> rdd) {
    rdd.map(x -> x + apply(x))
       .collect()
       .forEach(System.out::println);
  }
}

It does not work either, even though the Ops class is compiled in a separate file from the AbstractTest class. The ops object's class name is class fr.leboncoin.etl.jobs.test.AbstractTest$1. According to the following stack trace, it seems that AbstractTest needs to be serialized in order to serialize AbstractTest$1.

Serialization stack:
    - object not serializable (class: test.AbstractTest, value: test.AbstractTest@21ac5eb4)
    - field (class: test.AbstractTest$1, name: this$0, type: class test.AbstractTest)
    - object (class test.AbstractTest$1, test.AbstractTest$1@36fc05ff)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 1)
    - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
    - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class fr.leboncoin.etl.jobs.test.Ops, functionalInterfaceMethod=org/apache/spark/api/java/function/Function.call:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeSpecial fr/leboncoin/etl/jobs/test/Ops.lambda$doSomething$6d6228b6$1:(Ljava/lang/Integer;)Ljava/lang/Integer;, instantiatedMethodType=(Ljava/lang/Integer;)Ljava/lang/Integer;, numCaptured=1])
    - writeReplace data (class: java.lang.invoke.SerializedLambda)
    - object (class fr.leboncoin.etl.jobs.test.Ops$$Lambda$8/208350681, fr.leboncoin.etl.jobs.test.Ops$$Lambda$8/208350681@4acb2510)
    - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
    - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
    ... 19 more

LambdaTest doesn't need to be Serializable as it's not being sent over the wire - there's no reason to do that.

On the other hand, both func1 and func2 do have to be Serializable, as Spark will be using them to perform computation on the RDD, and therefore this code will have to be sent over the wire to the worker nodes. Notice that even though you write it all in the same class, after compilation your lambdas will be put in separate classes; thanks to that, the whole class doesn't have to be sent over the wire -> the outer class doesn't need to be Serializable.

As for why func1 works: when you do not use a type cast, the Java compiler will infer the type of a lambda expression for you. So in this case the code generated for func2 will simply implement Function (since that's the target variable's type). On the other hand, if a type cannot be inferred from the context (as in your case: the compiler has no way of knowing that func1 has to be Serializable, since that is a requirement of Spark), you can use a type cast, as in your example, to explicitly provide a type. In that case the code generated by the compiler will implement both the Function and Serializable interfaces, and the compiler won't try to infer the type on its own.
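To make the difference concrete, here is a minimal plain-Java sketch, independent of Spark (the class name CastDemo and the helper trySerialize are made up for illustration). The lambda created with the intersection-type cast can be written with ObjectOutputStream, while the uncast one fails with NotSerializableException, even though the enclosing class itself is not Serializable:

// File: CastDemo.java (hypothetical)

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class CastDemo {  // deliberately NOT Serializable

  public static void main(String[] args) {
    // Intersection-type cast: the generated lambda class implements both interfaces.
    Function<Integer, Integer> serializableFunc =
        (Function<Integer, Integer> & Serializable) x -> x * x;

    // No cast: the generated lambda class implements only Function.
    Function<Integer, Integer> plainFunc = x -> x * x;

    System.out.println(trySerialize(serializableFunc)); // prints: ok
    System.out.println(trySerialize(plainFunc));        // prints: java.io.NotSerializableException: ...
  }

  // Attempts plain Java serialization and reports the outcome.
  private static String trySerialize(Object o) {
    try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
         ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(o);
      return "ok";
    } catch (Exception e) {
      return e.toString();
    }
  }
}

Note that only the lambda object is written here; the enclosing class never has to be serialized, because the lambda captures nothing from it.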

You can find this described in the State of the Lambda document, under 5. Contexts for target typing.

The above answer is correct. As for the additional abstract class question: the anonymous Ops subclass created inside the AbstractTest class is an inner class, and an inner class holds a reference to its enclosing instance. When an object is serialized, its fields are serialized as well; the enclosing AbstractTest is not Serializable, so the inner class instance cannot be serialized.
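A similar sketch of that this$0 point (again with made-up names: InnerClassDemo, Op, AddOne): an anonymous subclass created on an instance of a non-serializable class carries an implicit reference to the enclosing instance, while an equivalent static nested subclass does not, so only the latter survives plain Java serialization.

// File: InnerClassDemo.java (hypothetical)

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class InnerClassDemo {  // deliberately NOT Serializable

  private final int offset = 1;

  // Serializable base class, analogous to the abstract Ops in the question.
  abstract static class Op implements Serializable {
    abstract Integer apply(Integer x);
  }

  // Static nested subclass: no hidden reference to an InnerClassDemo instance.
  static class AddOne extends Op {
    Integer apply(Integer x) { return x + 1; }
  }

  void run() {
    // Anonymous subclass created in an instance method: it keeps an implicit
    // this$0 field pointing at the enclosing, non-serializable InnerClassDemo.
    Op anonymous = new Op() {
      Integer apply(Integer x) { return x + offset; }  // uses outer state
    };

    System.out.println(trySerialize(new AddOne()));  // prints: ok
    System.out.println(trySerialize(anonymous));     // prints: java.io.NotSerializableException: InnerClassDemo
  }

  public static void main(String[] args) {
    new InnerClassDemo().run();
  }

  // Attempts plain Java serialization and reports the outcome.
  private static String trySerialize(Object o) {
    try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
         ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(o);
      return "ok";
    } catch (Exception e) {
      return e.toString();
    }
  }
}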
