
Avro serialization object not serializable issue

I am trying to produce and consume Avro messages on Kafka through the Spark Streaming API, but Avro throws an "object not serializable" exception. I tried wrapping the datum in an AvroKey wrapper, but it still does not work.

Producer code:

public static final String schema = "{"
    +"\"fields\": ["
    +   " { \"name\": \"str1\", \"type\": \"string\" },"
    +   " { \"name\": \"str2\", \"type\": \"string\" },"
    +   " { \"name\": \"int1\", \"type\": \"int\" }"
    +"],"
    +"\"name\": \"myrecord\","
    +"\"type\": \"record\""
    +"}"; 

    public static void startAvroProducer() throws InterruptedException, IOException{
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.CLIENT_ID_CONFIG, "Kafka Avro Producer");

        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(AvroProducer.schema);

        AvroKey<GenericRecord> k = new AvroKey<GenericRecord>();

        GenericRecord datum = new GenericData.Record(schema);
        datum.put("str1","phani");
        datum.put("str2", "kumar");
        datum.put("int1", 1);
        k.datum(datum);
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        Encoder e = EncoderFactory.get().binaryEncoder(os, null);
        writer.write(k.datum(), e);
        e.flush();
        byte[] bytedata = os.toByteArray();
        KafkaProducer<String,byte[]> producer = new KafkaProducer<String,byte[]>(props);
        ProducerRecord<String,byte[]> producerRec = new ProducerRecord<String, byte[]>("jason", bytedata);
        producer.send(producerRec);
        producer.close();
    }

Consumer code:

private static SparkConf sc = null;
private static JavaSparkContext jsc = null;
private static JavaStreamingContext jssc = null;

public static void startAvroConsumer() throws InterruptedException {
    sc = new SparkConf().setAppName("Spark Avro Streaming Consumer")
            .setMaster("local[*]");

    jsc = new JavaSparkContext(sc);
    jssc = new JavaStreamingContext(jsc, new Duration(200));

    Schema.Parser parser = new Schema.Parser();
    Schema schema = parser.parse(AvroProducer.schema);

    Set<String> topics = Collections.singleton("jason");
    Map<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", "localhost:9092");
    kafkaParams.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
    kafkaParams.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    JavaPairInputDStream<String, byte[]> inputDstream = KafkaUtils
            .createDirectStream(jssc, String.class, byte[].class,
                    StringDecoder.class, DefaultDecoder.class, kafkaParams,
                    topics);

    GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(
            schema);

    inputDstream.map(message -> {
        ByteArrayInputStream bis = new ByteArrayInputStream(message._2);
        Decoder decoder = DecoderFactory.get().binaryDecoder(bis, null);
        GenericRecord record = reader.read(null, decoder);
        String str1 = getValue(record, "str1", String.class);
        String str2 = getValue(record, "str2", String.class);
        int int1 = getValue(record, "int1", Integer.class);

        return str1 + " " + str2 + " " + int1;
    }).print();

    jssc.start();
    jssc.awaitTermination();
}

@SuppressWarnings("unchecked")
public static <T> T getValue(GenericRecord genericRecord, String name,
        Class<T> clazz) {
    Object obj = genericRecord.get(name);
    if (obj == null)
        return null;
    if (obj.getClass() == Utf8.class) {
        return (T) obj.toString();
    }
    if (obj.getClass() == Integer.class) {
        return (T) obj;
    }
    return null;
}

Exception:

Caused by: java.io.NotSerializableException: org.apache.avro.generic.GenericDatumReader
Serialization stack:
    - object not serializable (class: org.apache.avro.generic.GenericDatumReader, value: org.apache.avro.generic.GenericDatumReader@7da8db47)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 1)
    - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
    - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class com.applications.streaming.consumers.AvroConsumer, functionalInterfaceMethod=org/apache/spark/api/java/function/Function.call:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic com/applications/streaming/consumers/AvroConsumer.lambda$0:(Lorg/apache/avro/generic/GenericDatumReader;Lscala/Tuple2;)Ljava/lang/String;, instantiatedMethodType=(Lscala/Tuple2;)Ljava/lang/String;, numCaptured=1])
    - writeReplace data (class: java.lang.invoke.SerializedLambda)
    - object (class com.applications.streaming.consumers.AvroConsumer$$Lambda$13/1805404637, com.applications.streaming.consumers.AvroConsumer$$Lambda$13/1805404637@aa31e58)
    - field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
    - object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    ... 15 more

From reading various blogs, what I understood is that Avro objects do not implement the Serializable interface. But according to the JIRA below,

https://issues.apache.org/jira/browse/AVRO-1502

that issue is resolved, yet I am still hitting it.

Is there a possible fix for this?

Your problem is that you are referencing the following object from your lambda function:

GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(
        schema);

GenericDatumReader is not serializable. You have two options: instantiate the reader inside your map function (not a good option, since that creates a new reader for every record), or make it a static member of your class. A static member is created only once per executor (one per JVM) instead of being captured and shipped with the closure. Since your schema is a compile-time constant, you can easily initialize it right in the field declaration, like this:

static GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(new Schema.Parser().parse(AvroProducer.schema));

or, if you were using an Avro-generated record class (the question's schema is just a String constant, so the class name here is hypothetical), its generated schema field:

static GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(MyRecord.SCHEMA$);
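With the reader as a static field, the lambda no longer captures anything non-serializable; it just references the field, which each executor JVM initializes when the class is loaded. A minimal sketch of the adjusted consumer, reusing the question's imports and its getValue helper:

// Created once per executor JVM when the class is loaded;
// never serialized as part of the Spark closure.
private static final GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<GenericRecord>(
                new Schema.Parser().parse(AvroProducer.schema));

// ... inside startAvroConsumer(), after building inputDstream:
inputDstream.map(message -> {
    // Decode the raw Avro bytes carried in the Kafka message value.
    ByteArrayInputStream bis = new ByteArrayInputStream(message._2);
    Decoder decoder = DecoderFactory.get().binaryDecoder(bis, null);
    GenericRecord record = reader.read(null, decoder);

    String str1 = getValue(record, "str1", String.class);
    String str2 = getValue(record, "str2", String.class);
    int int1 = getValue(record, "int1", Integer.class);
    return str1 + " " + str2 + " " + int1;
}).print();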

From the consumer code:

kafkaParams.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringSerializer");
kafkaParams.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");

you can see that serializer classes were put under the deserializer keys.

The deserializers to use are StringDeserializer for the key and ByteArrayDeserializer for the value.
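In other words, the consumer-side params should point at the deserializer classes:

kafkaParams.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");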

A general remark: Avro data on Kafka should be paired with some kind of schema registry service, because Avro schemas can evolve over time.

http://bytepadding.com/big-data/spark/avro/avro-serialization-de-serialization-using-confluent-schema-registry/
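As a rough sketch of what that looks like on the producer side (this assumes Confluent's kafka-avro-serializer is on the classpath and a Schema Registry is running at http://localhost:8081; neither is part of the question's setup):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
// KafkaAvroSerializer registers the schema with the registry and prefixes
// each message with the schema id, so consumers can handle schema evolution.
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address

Schema schema = new Schema.Parser().parse(AvroProducer.schema);
GenericRecord datum = new GenericData.Record(schema);
datum.put("str1", "phani");
datum.put("str2", "kumar");
datum.put("int1", 1);

KafkaProducer<String, Object> producer = new KafkaProducer<String, Object>(props);
producer.send(new ProducerRecord<String, Object>("jason", datum));
producer.close();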
