
Performance of using different Avro types for sending messages to Kafka – SpecificRecordBase vs. GenericRecord with Schema Registry

I'm trying to find information about the performance and (dis)advantages of using two different Avro types for sending Kafka messages. According to my research, one can create an Avro-based Kafka message payload as:

EITHER:

a GenericRecord, an instance of which can be created by calling new GenericData.Record and passing a schema read from the Schema Registry as a parameter:

Roughly:

private CachedSchemaRegistryClient schemaRegistryClient;
private Schema valueSchema;
// Read a schema
//…
this.valueSchema = schemaRegistryClient.getBySubjectAndId("TestTopic-value", 1);
// Define a generic record according to the loaded schema

GenericData.Record record = new GenericData.Record(valueSchema);
// Send to Kafka

ListenableFuture<SendResult<String, GenericRecord>> res;
res = avroKafkaTemplate
        .send(MessageBuilder
                .withPayload(record)
                .setHeader(KafkaHeaders.TOPIC, TOPIC)
                .setHeader(KafkaHeaders.MESSAGE_KEY, record.get("id"))
                .build());

OR:

a class that extends SpecificRecordBase and is generated with the help of Maven (from a file containing an Avro schema):

// …
public class MyClass extends org.apache.avro.specific.SpecificRecordBase
        implements org.apache.avro.specific.SpecificRecord

// …
MyClass myAvroClass = new MyClass();

ListenableFuture<SendResult<String, MyClass>> res;
res = avroKafkaTemplate
        .send(MessageBuilder
                .withPayload(myAvroClass)
                .setHeader(KafkaHeaders.TOPIC, TOPIC)
                .setHeader(KafkaHeaders.MESSAGE_KEY, myAvroClass.getId())
                .build());

When debugging code that holds a GenericRecord instance (e.g. a GenericData.Record, which implements the GenericRecord interface), one can see that it carries its schema.

On that account I have a few questions:

  1. If I send a GenericRecord instance to Kafka, is the underlying schema also sent?
    If not, when is it dropped? Which class/method is responsible for extracting the bytes from the GenericRecord and dropping the underlying schema so that it is not sent together with the payload? If yes, what is the point of the Schema Registry at all?

  2. In the case of a class that extends SpecificRecordBase, the underlying schema is also sent, isn't it? That would mean that if I took a function that receives a Kafka message and counts its bytes, I should expect more bytes in a specific-record message than in a generic-record message, right?

  3. A SpecificRecord instance gives me more control, and its usage is less error-prone. If a schema is not sent with a GenericRecord but is sent with a SpecificRecord, then we have a trade-off. On the one hand (SpecificRecord), usage is simple because a clear API is available (one doesn't have to know all the fields by heart and write get("X"), get("Y"), etc.); on the other hand, the payload size increases because the schema has to be sent along with it. If I have a relatively big schema (say, 50 fields), I should opt for sending GenericRecords with the help of the Schema Registry, because otherwise performance would suffer from the schema being sent with every message, correct?

Schemas are registered with the Schema Registry and cached by the producer in both cases, generic and specific; the serialized message itself carries only a small schema-ID header rather than the full schema.
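To illustrate why the full schema need not travel with each record: the Confluent wire format prefixes the Avro binary payload with a magic byte (0x00) and a 4-byte big-endian schema ID. A stdlib-only sketch of parsing that header (the payload bytes below are illustrative, not a real serialized record):

```java
import java.nio.ByteBuffer;

public class WireFormat {
    // Confluent wire format: [magic byte 0x00][4-byte schema ID][Avro binary payload]
    static final byte MAGIC_BYTE = 0x0;

    static int schemaId(byte[] serialized) {
        ByteBuffer buf = ByteBuffer.wrap(serialized);
        if (buf.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        // Only this 4-byte ID is in the message; the schema itself lives in the registry
        return buf.getInt();
    }

    public static void main(String[] args) {
        // Hypothetical message: magic byte, schema ID 1, then some payload bytes
        byte[] msg = {0x0, 0, 0, 0, 1, 0x2, 0x6, 'f', 'o', 'o'};
        System.out.println(schemaId(msg)); // prints 1
    }
}
```

So regardless of whether a 5-field or a 50-field schema is used, the per-message overhead for the schema reference is the same 5 bytes; the consumer fetches (and caches) the schema from the registry by ID.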

Performance-wise, while I haven't benchmarked it, I'd estimate serialization time to be approximately the same for both, while deserialization would be quicker for the generic case, because field access and type casting are deferred to your own code rather than validated for each field.
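To make that difference concrete, here is a stdlib-only analogy (not actual Avro code): generic-style access returns Object and leaves the cast to the caller, while specific-style access goes through typed getters on a generated class. The class and field names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class AccessStyles {
    // Generic style: fields live behind a name lookup, the caller casts each value
    static Object genericGet(Map<String, Object> record, String field) {
        return record.get(field); // returns Object; the cast is deferred to the caller
    }

    // Specific style: a generated class exposes typed getters, no cast at the call site
    static final class User {
        private final String id;
        User(String id) { this.id = id; }
        String getId() { return id; }
    }

    public static void main(String[] args) {
        Map<String, Object> generic = new HashMap<>();
        generic.put("id", "42");
        String idFromGeneric = (String) genericGet(generic, "id"); // explicit cast
        String idFromSpecific = new User("42").getId();            // typed access
        System.out.println(idFromGeneric.equals(idFromSpecific));  // prints true
    }
}
```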

Note: there are also ReflectData records, which can be slower because they use reflection.
