
How to encode/decode Kafka messages using Avro binary encoder?

I'm trying to use Avro for messages being read from/written to Kafka. Does anyone have an example of using the Avro binary encoder to encode/decode data that will be put on a message queue?

I need the Avro part more than the Kafka part. Or perhaps I should look at a different solution? Basically, I'm trying to find a more space-efficient alternative to JSON. Avro was just mentioned since it can be more compact than JSON.

This is a basic example. I have not tried it with multiple partitions/topics.

//Sample producer code

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.*;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Properties;


public class ProducerTest {

    void producer(Schema schema) throws IOException {

        Properties props = new Properties();
        props.put("metadata.broker.list", "0:9092");
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");
        props.put("request.required.acks", "1");
        ProducerConfig config = new ProducerConfig(props);
        Producer<String, byte[]> producer = new Producer<String, byte[]>(config);
        //Step 1 : Create a GenericRecord object from the schema
        GenericRecord payload1 = new GenericData.Record(schema);
        //Step 2 : Put data in that GenericRecord object
        payload1.put("desc", "'testdata'");
        //payload1.put("name", "अasa");
        payload1.put("name", "dbevent1");
        payload1.put("id", 111);
        System.out.println("Original Message : " + payload1);
        //Step3 : Serialize the object to a bytearray
        DatumWriter<GenericRecord> writer = new SpecificDatumWriter<GenericRecord>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(payload1, encoder);
        encoder.flush();
        out.close();

        byte[] serializedBytes = out.toByteArray();
        System.out.println("Sending message in bytes : " + serializedBytes);
        //String serializedHex = Hex.encodeHexString(serializedBytes);
        //System.out.println("Serialized Hex String : " + serializedHex);
        KeyedMessage<String, byte[]> message = new KeyedMessage<String, byte[]>("page_views", serializedBytes);
        producer.send(message);
        producer.close();

    }


    public static void main(String[] args) throws IOException, DecoderException {
        ProducerTest test = new ProducerTest();
        Schema schema = new Schema.Parser().parse(new File("src/test_schema.avsc"));
        test.producer(schema);
    }
}

//Sample consumer code

Part 1: Consumer group code, since you can have more than one consumer for multiple partitions/topics.

import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConsumerGroupExample {
   private final ConsumerConnector consumer;
   private final String topic;
   private ExecutorService executor;

   public ConsumerGroupExample(String a_zookeeper, String a_groupId, String a_topic){
      consumer = kafka.consumer.Consumer.createJavaConsumerConnector(
              createConsumerConfig(a_zookeeper, a_groupId));
      this.topic = a_topic;
   }

   private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId){
       Properties props = new Properties();
       props.put("zookeeper.connect", a_zookeeper);
       props.put("group.id", a_groupId);
       props.put("zookeeper.session.timeout.ms", "400");
       props.put("zookeeper.sync.time.ms", "200");
       props.put("auto.commit.interval.ms", "1000");

       return new ConsumerConfig(props);
   }

    public void shutdown(){
        if (consumer != null) consumer.shutdown();
        if (executor != null) executor.shutdown();
        try {
            //Wait for the worker threads to finish; warn if they do not stop in time
            if (!executor.awaitTermination(5000, TimeUnit.MILLISECONDS)) {
                System.out.println("Timed out waiting for consumer threads to shut down, exiting uncleanly");
            }
        } catch (InterruptedException e) {
            System.out.println("Interrupted during shutdown, exiting uncleanly");
        }
    }


    public void run(int a_numThreads){
        //Make a map of topic as key and no. of threads for that topic
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put(topic, new Integer(a_numThreads));
        //Create message streams for each topic
        Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
        List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);

        //initialize thread pool
        executor = Executors.newFixedThreadPool(a_numThreads);
        //start consuming from thread
        int threadNumber = 0;
        for (final KafkaStream stream : streams) {
            executor.submit(new ConsumerTest(stream, threadNumber));
            threadNumber++;
        }
    }
    public static void main(String[] args) {
        String zooKeeper = args[0];
        String groupId = args[1];
        String topic = args[2];
        int threads = Integer.parseInt(args[3]);

        ConsumerGroupExample example = new ConsumerGroupExample(zooKeeper, groupId, topic);
        example.run(threads);

        try {
            Thread.sleep(10000);
        } catch (InterruptedException ie) {

        }
        example.shutdown();
    }


}

Part 2: Individual consumer that actually consumes the messages.

import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.message.MessageAndMetadata;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.commons.codec.binary.Hex;

import java.io.File;
import java.io.IOException;

public class ConsumerTest implements Runnable{

    private KafkaStream m_stream;
    private int m_threadNumber;

    public ConsumerTest(KafkaStream a_stream, int a_threadNumber) {
        m_threadNumber = a_threadNumber;
        m_stream = a_stream;
    }

    public void run(){
        ConsumerIterator<byte[], byte[]> it = m_stream.iterator();
        //Parse the schema and build the datum reader once, not once per message
        Schema schema;
        try {
            schema = new Schema.Parser().parse(new File("src/test_schema.avsc"));
        } catch (IOException e) {
            throw new RuntimeException("Could not load Avro schema", e);
        }
        DatumReader<GenericRecord> reader = new SpecificDatumReader<GenericRecord>(schema);
        while (it.hasNext())
        {
            try {
                //byte[] input = Hex.decodeHex(it.next().message().toString().toCharArray());
                byte[] received_message = it.next().message();
                System.out.println("Received " + received_message.length + " bytes");
                //Deserialize the byte array back into a GenericRecord
                Decoder decoder = DecoderFactory.get().binaryDecoder(received_message, null);
                GenericRecord payload2 = reader.read(null, decoder);
                System.out.println("Message received : " + payload2);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }


}

Test Avro schema:

{
    "namespace": "xyz.test",
     "type": "record",
     "name": "payload",
     "fields":[
         {
            "name": "name", "type": "string"
         },
         {
            "name": "id",  "type": ["int", "null"]
         },
         {
            "name": "desc", "type": ["string", "null"]
         }
     ]
}

Important things to note are:

  1. You'll need the standard Kafka and Avro jars to run this code out of the box.

  2. It is very important to set props.put("serializer.class", "kafka.serializer.DefaultEncoder"). Don't use the StringEncoder, as that won't work if you are sending a byte array as the message.

  3. You can convert the byte[] to a hex string, send that, and on the consumer convert the hex string back to byte[] and then to the original message (a small sketch follows this list).

  4. Run ZooKeeper and the broker as described here: http://kafka.apache.org/documentation.html#quickstart and create a topic called "page_views", or whatever you want.

  5. Run ProducerTest.java and then ConsumerGroupExample.java and see the Avro data being produced and consumed.
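
For point 3, here is a minimal sketch using the commons-codec Hex helpers that are already imported in the producer (the variable names here are illustrative, not from the code above):

    // Producer side: turn the Avro byte[] into a hex string before sending
    String serializedHex = Hex.encodeHexString(serializedBytes);

    // Consumer side: turn the received hex string back into the original Avro byte[]
    byte[] avroBytes = Hex.decodeHex(serializedHex.toCharArray());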

I finally remembered to ask the Kafka mailing list and got the following as an answer, which worked perfectly.

Yes, you can send messages as byte arrays. If you look at the constructor of the Message class, you will see -

def this(bytes: Array[Byte])

Now, looking at the Producer send() API -

def send(producerData: ProducerData[K,V]*)

You can set V to be of type Message and K to whatever type you want your key to be. If you don't care about partitioning using a key, then set that to Message type as well.

Thanks, Neha

If you want to get a byte array from an Avro message (the kafka part is already answered), use the binary encoder:

    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema); 
    ByteArrayOutputStream os = new ByteArrayOutputStream(); 
    try {
        Encoder e = EncoderFactory.get().binaryEncoder(os, null); 
        writer.write(record, e); 
        e.flush(); 
        byte[] byteData = os.toByteArray(); 
    } finally {
        os.close(); 
    }
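
And the matching decode step, a sketch assuming the same schema object and the byteData produced above, reads the bytes back into a GenericRecord:

    // Decode the binary-encoded bytes back into a GenericRecord
    GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(byteData, null);
    GenericRecord record = reader.read(null, decoder);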

Updated Answer.

Confluent provides an Avro serializer/deserializer for Kafka, with Maven (SBT-formatted) coordinates:

  "io.confluent" % "kafka-avro-serializer" % "3.0.0"

You pass an instance of KafkaAvroSerializer into the KafkaProducer constructor.

Then you can create Avro GenericRecord instances, and use those as values inside Kafka ProducerRecord instances which you can send with KafkaProducer.

On the Kafka consumer side, you use KafkaAvroDeserializer and KafkaConsumer.
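
A rough sketch of that wiring on the producer side (imports from org.apache.kafka.clients.producer and org.apache.avro.generic omitted; this assumes a Confluent Schema Registry running at http://localhost:8081, which the serializer requires, and reuses the topic and fields from the example above):

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("schema.registry.url", "http://localhost:8081"); // assumed registry location

    KafkaProducer<String, GenericRecord> producer = new KafkaProducer<String, GenericRecord>(props);
    GenericRecord record = new GenericData.Record(schema); // schema as defined above
    record.put("name", "dbevent1");
    record.put("id", 111);
    record.put("desc", "testdata");
    producer.send(new ProducerRecord<String, GenericRecord>("page_views", record));
    producer.close();
    // On the consumer side, set value.deserializer to
    // io.confluent.kafka.serializers.KafkaAvroDeserializer plus the same schema.registry.url.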

Instead of Avro, you could also simply consider compressing the data, either with gzip (good compression, higher CPU cost) or LZF or Snappy (much faster, slightly weaker compression).
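
As a sketch of the gzip option, compressing a JSON payload with the standard library before handing the bytes to the byte[] producer shown earlier (the payload string here is just an example):

    byte[] jsonBytes = "{\"name\":\"dbevent1\",\"id\":111}".getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(bos);   // java.util.zip.GZIPOutputStream
    gzip.write(jsonBytes);
    gzip.close();                                         // flushes and finishes the gzip stream
    byte[] compressed = bos.toByteArray();                // send this as the Kafka message value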

Or, alternatively, there is also Smile binary JSON, supported in Java by Jackson (with this extension): it is a compact binary format, and much easier to use than Avro:

// SmileFactory comes from the jackson-dataformat-smile extension module
ObjectMapper mapper = new ObjectMapper(new SmileFactory());
byte[] serialized = mapper.writeValueAsBytes(pojo);
// or back
SomeType pojo = mapper.readValue(serialized, SomeType.class);

Basically the same code as with JSON, except for passing a different format factory. From a data-size perspective, whether Smile or Avro is more compact depends on the details of the use case, but both are more compact than JSON.

The benefit is that this works fast with both JSON and Smile, with the same code, using just POJOs, compared to Avro, which either requires code generation or lots of manual code to pack and unpack GenericRecords.
