
Deserializing Byte Buffer to Avro

I would like to deserialize data from a byte array to an Avro-generated Class.

The data comes from a file mapping written by another process, implemented in C++, which I can't control. Essentially, the data comes from a bigger structure in which every item is simply cast to bytes and put into an array. For example:

struct Sample {
    int firstInt;
    float firstFloat;
    int secondInt;
    ...
}

The first 4 bytes hold the firstInt data, the next 4 hold the float, and so on. When I try to read the bytes with

AvroClassName.fromByteBuffer(bBuffer(0, 500));

I get an error:

Exception in thread "main" org.apache.avro.message.BadHeaderException: Unrecognized header bytes: 0x5A 0x03

I assume this happens because Avro expects the buffer data to be in a specific format serialized by Avro itself.

I could probably use some kind of builder in which I manually map every part of the buffer to the Avro object's fields, but because there are several hundred fields in this class, that would be tedious and error-prone.

Is there any method that could deserialize the data to the Avro-generated Object? The order of the fields in the Avro Class is the same as they are occurring in the Byte Buffer.

The binary data you describe will not deserialize to Avro.

"I assume this happens because Avro expects the buffer data to be in a specific format serialized by Avro itself."

You've answered your own question, but it's interesting to think about why a general-purpose utility method of the kind you want is unlikely to exist.

  1. A deserializer would need to know the endianness (https://en.wikipedia.org/wiki/Endianness) of the runtime that serialized the fields by casting them to bytes. For C++, this is most likely the endianness of the machine on which the code runs. When this is a concern, serializers usually take one of two approaches: fix the endianness of the serialized bytes (e.g. network byte order), or include an indicator, commonly known as a 'byte order mark', that lets the deserializer infer the endianness of the serializing architecture. Your question suggests that neither is present in the serialized data.
  2. The C++ standard defines only the minimum size of numeric data types. An 'int' is guaranteed to be at least 16 bits, so the serialized 'int' fields could, in principle, be any size of 16 bits or more; 32 bits is common but not required.
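
To make the first ambiguity concrete, here is a minimal sketch using only java.nio. The class name EndiannessDemo and the helper readInt are made up for illustration, and the four bytes are hypothetical: they start with the 0x5A 0x03 from your error message and are padded with zeros. The same bytes decode to two very different ints depending on the byte order the reader assumes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndiannessDemo {
    // Interpret the first four bytes of raw as a 32-bit int in the given byte order.
    static int readInt(byte[] raw, ByteOrder order) {
        return ByteBuffer.wrap(raw).order(order).getInt(0);
    }

    public static void main(String[] args) {
        // Hypothetical input: 0x5A 0x03 followed by two zero bytes.
        byte[] raw = {0x5A, 0x03, 0x00, 0x00};
        System.out.println(readInt(raw, ByteOrder.LITTLE_ENDIAN)); // prints 858
        System.out.println(readInt(raw, ByteOrder.BIG_ENDIAN));    // prints 1510146048
    }
}
```

Note that a Java ByteBuffer defaults to big-endian, whereas a C++ process on x86 would write little-endian, so the order usually has to be set explicitly.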

These two ambiguities alone help explain why your binary data needs special handling. You may be able to resolve them based on your knowledge of the serializing machine and/or compiler. However, this is typically outside the scope of a general serialization library, which is normally limited to serializing and deserializing its own binary data format.
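
If you can pin those points down, the special handling does not have to mean several hundred hand-written assignments. The following is only a rough sketch, assuming a little-endian writer, 4-byte ints, 4-byte IEEE 754 floats, and no padding between fields; all of these need to be verified against the actual C++ build. RawStructReader and FieldType are hypothetical names, not part of Avro. The idea is to describe the layout once as an ordered list of field types and let a loop walk the buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

class RawStructReader {
    // Hypothetical stand-in for the field types that occur in the C++ struct.
    enum FieldType { INT32, FLOAT32 }

    // Read values in declaration order.
    // Assumes little-endian data, 4-byte fields, and no padding.
    static List<Object> read(byte[] raw, List<FieldType> layout) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        List<Object> values = new ArrayList<>();
        for (FieldType type : layout) {
            switch (type) {
                case INT32:
                    values.add(buf.getInt());   // advances the position by 4 bytes
                    break;
                case FLOAT32:
                    values.add(buf.getFloat()); // advances the position by 4 bytes
                    break;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Build sample bytes the way a little-endian writer would lay them out.
        byte[] raw = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(7).putFloat(1.5f).putInt(-2).array();
        List<FieldType> layout =
                List.of(FieldType.INT32, FieldType.FLOAT32, FieldType.INT32);
        System.out.println(read(raw, layout)); // prints [7, 1.5, -2]
    }
}
```

Since the field order in your Avro class matches the buffer, the layout list could plausibly be derived from the generated class's schema (iterating getSchema().getFields() and switching on each field's type) and each value stored with put(field.pos(), value). I have not tested that against your class, so treat it as a direction rather than a finished solution.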
