
Does Avro schema evolution require access to both old and new schemas?

If I serialize an object using schema version 1, and later update the schema to version 2 (say by adding a field), am I required to use schema version 1 when later deserializing the object? Ideally I would like to just use schema version 2 and have the deserialized object carry the default value for the field that was added to the schema after the object was originally serialized.

Maybe some code will explain better...

schema1:

{"type": "record",
 "name": "User",
 "fields": [
  {"name": "firstName", "type": "string"}
 ]}

schema2:

{"type": "record",
 "name": "User",
 "fields": [
  {"name": "firstName", "type": "string"},
  {"name": "lastName", "type": "string", "default": ""}
 ]}

using the generic non-code-generation approach:

// serialize
ByteArrayOutputStream out = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema1);
GenericRecord datum = new GenericData.Record(schema1);
datum.put("firstName", "Jack");
writer.write(datum, encoder);
encoder.flush();
out.close();
byte[] bytes = out.toByteArray();

// deserialize
// I would like to not have any reference to schema1 below here
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema2);
Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
GenericRecord result = reader.read(null, decoder);

results in an EOFException. Using the jsonEncoder results in an AvroTypeException.

I know it will work if I pass both schema1 and schema2 to the GenericDatumReader constructor, but I'd like to not have to keep a repository of all previous schemas and also somehow keep track of which schema was used to serialize each particular object.
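For reference, the two-schema constructor that does work looks roughly like this (a minimal self-contained sketch; schema resolution fills in the default for the added field):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class ResolveExample {
    static final Schema SCHEMA1 = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + "{\"name\": \"firstName\", \"type\": \"string\"}]}");
    static final Schema SCHEMA2 = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + "{\"name\": \"firstName\", \"type\": \"string\"},"
      + "{\"name\": \"lastName\", \"type\": \"string\", \"default\": \"\"}]}");

    public static GenericRecord readWithBothSchemas() throws Exception {
        // serialize with the writer's schema (version 1)
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        GenericRecord datum = new GenericData.Record(SCHEMA1);
        datum.put("firstName", "Jack");
        new GenericDatumWriter<GenericRecord>(SCHEMA1).write(datum, encoder);
        encoder.flush();

        // deserialize, passing BOTH the writer's and the reader's schema;
        // Avro resolves the two and supplies lastName's default value
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(SCHEMA1, SCHEMA2);
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return reader.read(null, decoder);
    }

    public static void main(String[] args) throws Exception {
        GenericRecord result = readWithBothSchemas();
        System.out.println(result.get("firstName") + " / " + result.get("lastName"));
    }
}
```

The catch, as described above, is that this still requires knowing which schema each object was written with.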

I also tried the code-gen approach, first serializing to a file using the User class generated from schema1:

User user = new User();
user.setFirstName("Jack");
DatumWriter<User> writer = new SpecificDatumWriter<User>(User.class);
FileOutputStream out = new FileOutputStream("user.avro");
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(user, encoder);
encoder.flush();
out.close();

Then updating the schema to version 2, regenerating the User class, and attempting to read the file:

DatumReader<User> reader = new SpecificDatumReader<User>(User.class);
FileInputStream in = new FileInputStream("user.avro");
Decoder decoder = DecoderFactory.get().binaryDecoder(in, null);
User user = reader.read(null, decoder);

but it also results in an EOFException.

Just for comparison's sake, what I'm trying to do seems to work with protobufs...

format:

option java_outer_classname = "UserProto";
message User {
    optional string first_name = 1;
}

serialize:

UserProto.User.Builder user = UserProto.User.newBuilder();
user.setFirstName("Jack");
FileOutputStream out = new FileOutputStream("user.data");
user.build().writeTo(out);

add optional last_name to the format, regen UserProto, and deserialize:

FileInputStream in = new FileInputStream("user.data");
UserProto.User user = UserProto.User.parseFrom(in);

as expected, user.getLastName() is the empty string.

Can something like this be done with Avro?

Avro and Protocol Buffers have different approaches to handling versioning, and which approach is better depends on your use case.

In Protocol Buffers you have to explicitly tag every field with a number, and those numbers are stored along with the fields' values in the binary representation. Thus, as long as you never change the meaning of a number in a subsequent schema version, you can still decode a record encoded in a different schema version. If the decoder sees a tag number that it doesn't recognise, it can simply skip it.

Avro takes a different approach: there are no tag numbers; instead, the binary layout is completely determined by the program doing the encoding. This is the writer's schema. (A record's fields are simply stored one after another in the binary encoding, without any tags or separators, and their order is determined by the writer's schema.) This makes the encoding more compact, and saves you from having to manually maintain tags in the schema. But it does mean that for reading, you have to know the exact schema with which the data was written, or you won't be able to make sense of it.

While knowing the writer's schema is essential for decoding Avro, the reader's schema is a layer of niceness on top of it. If you're doing code generation in a program that needs to read Avro data, you can do the codegen off the reader's schema, which saves you from having to regenerate your code every time the writer's schema changes (assuming it changes in a way that can be resolved). But it doesn't save you from having to know the writer's schema.

Pros & Cons

Avro's approach is good in an environment where you have lots of records that are known to have the exact same schema version, because you can just include the schema in the metadata at the beginning of the file, and know that the next million records can all be decoded using that schema. This happens a lot in a MapReduce context, which explains why Avro came out of the Hadoop project.
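Concretely, Avro's object container file format stores the writer's schema in the file header, so the reader only needs its own (newer) schema; a sketch using DataFileWriter/DataFileReader with the two schemas from the question:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ContainerFileExample {
    static final Schema SCHEMA1 = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + "{\"name\": \"firstName\", \"type\": \"string\"}]}");
    static final Schema SCHEMA2 = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + "{\"name\": \"firstName\", \"type\": \"string\"},"
      + "{\"name\": \"lastName\", \"type\": \"string\", \"default\": \"\"}]}");

    public static GenericRecord writeThenRead() throws Exception {
        File file = File.createTempFile("user", ".avro");

        // write with schema version 1; the container file records this
        // writer's schema in its header
        DataFileWriter<GenericRecord> fileWriter =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA1));
        fileWriter.create(SCHEMA1, file);
        GenericRecord datum = new GenericData.Record(SCHEMA1);
        datum.put("firstName", "Jack");
        fileWriter.append(datum);
        fileWriter.close();

        // read with only schema version 2; DataFileReader picks up the
        // writer's schema from the file header and resolves against it
        GenericDatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(SCHEMA2);
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(file, datumReader)) {
            return fileReader.next();
        }
    }

    public static void main(String[] args) throws Exception {
        // lastName comes back as the schema2 default, the empty string
        System.out.println(writeThenRead().get("lastName"));
    }
}
```

This addresses the original question: no repository of old schemas is needed, because each file carries the schema it was written with.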

Protocol Buffers' approach is probably better for RPC, where individual objects are being sent over the network (as request parameters or return values). If you use Avro here, you may have different clients and different servers all with different schema versions, so you'd have to tag every binary-encoded blob with the Avro schema version it's using, and maintain a registry of schemas. At that point you might as well have used Protocol Buffers' built-in tagging.

To do what you are trying to do, you need to make the lastName field optional by allowing null values: its type should be ["null", "string"] instead of "string".
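With that change, schema2 would look like this (note that a union field's default must match the union's first branch, so with "null" first the default should be null):

```json
{"type": "record",
 "name": "User",
 "fields": [
  {"name": "firstName", "type": "string"},
  {"name": "lastName", "type": ["null", "string"], "default": null}
 ]}
```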

I have tried to work around this problem, and I am putting my approach here:

I have also tried using two schemas, where one schema is just the other schema with an additional column, using the reflection API of Avro. I have the following schemas:

Employee (having name, age, ssn)
ExtendedEmployee (extending Employee and having gender column)

I am assuming a file which earlier held Employee objects now also holds ExtendedEmployee objects, and I tried to read that file as:

    RecordHandler rh = new RecordHandler();
    // read once and keep the result, rather than deserializing twice
    Object obj = rh.readObject(employeeSchema, dbLocation);
    if (obj instanceof Employee) {
        Employee e = (Employee) obj;
        System.out.print(e.toString());
    } else {
        obj = rh.readObject(extendedEmployeeSchema, dbLocation);
        if (obj instanceof ExtendedEmployee) {
            ExtendedEmployee e = (ExtendedEmployee) obj;
            System.out.print(e.toString());
        }
    }

This solves the problem here. However, I would love to know if there is an API wherein we can give the ExtendedEmployee schema to read the objects of Employee as well.
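There is no reader that dispatches on record type automatically, but Avro's schema resolution plus aliases can let the ExtendedEmployee schema read old Employee records directly. A sketch using the generic API (the schemas here are hypothetical stand-ins for the Employee/ExtendedEmployee classes above): the alias maps the old record name, and the field default fills in gender.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class AliasExample {
    static final Schema EMPLOYEE = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Employee\", \"fields\": ["
      + "{\"name\": \"name\", \"type\": \"string\"},"
      + "{\"name\": \"age\", \"type\": \"int\"},"
      + "{\"name\": \"ssn\", \"type\": \"string\"}]}");

    // ExtendedEmployee declares an alias back to Employee, so schema
    // resolution accepts records written under the old name
    static final Schema EXTENDED = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"ExtendedEmployee\","
      + " \"aliases\": [\"Employee\"], \"fields\": ["
      + "{\"name\": \"name\", \"type\": \"string\"},"
      + "{\"name\": \"age\", \"type\": \"int\"},"
      + "{\"name\": \"ssn\", \"type\": \"string\"},"
      + "{\"name\": \"gender\", \"type\": \"string\", \"default\": \"unknown\"}]}");

    public static GenericRecord readOldAsNew() throws Exception {
        // write a record using the old Employee schema
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Encoder enc = EncoderFactory.get().binaryEncoder(out, null);
        GenericRecord emp = new GenericData.Record(EMPLOYEE);
        emp.put("name", "Jack");
        emp.put("age", 30);
        emp.put("ssn", "123-45-6789");
        new GenericDatumWriter<GenericRecord>(EMPLOYEE).write(emp, enc);
        enc.flush();

        // read it back with ExtendedEmployee as the reader's schema
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(EMPLOYEE, EXTENDED);
        return reader.read(null,
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    }

    public static void main(String[] args) throws Exception {
        // gender is filled in from the reader schema's default
        System.out.println(readOldAsNew().get("gender"));
    }
}
```

The same writer/reader schema pair should also be usable with the reflect API's two-schema reader constructor, though the writer's schema is still required in that path.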
