What is the problem of evolving a data schema of a serialization framework (e.g. ProtoBuf, Thrift) by always adding new fields?

Question

I'm writing a simple serialization framework. My intention is not to compete with ProtoBuf, Thrift, Avro, etc.; far from it. My goal is to learn.

My question is about evolving the schema of my data objects.

Let's say I have two programs, A and B, that need to exchange data, and I have a data object represented by the schema below:

public byte[] accountUser = new byte[8]; // first field

Great! Now I want to go ahead and add an accountId field to my data object schema:

public byte[] accountUser = new byte[8]; // first field
public int accountId = -1; // second field just added (NEW FIELD)

Scenario 1:

Program A has the new schema with accountId and Program B does not.

Program A sends the data object to Program B.

Program B will simply read the data up to accountUser and totally ignore accountId. It knows nothing about the new field because it wasn't updated to use the latest data object schema.

Everything works!

Scenario 2:

Program A has the old schema without accountId and Program B has the new schema with accountId.

Program A sends the data object to Program B.

Program B will read the data up to accountUser and then try to read the new accountId. But there is nothing more to read in the data object received: no more data after accountUser. So Program B simply assumes the default null value of -1 for accountId and moves on with its life. I will most probably have logic to deal with a -1 accountId coming from legacy systems still operating with the old schema.

Everything works!
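The two scenarios above can be sketched roughly as follows. This is only an illustration of the append-only idea, not a finished design: the class name AppendOnlyAccount, the method names, and the big-endian 4-byte encoding of accountId are assumptions made for the example. It also assumes each data object arrives as its own, separately delimited message, so the reader can tell where the object ends.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of the append-only scheme from the question.
final class AppendOnlyAccount {
    static final int USER_LEN = 8;            // accountUser is always 8 bytes
    static final int MISSING_ACCOUNT_ID = -1; // default when the sender did not write accountId

    final byte[] accountUser;
    final int accountId;

    AppendOnlyAccount(byte[] accountUser, int accountId) {
        this.accountUser = accountUser;
        this.accountId = accountId;
    }

    // Old schema writer: only accountUser.
    static byte[] encodeV1(byte[] accountUser) {
        return Arrays.copyOf(accountUser, USER_LEN);
    }

    // New schema writer: accountUser followed by accountId.
    static byte[] encodeV2(byte[] accountUser, int accountId) {
        return ByteBuffer.allocate(USER_LEN + Integer.BYTES)
                .put(Arrays.copyOf(accountUser, USER_LEN))
                .putInt(accountId)
                .array();
    }

    // Old schema reader (Scenario 1): reads the first field and ignores any trailing bytes.
    static byte[] decodeV1(byte[] message) {
        return Arrays.copyOf(message, USER_LEN);
    }

    // New schema reader (Scenario 2): falls back to -1 when the message ends early.
    static AppendOnlyAccount decodeV2(byte[] message) {
        ByteBuffer in = ByteBuffer.wrap(message);
        byte[] user = new byte[USER_LEN];
        in.get(user);
        int id = in.remaining() >= Integer.BYTES ? in.getInt() : MISSING_ACCOUNT_ID;
        return new AppendOnlyAccount(user, id);
    }
}
```

Calling decodeV2 on the output of encodeV1 yields an accountId of -1, and decodeV1 on the output of encodeV2 returns just the 8-byte accountUser. Both only work because the message boundary is known from outside the payload.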

So what is really the problem with this simple approach to schema evolution? It is not perfect, I know, but can't it be used successfully? I just have to assume that I will never remove any field and that I will never mess with the order of the fields. I just keep adding more fields.

Answer 1

Adding new fields is not a problem in itself, as long as the protocol is field-based via some kind of header. Obviously, if it is size/blit based, there will be a problem, because the decoder will read the wrong amount of data per record. Adding fields is exactly how most protocols work, so it isn't a problem, but the decoder does need to know, in advance, how to ignore a field that it doesn't understand. Does it skip some fixed number of bytes? Does it look for some closing sentinel? Something else? As long as your decoder knows how to ignore every possible kind of field that it doesn't know about, you're fine.
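One way to give the decoder that knowledge is a per-field header carrying a tag and a length, so an unknown field can be skipped by its byte count. The sketch below is a minimal illustration under assumed conventions: the class TlvCodec, the 2-byte tag, and the 4-byte length prefix are all made up for this example and are not the wire format of ProtoBuf, Thrift, or any other framework.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical tag-length-value codec: every field is [2-byte tag][4-byte length][payload].
final class TlvCodec {
    static final int HEADER_LEN = Short.BYTES + Integer.BYTES;

    // Encodes one field with its header.
    static byte[] field(int tag, byte[] payload) {
        return ByteBuffer.allocate(HEADER_LEN + payload.length)
                .putShort((short) tag)
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    // Concatenates already-encoded fields into one message.
    static byte[] message(byte[]... fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] f : fields) {
            out.write(f, 0, f.length);
        }
        return out.toByteArray();
    }

    // Decodes the fields this reader knows about and skips everything else by length.
    static Map<Integer, byte[]> decode(byte[] message, Set<Integer> knownTags) {
        ByteBuffer in = ByteBuffer.wrap(message);
        Map<Integer, byte[]> fields = new HashMap<>();
        while (in.remaining() >= HEADER_LEN) {
            int tag = in.getShort() & 0xFFFF;
            int len = in.getInt();
            if (len < 0 || len > in.remaining()) {
                throw new IllegalArgumentException("truncated field " + tag);
            }
            if (knownTags.contains(tag)) {
                byte[] payload = new byte[len];
                in.get(payload);
                fields.put(tag, payload);
            } else {
                in.position(in.position() + len); // unknown field: skip its payload
            }
        }
        return fields;
    }
}
```

Because each field names itself, a reader that only knows tag 1 can decode a message containing tags 1, 2 and 3, or tags 1 and 3 with tag 2 absent, without caring about the order in which fields were added to the schema.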

You also shouldn't assume simple incremental fields, IMO. I have seen real-world scenarios where a structure was branched in two different ways by different teams and then recombined, so that every combination of

A
A, B
A, C
A, B, C

(where B and C are different sets of additional fields) is possible.

"I just have to assume that I will never remove any field and that I will never mess with the order of the fields."

This happens. You need to deal with it, or accept that you're solving a simpler problem, so your solution is simpler.

Answer 2

"What is the problem of evolving a data schema of a serialization framework by always adding new fields?"

This is your file. All binary data.

+------------------+
| accountUser      | 
| accountId        | 
+------------------+
| accountUser      | 
| accountId        | 
+------------------+
| accountUser      | 
| accountId        | 
+------------------+

Now have your old client (the one that does not know about accountId) read the entries.
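To make that concrete, here is a small sketch of what the old client would see. The class FixedSizeRecordFile and its method names are hypothetical, and the layout (8-byte accountUser followed by a 4-byte big-endian accountId, records stored back to back with no header) is an assumption matching the diagram above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical file of back-to-back records with no per-record header.
final class FixedSizeRecordFile {
    static final int USER_LEN = 8;

    // New writer: every record is accountUser (8 bytes) + accountId (4 bytes).
    static byte[] writeV2(String[] users, int[] ids) {
        ByteBuffer out = ByteBuffer.allocate(users.length * (USER_LEN + Integer.BYTES));
        for (int i = 0; i < users.length; i++) {
            out.put(Arrays.copyOf(users[i].getBytes(StandardCharsets.US_ASCII), USER_LEN));
            out.putInt(ids[i]);
        }
        return out.array();
    }

    // Old reader: believes every record is exactly 8 bytes of accountUser.
    static List<String> readV1(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);
        List<String> users = new ArrayList<>();
        byte[] user = new byte[USER_LEN];
        while (in.remaining() >= USER_LEN) {
            in.get(user);
            // ISO-8859-1 maps each byte to one char, so stray id bytes stay visible.
            users.add(new String(user, StandardCharsets.ISO_8859_1));
        }
        return users;
    }
}
```

With two 12-byte records in the file, the old reader returns three 8-byte "records". Only the first is correct; from the second one onwards it is reading accountId bytes as if they were part of accountUser, because nothing tells it where each record ends.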

"What is the problem of evolving a data schema of a serialization framework by always adding new fields?"

Your underlying premise is wrong. In practice it will also happen that you ...

remove/deprecate fields
do not write optional fields that have no value
need to skip unknown fields (like above)

These frameworks solve more than just one problem.

Answer 3

One scenario where I've seen adding fields still cause a failure involved a union. Let's look at an example:

Union {
  string firstName
  string lastName
}

Now let's say we add string middleName. Program A has the latest schema and sends Program B a union with middleName set. This fails: Program B doesn't know middleName, so when it tries to deserialize the object, all the fields it does know are null, which is not a valid state for a union.
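A rough sketch of that failure, from the point of view of the older Program B, might look like this. The class NameUnionV1, the one-byte member tag, and the length-prefixed UTF-8 encoding are assumptions made for illustration and do not reproduce any particular framework's union encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical old-schema view of the union: it knows firstName and lastName only.
final class NameUnionV1 {
    static final int FIRST_NAME = 1;
    static final int LAST_NAME = 2;
    // Tag 3 (middleName) exists only in the newer schema used by Program A.

    String firstName;
    String lastName;

    // Wire format (assumed): [1-byte member tag][4-byte length][UTF-8 bytes].
    static byte[] encode(int tag, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + Integer.BYTES + utf8.length)
                .put((byte) tag)
                .putInt(utf8.length)
                .put(utf8)
                .array();
    }

    static NameUnionV1 decode(byte[] message) {
        ByteBuffer in = ByteBuffer.wrap(message);
        int tag = in.get();
        byte[] utf8 = new byte[in.getInt()];
        in.get(utf8);
        String value = new String(utf8, StandardCharsets.UTF_8);

        NameUnionV1 union = new NameUnionV1();
        switch (tag) {
            case FIRST_NAME: union.firstName = value; break;
            case LAST_NAME:  union.lastName = value;  break;
            default: break; // unknown member: skipped like any other unknown field
        }
        // A union must have exactly one member set; skipping the unknown one leaves none.
        if (union.firstName == null && union.lastName == null) {
            throw new IllegalStateException("union has no member set (unknown tag " + tag + ")");
        }
        return union;
    }
}
```

Skipping the unknown member is harmless for an ordinary struct, but here it leaves the union empty, so decoding a message with tag 3 throws even though the field was only ever added, never removed or reordered.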

3 answers

Answer 1
1 (accepted) 2021-06-26 20:02:48

Answer 2
0 2021-07-13 22:06:18

Answer 3
0 2021-11-03 06:31:38
