
What is the problem of evolving a data schema of a serialization framework (like ProtoBuf, Thrift, etc.) by always adding new fields?

I'm writing a simple serialization framework. My intention is not to compete with ProtoBuf, Thrift, Avro, etc., far from that. My goal is to learn.

My question is related to evolving the schema of my data objects.

Let's say I have two programs, A and B, that need to exchange data and I have a data object represented by the schema below:

public byte[] accountUser = new byte[8]; // first field

Great! Now I want to go ahead and add an accountId field to my data object schema:

public byte[] accountUser = new byte[8]; // first field
public int accountId = -1; // second field just added (NEW FIELD)

Scenario 1:

Program A has the new schema with accountId and Program B does not.

Program A sends the data object to Program B.

Program B will simply read the data up to accountUser and totally ignore accountId: it knows nothing about the field, because it was never updated to the latest data object schema that includes accountId.

Everything works!

Scenario 2:

Program A has the old schema without accountId and Program B has the new schema with accountId.

Program A sends the data object to Program B.

Program B will read the data up to accountUser and then try to read the new accountId. But there is nothing left to read: the data ends after accountUser. So Program B simply assumes the default value of -1 for accountId and moves on. I will most probably have logic to deal with a -1 accountId coming from legacy systems that still use the old schema.

Everything works!
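Scenario 2 can be sketched as a small decoder. This is a hypothetical sketch (the class and method names are mine, not from any real framework): it reads the fixed-size accountUser field, then reads accountId only if bytes remain, otherwise falls back to the -1 default.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class Scenario2 {
    // Hypothetical decoder: reads the wire format described above and
    // falls back to the default (-1) when the stream ends before the
    // new accountId field, i.e. when the sender uses the old schema.
    static int readAccountId(byte[] wire) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        byte[] accountUser = new byte[8];
        in.readFully(accountUser);      // first field, always present
        if (in.available() >= 4) {
            return in.readInt();        // second field, present in the new schema
        }
        return -1;                      // legacy sender: assume the default
    }

    public static void main(String[] args) throws IOException {
        byte[] oldWire = new byte[8];   // only accountUser
        byte[] newWire = new byte[12];  // accountUser + accountId
        newWire[11] = 42;               // accountId = 42 (big-endian int)
        System.out.println(readAccountId(oldWire)); // -1
        System.out.println(readAccountId(newWire)); // 42
    }
}
```

This only works because new fields are appended and the old fields are fixed-size, which is exactly the assumption the question is making.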

So what is really the problem of this simple approach for schema evolution? It is not perfect I know, but can't it be successfully used? I just have to assume that I will never remove any field and that I will never mess with the order of the fields. I just keep adding more fields.

Adding new fields by itself isn't a problem, as long as the protocol is field-based via some kind of header. Obviously, if it is size/blit based, there will be a problem, because a reader will consume the wrong amount of data per record. Adding fields is exactly how most protocols evolve, so that part isn't a problem; but the decoder does need to know, in advance, how to ignore a field that it doesn't understand. Does it skip some fixed number of bytes? Does it look for a closing sentinel? Something else? As long as your decoder knows how to skip every possible kind of field it doesn't know about, you're fine.
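A field-based protocol of the kind described above is often a tag-length-value (TLV) layout. The sketch below is illustrative, not any real framework's format: each field carries a 1-byte tag and a 4-byte length, so a decoder can skip a field it doesn't recognize just by jumping over `length` bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class TlvDecoder {
    // Hypothetical tag-length-value layout: each field is
    //   1 byte tag, 4-byte big-endian length, then `length` bytes of payload.
    // The decoder keeps fields it knows and skips the rest by length.
    static Map<Integer, byte[]> decode(byte[] wire) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        Map<Integer, byte[]> fields = new HashMap<>();
        while (in.available() > 0) {
            int tag = in.readUnsignedByte();
            int length = in.readInt();
            if (tag == 1 || tag == 2) {       // tags this version understands
                byte[] value = new byte[length];
                in.readFully(value);
                fields.put(tag, value);
            } else {
                in.skipBytes(length);         // unknown field: skip by length
            }
        }
        return fields;
    }
}
```

Because the length travels with the field, an old decoder can safely step over any field a newer sender adds, which is the property the answer is asking for.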

You also shouldn't assume simple incremental fields, IMO. I have seen, in real-world scenarios, a structure branched in two different ways by different teams and then recombined, so every combination of

  • A
  • A, B
  • A, C
  • A, B, C

(where B and C are different sets of additional fields) is possible.

I just have to assume that I will never remove any field and that I will never mess with the order of the fields.

This happens. You need to deal with it; or accept that you're solving a simpler problem, so your solution is simpler.

What is the problem of evolving a data schema of a serialization framework by always adding new fields?

This is your file. All binary data.

+------------------+
| accountUser      | 
| accountId        | 
+------------------+
| accountUser      | 
| accountId        | 
+------------------+
| accountUser      | 
| accountId        | 
+------------------+

Now have your old client (the one that does not know about accountId) read the entries.
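The failure can be shown with arithmetic alone. In this hypothetical sketch (the record sizes follow the schema above: 8 bytes of accountUser plus a 4-byte accountId), the old client believes each record is 8 bytes, so from the second record onward it starts reading inside the previous record's accountId.

```java
public class FixedSizeDrift {
    static final int OLD_RECORD = 8;   // accountUser only
    static final int NEW_RECORD = 12;  // accountUser + accountId

    // Offset at which the old client reads record i vs. its true start.
    static int oldOffset(int i)  { return i * OLD_RECORD; }
    static int trueOffset(int i) { return i * NEW_RECORD; }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println("record " + i + ": old client reads at "
                + oldOffset(i) + ", real start is " + trueOffset(i));
        }
        // From record 1 onward the offsets diverge: the old client
        // silently reads accountId bytes as account data.
    }
}
```

This is the size/blit-based case the earlier answer warns about: without a per-field header there is nothing telling the old client how many bytes to skip.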

What is the problem of evolving a data schema of a serialization framework by always adding new fields?

Your underlying premise is wrong. In practice it will also happen that you ...

  • remove/deprecate fields
  • do not write optional fields that have no value
  • need to skip unknown fields (like above)

These frameworks solve more than just one problem.

One scenario where I've seen adding fields still cause a failure was with a union. Let's see an example:

Union {
  string firstName
  string lastName
}

Now let's say we add string middleName. Program A is on the latest schema and sends Program B a message with middleName set, but it fails: Program B doesn't know middleName, so when it deserializes the object every field it knows is null, which is not valid for a union object, leading to failure.
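The union failure mode can be sketched as a tagged decoder. This is an illustrative sketch, not Thrift's actual wire handling; the class name, tags, and the string-based value are my assumptions. A union must have exactly one member set, so an unknown tag leaves no member set and the decoder must reject the message rather than return a half-formed object.

```java
import java.util.Optional;

public class UnionDecode {
    // Hypothetical union with two known members, tagged 1 and 2.
    // A union must have exactly one member set; an unknown tag
    // (e.g. middleName, tag 3, from a newer sender) sets nothing,
    // so this decoder signals failure with Optional.empty().
    static Optional<String> decode(int tag, String value) {
        switch (tag) {
            case 1:  return Optional.of("firstName=" + value);
            case 2:  return Optional.of("lastName=" + value);
            default: return Optional.empty(); // unknown member: invalid union
        }
    }
}
```

Contrast this with a struct, where an unknown field can simply be skipped: the union's "exactly one member" invariant is what turns an otherwise harmless new field into a failure for old readers.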
