Benefits of batching Avro messages under one schema?

Question

I am wondering how beneficial (performance- and size-wise) it would be to batch Avro messages into one Avro message. It would carry one schema for all the records instead of one per record. (Suppose schema management is not possible, so every time we send a message, we must send the schema along with it.)

For example, say we have an Avro schema representing a 'person' that has 'height', 'weight' and 'age'. Suppose we have 10 people we want to record in Avro messages. We could either send 10 separate Avro messages, each with its schema in the metadata (taking up space), or 1 Avro message with an array of people and only one schema.

I am wondering how much this would save: how large is the schema relative to the data, and is it worth the trouble of architecting this batching? Or is the effect minimal, in which case it's easier to just send 10 separate messages?

Thanks in advance. Danielle

Answer 1

TL;DR: You very likely want to batch your messages; otherwise you would be better off emitting the data as JSON directly.

For example, let's use a Person record similar to what you suggest:

{
  "name": "Person",
  "type": "record",
  "fields": [
    {"name": "height", "type": "float"},
    {"name": "weight", "type": "float"},
    {"name": "age", "type": "int"}
  ]
}

Then, without compression:

The schema itself is ~150 bytes.
A random record (e.g. {"height": 213.47, "weight": 365.4, "age": 78}) is:
- ~10 bytes when binary-encoded.
- ~50 bytes when JSON-encoded.
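These sizes can be checked by hand. The following is a minimal sketch, not the Avro library: it assumes only the encoding rules from the Avro specification (a float is 4 little-endian bytes, an int is a zigzag varint, and record fields are written in schema order with no names or delimiters). The helpers `encode_int` and `encode_person` are hypothetical names introduced for illustration.

```python
import json
import struct

# Schema copied from the answer above.
SCHEMA = {
    "name": "Person",
    "type": "record",
    "fields": [
        {"name": "height", "type": "float"},
        {"name": "weight", "type": "float"},
        {"name": "age", "type": "int"},
    ],
}


def encode_int(n: int) -> bytes:
    """Encode a 32-bit int the way Avro does: zigzag, then varint."""
    z = (n << 1) ^ (n >> 31)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_person(height: float, weight: float, age: int) -> bytes:
    """Concatenate the fields in schema order; no field names are written."""
    return struct.pack("<f", height) + struct.pack("<f", weight) + encode_int(age)


record = {"height": 213.47, "weight": 365.4, "age": 78}

binary_size = len(encode_person(**record))
json_size = len(json.dumps(record).encode("utf-8"))
schema_size = len(json.dumps(SCHEMA, separators=(",", ":")).encode("utf-8"))

print("binary record:", binary_size, "bytes")
print("JSON record:  ", json_size, "bytes")
print("schema (compact JSON):", schema_size, "bytes")
```

The binary record comes out at 10 bytes (4 + 4 + 2). The JSON and schema sizes depend on whitespace, which is why the figures above are given as approximations.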

So, roughly, binary encoding (which requires including the schema) is only worth using if you batch 5 or more records at a time. Compression will probably also favor the JSON encoding, since the repeated field names compress well, so you'll want to batch even more.
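The break-even point follows from simple arithmetic: a binary message of N records costs roughly schema + N × 10 bytes, while N JSON records cost roughly N × 50 bytes. Here is a small sketch of that calculation, using the approximate sizes quoted above; `min_batch_size` is a hypothetical helper, not part of any Avro API.

```python
def min_batch_size(schema_overhead: int, binary_record: int, json_record: int) -> int:
    """Smallest N such that schema_overhead + N * binary_record < N * json_record."""
    saving_per_record = json_record - binary_record
    if saving_per_record <= 0:
        raise ValueError("binary encoding never wins if it is not smaller per record")
    # Need N > schema_overhead / saving_per_record, with N an integer.
    return schema_overhead // saving_per_record + 1


# Approximate figures from the answer: ~150-byte schema, ~10 vs ~50 bytes per record.
print(min_batch_size(150, 10, 50))
```

With these rounded inputs the threshold is 4 records. Any extra per-message overhead beyond the schema text (for example, the header and sync markers of an Avro container file) pushes it higher, which is consistent with the rough "5 or more" figure.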

Of course, all this depends on your particular schema and values. For example, if your values contain large arrays or strings, the relative cost of including the schema in each message will be smaller.
