Benefits of batching Avro messages under one schema?

I am wondering how beneficial (performance and size-wise) it would be to batch Avro messages into one Avro message. It would have one schema for all the records instead of one per record. (Suppose schema management is not possible, so every time we send a message, we must send the schema along with it.)

For example, say we have an Avro schema representing a 'person' that has 'height', 'weight' and 'age'. Suppose we have 10 people we want to record in Avro messages. We could either send 10 separate Avro messages, each with their schema in the metadata (taking up space), or 1 Avro message with an array of people and only one schema.

I am wondering how impactful this compression would be: what is the relative size of the schema, and is it worth the trouble of architecting this compression? Or is it minimally effective, in which case it's easier to just send 10 separate messages?

Thanks in advance. Danielle

TL;DR: You very likely want to batch your messages, otherwise you would be better off emitting the data as JSON directly.

For example, let's use a Person record similar to what you suggest:

{
  "name": "Person",
  "type": "record",
  "fields": [
    {"name": "height", "type": "float"},
    {"name": "weight", "type": "float"},
    {"name": "age", "type": "int"}
  ]
}

Then, without compression (see the sketch after this list for one way to check these numbers):

  • The schema itself is ~150 bytes.
  • A random record (e.g. {"height": 213.47, "weight": 365.4, "age": 78}) is:
    • ~10 bytes when binary-encoded.
    • ~50 bytes when JSON-encoded.
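
For instance, here is a minimal sketch of one way to check these numbers, assuming Python with the fastavro library (that library choice is mine, not something the original answer specifies):

import io
import json

from fastavro import parse_schema, schemaless_writer

schema = {
    "name": "Person",
    "type": "record",
    "fields": [
        {"name": "height", "type": "float"},
        {"name": "weight", "type": "float"},
        {"name": "age", "type": "int"},
    ],
}
schema_json = json.dumps(schema)  # the text each unbatched message would carry
parsed = parse_schema(schema)
record = {"height": 213.47, "weight": 365.4, "age": 78}

# Schemaless binary Avro encoding of a single record (no schema attached).
buf = io.BytesIO()
schemaless_writer(buf, parsed, record)

print("schema:       ", len(schema_json), "bytes")          # on the order of 150 bytes
print("binary record:", len(buf.getvalue()), "bytes")       # on the order of 10 bytes
print("JSON record:  ", len(json.dumps(record)), "bytes")   # on the order of 50 bytes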

So, roughly, it's only worth using binary encoding (which requires including the schema) if you batch records 5+ at a time: each binary-encoded record saves about 40 bytes over JSON, so the ~150-byte schema pays for itself after about four records. Compression will also probably work in favor of the JSON encoding, so you'll want to batch even more.
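
To illustrate the batched case, here is a sketch under the same assumptions (fastavro's writer produces an Avro object container file, which stores the schema once in a header instead of once per record; the example values are arbitrary):

import io
import json

from fastavro import parse_schema, schemaless_writer, writer

schema = {
    "name": "Person",
    "type": "record",
    "fields": [
        {"name": "height", "type": "float"},
        {"name": "weight", "type": "float"},
        {"name": "age", "type": "int"},
    ],
}
schema_json = json.dumps(schema)
parsed = parse_schema(schema)
people = [{"height": 150.0 + i, "weight": 60.0 + i, "age": 20 + i} for i in range(10)]

# One batched message: a container file holding the schema once plus all ten records.
batched = io.BytesIO()
writer(batched, parsed, people, codec="null")
print("1 batched message:   ", len(batched.getvalue()), "bytes")

# Ten separate messages, each carrying its own copy of the schema next to the binary record.
total = 0
for person in people:
    buf = io.BytesIO()
    schemaless_writer(buf, parsed, person)
    total += len(schema_json) + len(buf.getvalue())
print("10 separate messages:", total, "bytes")

The container file does add a fixed header and sync-marker overhead of its own, so the saving comes from paying the schema cost once rather than per record.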

Of course, all this depends on your particular schema and values. For example, if your values contain large arrays or strings, the relative cost of including the schema in each message will be smaller.
