
Storage cost / supportability / performance tradeoffs using compact attributes in DynamoDB

I'm working on a large-scale component that generates unique/opaque tokens representing business entities. Over time there will be many billions of these records, but for the first year we're not expecting growth to exceed 2 billion individual items (probably less than 500 million).

The system itself is horizontally scaled but needs token generation to be idempotent; data integrity is maintained by a contained but reasonably complex combination of transactional writes with embedded condition expressions and standalone condition-check write items.

The tokens themselves are UUIDs and, for efficiency, are persisted as Binary attribute values (16 bytes) rather than the string representation (36 bytes). The downside is that the data doesn't visualise nicely in query consoles, making support hard if we encounter any bugs and/or broken data. Note there is no extra code complexity, since we implement the attributevalue.Marshaler interface to bind UUID (language) types to DynamoDB Binary attributes, and do the same for any composite attributes.
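The packing the marshaller performs is just a hex round trip between the 36-character canonical form and 16 raw bytes. A minimal stdlib-only sketch (function names are illustrative, not from the AWS SDK):

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// uuidToBytes converts the canonical 36-char UUID string to its 16-byte form.
func uuidToBytes(s string) ([]byte, error) {
	return hex.DecodeString(strings.ReplaceAll(s, "-", ""))
}

// bytesToUUID formats 16 raw bytes back into the canonical 8-4-4-4-12 layout.
func bytesToUUID(b []byte) string {
	h := hex.EncodeToString(b)
	return fmt.Sprintf("%s-%s-%s-%s-%s", h[0:8], h[8:12], h[12:16], h[16:20], h[20:32])
}

func main() {
	b, _ := uuidToBytes("123e4567-e89b-12d3-a456-426614174000")
	fmt.Println(len(b))         // 16
	fmt.Println(bytesToUUID(b)) // 123e4567-e89b-12d3-a456-426614174000
}
```

A real marshaller would wrap `uuidToBytes` inside `MarshalDynamoDBAttributeValue` and return the bytes as a Binary attribute member; the conversion itself is the same.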

My question relates (mostly) to data size/saving. The tokens are the partition keys, and some mapping columns are [token] -> [other token composite attributes], for example two UUIDs concatenated together into 32 bytes.

I wanted to keep really tight control over storage costs knowing that, over time, we will be spending ~$0.25/GB per month on this. My question is really three parts:

  1. Are the PK/SK index sizes 'reserved' (i.e. padded), so that it would make no difference at all to storage cost if we compress the overall field sizes down to the minimum possible size? (... I read somewhere that 100 bytes is typically reserved.)

If they ARE padded, the cost savings for the data would be reasonably high, because each (tree) index node will be nearly as big as the data being mapped. (I assume a tree index is used once the hashed PK has routed the query to the right server node/disk etc.)

  2. Is there any observable query-time performance benefit to compacting 36 bytes into 16 (beyond saving a few bytes across the network)? i.e. if DynamoDB has to read fewer pages it'll work faster, but in practice are we talking microseconds at best?

This is a secondary concern, but is worth considering if there is a lot of concurrent access to the data. UUID keys will distribute items across partitions, but inevitably some partitions will sometimes be more active than others.

  3. Are there any tools that can parse bytes back into human-readable UUIDs (or that we can customise to inject behaviour to do this)?

This is a concern, because making things small and efficient is fine, but supporting and resolving data issues will be difficult without significant tooling investment, and (unsurprisingly) the DynamoDB console, the DynamoDB IntelliJ plugin and AWS NoSQL Workbench all garble the binary into unreadable characters.

No, the PK/SK types are not padded. There's 100 bytes of overhead per item stored.

Sending less data certainly won't hurt your performance, but don't expect a noticeable improvement. If shorter values can keep your items at 1,024 bytes instead of 1,025 bytes, then you save yourself a write unit on each save.
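That boundary is just ceiling division on the item size. A quick sketch of the arithmetic for standard (non-transactional) writes, which cost 1 WCU per 1 KB:

```go
package main

import "fmt"

// writeUnits returns the write capacity units consumed by a standard write
// of an item of the given size: 1 WCU per 1 KB, rounded up.
func writeUnits(itemBytes int) int {
	return (itemBytes + 1023) / 1024
}

func main() {
	fmt.Println(writeUnits(1024)) // 1 WCU
	fmt.Println(writeUnits(1025)) // 2 WCUs
}
```

So trimming 20 bytes of attribute data only matters for cost when it moves an item across a 1 KB boundary.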

For the "garbled" binary values, I assume you're looking at the base64-encoded values. Base64 is a standard binary-to-text encoding that can be reversed by lots of tooling (now that you know the name of it).
