
Is there a DynamoDB max partition size of 10GB for a single partition key value?

I've read lots of DynamoDB docs on designing partition keys and sort keys, but I think I must be missing something fundamental.

If you have a bad partition key design, what happens when the data for a SINGLE partition key value exceeds 10GB?

The 'Understand Partition Behaviour' section states:

"A single partition can hold approximately 10 GB of data" “单个分区可容纳大约10 GB的数据”

How can it partition a single partition key?

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions

The docs also talk about limits with local secondary indexes, where an item collection is limited to 10GB of data, after which you start getting errors.

"The maximum size of any item collection is 10 GB. This limit does not apply to tables without local secondary indexes; only tables that have one or more local secondary indexes are affected." “任何项目集合的最大大小为10 GB。此限制不适用于没有本地二级索引的表;只有具有一个或多个本地二级索引的表才会受到影响。”

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LSI.html#LSI.ItemCollections

That I can understand. So does it have some other magic for partitioning the data for a single partition key if it exceeds 10GB? Or does it just keep growing that partition? And what are the implications of that for your key design?

The background to the question is that I've seen lots of examples of using something like a TenantId as a partition key in a multi-tenant environment. But that seems limiting if a specific TenantId could have more than 10 GB of data.

I must be missing something?

TL;DR - items can be split even if they have the same partition key value, by including the range key value in the partitioning function.


The long version:

This is a very good question, and it is addressed in the documentation here and here. As the documentation states, items in a DynamoDB table are partitioned based on their partition key value (which used to be called the hash key) into one or multiple partitions, using a hashing function. The number of partitions is derived from the maximum desired total throughput, as well as the distribution of items in the key space. In other words, if the partition key is chosen such that it distributes items uniformly across the partition key space, the partitions end up having approximately the same number of items each. This number of items in each partition is approximately equal to the total number of items in the table divided by the number of partitions.
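
For a rough sense of scale, the (now archived) guidelines page linked in the question described an approximate way to estimate the partition count from provisioned throughput and table size. The sketch below simply restates that back-of-the-envelope calculation in Python; the 3,000 RCU, 1,000 WCU and 10 GB per-partition figures come from that documentation, and the actual service may size partitions differently today.

    import math

    # Back-of-the-envelope partition estimate based on the old
    # "Understand Partition Behavior" guidance: a partition supports
    # roughly 3,000 RCU or 1,000 WCU and holds about 10 GB.
    def estimate_partitions(read_capacity_units, write_capacity_units, table_size_gb):
        by_throughput = math.ceil(read_capacity_units / 3000 + write_capacity_units / 1000)
        by_size = math.ceil(table_size_gb / 10)
        return max(by_throughput, by_size, 1)

    # Example: 7,500 RCU, 1,000 WCU and 80 GB of data -> size-bound, ~8 partitions
    print(estimate_partitions(7500, 1000, 80))  # 8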

The documentation also states that each partition is limited to about 10GB of space. And that once the sum of the sizes of all items stored in any partition grows beyond 10GB, DynamoDB will start a background process that will automatically and transparently split such partitions in half - resulting in two new partitions. Once again, if the items are distributed uniformly, this is great because each new sub-partition will end up holding roughly half the items in the original partition.

An important aspect of splitting is that each of the split partitions will have half of the throughput that would have been available to the original partition.
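
As a quick numeric illustration of that claim (the numbers here are made up): suppose a table provisioned with 1,000 WCU is spread over 4 partitions, and one of those partitions splits.

    # Assumed numbers, purely to illustrate the halving described above.
    table_wcu = 1000
    partitions = 4
    per_partition_wcu = table_wcu / partitions   # 250 WCU available to the original partition
    after_split_wcu = per_partition_wcu / 2      # 125 WCU for each of the two new halves
    print(per_partition_wcu, after_split_wcu)    # 250.0 125.0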

So far we've covered the happy case.

On the flip side it is possible to have one, or a few, partition key values that correspond to a very large number of items. This can usually happen if the table schema uses a sort key and several items hash to the same partition key. In such a case, it is possible that a single partition key could be responsible for items that together take up more than 10 GB. And this will result in a split. In this case DynamoDB will still create two new partitions, but instead of using only the partition key to decide which sub-partition an item should be stored in, it will also use the sort key.
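
To make the idea concrete, here is a toy routing function - emphatically not DynamoDB's real algorithm, and every name in it is made up - showing how a sort key can be folded into the placement decision once a single partition key value has been split:

    import hashlib

    def key_hash(*parts):
        # Stand-in for DynamoDB's internal hash function.
        digest = hashlib.md5("|".join(str(p) for p in parts).encode()).hexdigest()
        return int(digest, 16)

    def route(partition_key, sort_key, hot_key_split_points):
        # hot_key_split_points maps a hot partition key value to the sort key
        # value at which its item collection was split, e.g. {"F": 500}.
        split_at = hot_key_split_points.get(partition_key)
        if split_at is None:
            # Normal case: placement depends on the partition key alone.
            return ("partition", key_hash(partition_key))
        # Hot case: the sort key decides which half of the split the item lands in.
        side = "low" if sort_key < split_at else "high"
        return ("sub-partition", key_hash(partition_key), side)

    print(route("A", 3, {"F": 500}))     # routed by partition key only
    print(route("F", 42, {"F": 500}))    # low half of the split
    print(route("F", 777, {"F": 500}))   # high half of the split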

Example

Without loss of generality and to make things easier to reason about, imagine that there is a table where partition keys are letters (A-Z), and numbers are used as sort keys.

Imagine that the table has about 9 partitions, so letters A,B,C would be stored in partition 1, letters D,E,F would be in partition 2, etc.

In the diagram below, the partition boundaries are marked h(A0), h(D0), etc. to show that, for instance, the items stored in the first partition are the items whose partition key hashes to a value between h(A0) and h(D0) - the 0 is intentional, and comes in handy next.

[ h(A0) ]-------[ h(D0) ]----------[ h(G0) ]--------[ h(J0) ]-------[ h(M0) ]- ..
  |   A    B    C   |   D    E    F    |   G      I     |   J   K   L   |
  |   1    1    1   |   1    1    1    |   1      1     |   1   1   1   |
  |   2    2    2   |   2    2    2    |          2     |       2       |
  |   3         3   |   3         3    |          3     |               |
  ..                ..                 ..               ..              ..
  |                 |  100       500   |                |               |
  +-----------------+------------------+----------------+---------------+-- ..

Notice that for most partition key values, there are between 1 and 3 items in the table, but there are two partition key values, D and F, that are not looking too good. D has 100 items while F has 500 items.

If items with a partition key value of F keep getting added, eventually the partition [h(D0)-h(G0)) will split. To make it possible to split the items that have the same hash key, the range key values will have to be used, so we'll end up with the following situation:

..[ h(D0) ]------------/ [ h(F500) ] / ----------[ h(G0) ]- ..
      |    D     E     F      |           F         |
      |    1     1     1      |          501        |
      |    2     2     2      |          502        |
      |    3           3      |          503        |
      ..                      ..                    ..
      |   100         500     |         1000        |
.. ---+-----------------------+---------------------+--- ..

The original partition [h(D0)-h(G0)) was split into [h(D0)-h(F500)) and [h(F500)-h(G0)).

I hope this helps to visualize that items are generally mapped to partitions based on a hash value obtained by applying a hashing function to their partition key value, but if need be, the value being hashed can include the partition key + a sort key value as well.
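
Tying this back to the TenantId question: as long as the table has a composite key, DynamoDB has a sort key it can fold into the split, so a single tenant growing past 10 GB is not a hard wall (outside of the local secondary index item-collection limit quoted above). Purely as a hypothetical sketch - the table and attribute names below are invented for illustration - such a schema could be declared with boto3 like this:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Hypothetical multi-tenant table: TenantId as the partition key and an
    # ItemId sort key, giving DynamoDB a range key to split on if one tenant's
    # data grows beyond a single partition.
    dynamodb.create_table(
        TableName="TenantData",
        AttributeDefinitions=[
            {"AttributeName": "TenantId", "AttributeType": "S"},
            {"AttributeName": "ItemId", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "TenantId", "KeyType": "HASH"},   # partition key
            {"AttributeName": "ItemId", "KeyType": "RANGE"},    # sort key
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )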


 