
Lambda with DynamoDB Trigger on a table Partition Key with more than 500000 distinct values

We are currently designing a DynamoDB table to store certain file attributes. There are 2 main columns:

  1. Date: contains the date in YYYYMMDD format, e.g. 20190618
  2. FileName: xxxxxxxxxxx.json

Currently the partition key is Date and the sort key is FileName. We expect about 500000 files with distinct file names each day (this can increase over time). The file names repeat each day, i.e. a typical schema is as shown below:

Date      FileName
20190617  abcd.json
20190618  abcd.json

We have a series of queries based on Date, and a DynamoDB trigger. The queries are working great. What we are currently observing is that the number of concurrent Lambda executions is limited to 2, since we are partitioning by date. While trying to improve the concurrency of the Lambda we came across 2 solutions:

1) Referring to the following link ( https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-sharding.html ), one idea is to add a fixed number of random suffixes to the Date field, i.e. 20190617.1 to 20190617.500, to split the data into 500 partitions with 1000 records each. This would ensure a degree of concurrency and there would only be minimal changes to the queries (see the sketch after option 2 below).

2) The second option is to change the partitioning of the table as follows: Partition Key: FileName and Sort Key: Date. This will result in about 500000 partitions (which can increase). For querying by date we will need to add a GSI, but we will achieve more concurrency in Lambda (a table-definition sketch also follows below).
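A minimal sketch of option 1 (write sharding with a fixed number of suffixes), assuming boto3, a table named FileAttributes, and attributes named Date and FileName; the table name and shard count are illustrative assumptions, not part of the original design:

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("FileAttributes")  # hypothetical table name
NUM_SHARDS = 500  # fixed number of suffixes per date

def put_file_item(date_str, file_name, attributes):
    """Write an item under a randomly sharded partition key, e.g. '20190617.137'."""
    shard = random.randint(1, NUM_SHARDS)
    item = {"Date": f"{date_str}.{shard}", "FileName": file_name, **attributes}
    table.put_item(Item=item)

def query_date(date_str):
    """Reading a whole day back means fanning out over every shard (pagination omitted)."""
    items = []
    for shard in range(1, NUM_SHARDS + 1):
        resp = table.query(KeyConditionExpression=Key("Date").eq(f"{date_str}.{shard}"))
        items.extend(resp["Items"])
    return items
```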
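And a sketch of option 2 (FileName as partition key, Date as sort key, with a GSI for the by-date queries); the table name, index name, and billing mode are assumptions made for illustration:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="FileAttributesByName",  # hypothetical name
    AttributeDefinitions=[
        {"AttributeName": "FileName", "AttributeType": "S"},
        {"AttributeName": "Date", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "FileName", "KeyType": "HASH"},  # partition key
        {"AttributeName": "Date", "KeyType": "RANGE"},     # sort key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "DateIndex",  # hypothetical GSI to support queries by date
            "KeySchema": [
                {"AttributeName": "Date", "KeyType": "HASH"},
                {"AttributeName": "FileName", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",  # assumption; provisioned capacity would also work
)
```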

We have not created a table with 500000 partitions (which can increase) before. Does anybody have such experience? If so, please comment.

Any help is appreciated.

You seem to be under the mistaken impression that there's a one-to-one correspondence between partition keys and partitions.

This is not the case.

The number of partitions is driven by table size and throughput. The partition key is hashed by DDB and the data is stored in a particular partition.

You could have 100k partition keys and only a single partition.

If you're pushing the limits of DDB, then yeah, you might end up with only a single partition key in a partition... but that's not typical.

The DDB Whitepaper provides some details on how DDB works...

Partitioning by file name doesn't make a lot of sense if your access pattern is to query by date.

Instead, the idea of increasing the number of partitions for each date by adding a suffix seems fine. But rather than adding a random suffix, you might consider adding a stable suffix based on the name of the file:

You could use the first letter of the file name to get about 30 partitions, assuming the file names are random. The only trouble is that some letters might be more common than others, giving skewed sub-partitions.

Or, you could take a hash of the file name and use that as the suffix for the partition key. The hash function could be a relatively simple one that produces a numeric value corresponding to the number of sub-partitions you would like to have for each date.
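A minimal sketch of such a stable suffix, assuming 50 sub-partitions per date and MD5 as the stand-in hash function (both are arbitrary choices, not from the answer):

```python
import hashlib

NUM_SHARDS = 50  # assumed number of sub-partitions per date

def shard_suffix(file_name: str) -> int:
    """Map a file name to a stable shard number in [0, NUM_SHARDS)."""
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def partition_key(date_str: str, file_name: str) -> str:
    """Build a key like '20190617.<shard>'; the same file always lands in the same shard."""
    return f"{date_str}.{shard_suffix(file_name)}"
```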

If you end up with about 10000-50000 items per partition, it would probably be great.

Hope this helps.
