
DynamoDB Query in a tight loop or scan?

Here is my basic data structure (or the relevant portions anyway) in DynamoDB: I have a files table that holds file data and has an id for each file. I also have a 'Definitions' table that holds items defined in the file. Definitions also have an id (as the primary key) as well as a field called 'SourceFile' that references the file id in order to tie the definition to its source file.
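
For concreteness, the items look roughly like this (the attribute values and extra fields are just made up for illustration):

```python
# Hypothetical example items, only to illustrate the two tables' shapes.
file_item = {
    'id': 'file-123',          # primary key of the files table
    'path': 'src/widgets.h',   # ...plus whatever other file data is stored
}

definition_item = {
    'id': 'def-456',           # primary key of the Definitions table
    'SourceFile': 'file-123',  # ties the definition to its source file
    'name': 'Widget',          # ...plus the rest of the definition data
}
```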

Most of the time I just want to get the definition by its id and optionally fetch the file later, which works just fine. However, in some cases I need to get all definitions for a set of files. I can do this with a scan, but it's slow, it will only get slower as the table grows, and scanning isn't recommended. I'm not sure how to do this with a query, though.

I can create a GSI that uses the SourceFile field as its partition key and query against that. This sounds like an answer (and it may be), but I'm not sure. The problem is that some libraries may have 5k or 10k files (maybe more in rare cases). With a GSI I can only query against one file id per query, so I would have to issue a separate query for each file, and I can't imagine it's going to be very efficient to throw 10k queries at DynamoDB...
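
For reference, a per-file query against such a GSI would look something like this sketch (boto3; the 'Definitions' table name and 'SourceFile-index' index name are assumptions, not my real schema):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
definitions = dynamodb.Table('Definitions')  # assumed table name

def definitions_for_file(file_id):
    """Query the (assumed) GSI on SourceFile for one file, following pagination."""
    items = []
    kwargs = {
        'IndexName': 'SourceFile-index',
        'KeyConditionExpression': Key('SourceFile').eq(file_id),
    }
    while True:
        resp = definitions.query(**kwargs)
        items.extend(resp['Items'])
        last_key = resp.get('LastEvaluatedKey')
        if not last_key:
            return items
        kwargs['ExclusiveStartKey'] = last_key
```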

Is it better to create a tight loop (or multiple threads) and hit the table with a ton of queries, or to scan it? Is there another way to do this that I'm not thinking of?
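
If it does come down to a tight loop, the best I've come up with so far is fanning the per-file queries out over a thread pool, roughly like this (it reuses the hypothetical definitions_for_file helper from above; the worker count is arbitrary, and boto3 resource objects aren't guaranteed thread-safe, so a real version would probably want one client per thread):

```python
from concurrent.futures import ThreadPoolExecutor

def definitions_for_files(file_ids, max_workers=16):
    """Run one GSI query per file id concurrently and flatten the results."""
    all_items = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for items in pool.map(definitions_for_file, file_ids):
            all_items.extend(items)
    return all_items
```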

This is part of an indexing and analysis process that is expected to take a bit of time, so it's OK that it isn't instant, but I'd like it to be as efficient as possible...

Scans are the most efficient approach if you expect to be looking at a majority of the data in your database. You can retrieve up to 1MB per scan request, and each unit of read capacity lets you read 4KB, so assuming you have enough capacity provisioned, you can retrieve thousands of items in a single request (assuming the items are pretty small).
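
As a rough sketch of what that looks like (boto3, with the same assumed 'Definitions' table name as in the question), a paginated scan with client-side filtering might be:

```python
import boto3

dynamodb = boto3.resource('dynamodb')
definitions = dynamodb.Table('Definitions')  # assumed table name

def scan_definitions(file_ids):
    """Scan the whole table, keeping items whose SourceFile is in file_ids.

    Filtering client-side here: a FilterExpression wouldn't reduce the read
    capacity consumed, and an expression listing 10k file ids would exceed the
    expression size limit anyway.
    """
    wanted = set(file_ids)
    matches = []
    kwargs = {}
    while True:
        resp = definitions.scan(**kwargs)
        matches.extend(item for item in resp['Items']
                       if item.get('SourceFile') in wanted)
        last_key = resp.get('LastEvaluatedKey')
        if not last_key:
            return matches
        kwargs['ExclusiveStartKey'] = last_key
```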

The only alternative I can think of is to add more metadata that lets you index the files and definitions at a higher level - for instance, a library name/id. With that you could create a GSI on the library name/id and query that way.
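
If you add something like a 'LibraryId' attribute to each definition and back it with a GSI (both hypothetical here), a single paginated query per library replaces the thousands of per-file queries:

```python
from boto3.dynamodb.conditions import Key

def definitions_for_library(library_id):
    """Query a hypothetical 'LibraryId-index' GSI, following pagination."""
    items = []
    kwargs = {
        'IndexName': 'LibraryId-index',
        'KeyConditionExpression': Key('LibraryId').eq(library_id),
    }
    while True:
        resp = definitions.query(**kwargs)  # reuses the table handle from above
        items.extend(resp['Items'])
        last_key = resp.get('LastEvaluatedKey')
        if not last_key:
            return items
        kwargs['ExclusiveStartKey'] = last_key
```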

Running thousands of queries is going to be less efficient than scanning, assuming you are storing on the order of tens or hundreds of thousands of items.
