简体   繁体   中英

DynamoDB Query in a tight loop or scan?

Here is my basic data structure (or the relevant portions anyway) in DynamoDB; I have a files table that holds file data and has an id for the file. I also have a 'Definitions' table that holds items defined in the file. Definitions also have an ID (as the primary key) as well as a field called 'SourceFile' that references the file id in order to tie the definition to it's source file.

Most of the time I want to just get the definition by it's id and optionally get the file later which works just fine. However, in some cases I need to get all definitions for a set of files. I can do this with a scan but it's slow and I know that it will get slower as the table grows and isn't recommended. However I'm not sure how to do this with a query.

I can create a GSI that uses the SourceFile field as the primary key and use that to query against. This sounds like an answer (and may be), however I'm not sure. The problem is that some libraries may have 5k or 10k files (maybe more in rare cases). In a GSI I can only query against 1 file ID per query so I would have to throw a new query for each file and I can't imagine it's going to be very efficient to throw 10K queries at DynamoDB...

Is it better to create a tight loop (or multiple threads) and hit it with a ton of queries or to scan the table? Is there another way to do this that I'm not thinking of?

This is during an indexing and analysis process that is expected to take a bit of time so it's ok that it's not instant but I'd like it to be as efficient as possible...

Scans are the most efficient if you expect to be looking for a majority of data in your database. You can retrieve up to 1MB per scan request, and for each unit of capacity available you can read 4KB, so assuming you have enough capacity provisioned, you can retrieve thousands of items in a single request (assuming the items are pretty small).

The only alternative I can think of is to add more metadata that can help you index the files & definitions at a higher level - like, for instance, the library name/id. With that you can create a GSI on library name/id and query that way.

Running thousands of queries is going to less efficient than scanning assuming you are storing on the order of tens/hundreds of thousands of items.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM