DynamoDB data model for urls and associated keywords

I have items in a DynamoDB table. Each item holds a list of keywords for a URL (the URL is the partition key of my table) from which those words were extracted. Now I want to query the table for a single keyword and determine which URL or URLs contain that word.

One way is to loop through every item in the table and, for each one, loop through its list of keywords. Another option is to store each word as the partition key of an item and place the respective URLs against it, but in that case my crawler Lambda would be slowed down.

What do you think? Is there another way to achieve the desired result?

In contrast to data modeling in relational databases, you design your DynamoDB schemas in such a way that reads are very quick and simple at the cost of more (compute-)expensive writes.

What you've done now is design your table so that writes are cheap and reads are expensive: because the keyword is not part of any key, finding it means scanning every item.

In DynamoDB we think in terms of the access patterns that your data model is supposed to serve. In your case that would be getUrlsByKeyword. The easiest solution would be to design your table like this:

keyword (Partition Key) | url (Sort Key)
keyword1                | https://test.example.com
keyword1                | https://test2.example.com
keyword1                | https://test3.example.com
wordkey2                | https://test.example.com
wordkey2                | https://test3.example.com

This lets you run a Query with keyword=<keyword>, which returns all URLs that contain this keyword.
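As a rough illustration, the Query for getUrlsByKeyword could look like the sketch below. It assumes a low-level boto3 DynamoDB client is passed in and that the table is called "keywords"; both the function name and the table name are hypothetical. Expression placeholders are used so the attribute names cannot clash with DynamoDB reserved words.

```python
def get_urls_by_keyword(client, table_name, keyword):
    """Return every URL stored under `keyword`, following pagination."""
    urls = []
    kwargs = {
        "TableName": table_name,
        "KeyConditionExpression": "#kw = :kw",
        # Placeholders keep attribute names out of the expression itself.
        "ExpressionAttributeNames": {"#kw": "keyword"},
        "ExpressionAttributeValues": {":kw": {"S": keyword}},
    }
    while True:
        page = client.query(**kwargs)
        urls.extend(item["url"]["S"] for item in page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return urls
        # A Query returns at most 1 MB per call, so continue from the last key.
        kwargs["ExclusiveStartKey"] = last_key
```

It could be called as get_urls_by_keyword(boto3.client("dynamodb"), "keywords", "keyword1").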

How would you update this table?

There are two cases you need to worry about, assuming you don't delete URLs from your table:

  1. New URL with keywords
  2. Existing URL with keywords

Solving 1) is easy: For each new keyword-url combination you add a record to the table above.
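A minimal sketch of case 1), assuming `table` is a boto3 Table resource for the table above (the function names are made up for illustration). Keywords are de-duplicated first, since a batch write rejects repeated keys within one request.

```python
def build_items(url, keywords):
    """One item per distinct keyword-URL pair, in first-seen order."""
    seen = set()
    items = []
    for keyword in keywords:
        if keyword not in seen:
            seen.add(keyword)
            items.append({"keyword": keyword, "url": url})
    return items


def put_new_url(table, url, keywords):
    """Write all pairs for a new URL; `table` is assumed to be a boto3 Table."""
    # batch_writer() buffers the puts and sends them as BatchWriteItem calls.
    with table.batch_writer() as batch:
        for item in build_items(url, keywords):
            batch.put_item(Item=item)
```

With the resource API this might be invoked as put_new_url(boto3.resource("dynamodb").Table("keywords"), url, keywords), where "keywords" is again a placeholder table name.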

The update case 2) is a bit more annoying, because you need to figure out what already exists before you can change it. That gives us a new access pattern, getKeywordsByUrl, which can't easily be served from the table we've defined so far, so we adjust it.

There is an easy trick we can do: we create an inverted index, meaning a Global Secondary Index that switches the partition and sort key of the base table. The GSI would look like this:

  • Name: GSI1
  • Partition key: url
  • Sort key: keyword
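For reference, a table with this inverted index could be declared roughly as follows. This is a sketch of the parameters for the CreateTable call; the table name "keywords", on-demand billing, and the KEYS_ONLY projection are assumptions chosen for the example, not requirements.

```python
def table_definition(table_name="keywords"):
    """Build CreateTable parameters for the base table plus GSI1."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "keyword", "AttributeType": "S"},
            {"AttributeName": "url", "AttributeType": "S"},
        ],
        # Base table: keyword is the partition key, url the sort key.
        "KeySchema": [
            {"AttributeName": "keyword", "KeyType": "HASH"},
            {"AttributeName": "url", "KeyType": "RANGE"},
        ],
        "GlobalSecondaryIndexes": [
            {
                "IndexName": "GSI1",
                # Inverted index: the same two attributes, roles swapped.
                "KeySchema": [
                    {"AttributeName": "url", "KeyType": "HASH"},
                    {"AttributeName": "keyword", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "KEYS_ONLY"},
            }
        ],
        # Assumption: on-demand capacity, so no throughput settings are needed.
        "BillingMode": "PAY_PER_REQUEST",
    }
```

The resulting dictionary can be passed to a boto3 client as client.create_table(**table_definition()).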

If we view GSI1, we see a table like this:

url (GSI1 Partition Key)  | keyword (GSI1 Sort Key)
https://test.example.com  | keyword1
https://test.example.com  | wordkey2
https://test2.example.com | keyword1
https://test3.example.com | keyword1
https://test3.example.com | wordkey2

Now we can easily fetch the keywords for a given URL using a Query on GSI1 with url=<url>. Based on its result, you can add new keywords to the base table and delete keywords that no longer exist.
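The update flow for case 2) might be sketched like this, assuming a low-level boto3 client and the GSI1 index described above; the function names and the table name passed in are hypothetical. It reads the stored keywords from GSI1, computes the difference against the freshly crawled keywords, and then writes only the changes to the base table.

```python
def diff_keywords(existing, current):
    """Return (keywords to add, keywords to delete) as two sets."""
    existing, current = set(existing), set(current)
    return current - existing, existing - current


def get_keywords_by_url(client, table_name, url):
    """Query GSI1 for all keywords stored for `url`, following pagination."""
    keywords = []
    kwargs = {
        "TableName": table_name,
        "IndexName": "GSI1",
        "KeyConditionExpression": "#u = :u",
        "ExpressionAttributeNames": {"#u": "url"},
        "ExpressionAttributeValues": {":u": {"S": url}},
    }
    while True:
        page = client.query(**kwargs)
        keywords.extend(item["keyword"]["S"] for item in page.get("Items", []))
        if not page.get("LastEvaluatedKey"):
            return keywords
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]


def sync_keywords(client, table_name, url, current_keywords):
    """Bring the stored keywords for `url` in line with `current_keywords`."""
    existing = get_keywords_by_url(client, table_name, url)
    to_add, to_delete = diff_keywords(existing, current_keywords)
    for keyword in sorted(to_add):
        item = {"keyword": {"S": keyword}, "url": {"S": url}}
        client.put_item(TableName=table_name, Item=item)
    for keyword in sorted(to_delete):
        key = {"keyword": {"S": keyword}, "url": {"S": url}}
        client.delete_item(TableName=table_name, Key=key)
    return to_add, to_delete
```

Reads from a Global Secondary Index are eventually consistent, so a sync that runs immediately after a previous write for the same URL may briefly see stale keywords.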
