
Modeling N-to-N with DynamoDB

I'm working on a project that uses DynamoDB for most persistent data. I'm now trying to model a data structure that more closely resembles what one would model in a traditional SQL database, but I'd like to explore whether a good NoSQL design is possible for this kind of data as well. As an example, consider a simple N-to-N relation such as items grouped into categories. In SQL, this might be modeled with a connection table such as

items
-----
item_id (PK)
name

categories
----------
category_id (PK)
name

item_categories
---------------
item_id     (PK)
category_id (PK)

To list all items in a category, one could perform a join such as

SELECT items.name FROM items
  JOIN item_categories ON items.item_id = item_categories.item_id
  WHERE item_categories.category_id = ?

And to list all categories to which an item belongs, the corresponding query could be made:

SELECT categories.name FROM categories
  JOIN item_categories ON categories.category_id = item_categories.category_id
  WHERE item_categories.item_id = ?

Is there any hope of modeling a relation like this reasonably efficiently (not requiring many, perhaps even N, separate operations) with a NoSQL database in general, and DynamoDB in particular, for simple use-cases like the ones above, given that there is no equivalent of JOINs?

Or should I just go for RDS instead?

Things I have considered:

  1. Inline categories as an array within each item. This makes it easy to find the categories of an item, but does not solve getting all items within a category. I would also need to duplicate the needed attributes, such as the category name, within each item, and category updates would be awkward.

  2. Duplicate each item for each category, use category_id as the range key, and add a GSI with the reverse (category_id as hash, item_id as range). De-normalizing is common for NoSQL, but I still have doubts. Possibly split items into items and item_details and only duplicate the most common attributes that are needed in listings etc.

  3. Go for a connection table mapping items to categories and vice versa (see the sketch after this list). Use [item_id, category_id] as the key and [category_id, item_id] as a GSI, to support both queries. Duplicate the most common attributes (name etc.) here. To get all full items for a category I would still need to perform one query followed by N get operations though, which consumes a lot of capacity units. Updates of item or category names would require multiple update operations, but that is not too difficult.
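
A minimal sketch of option 3, to make it concrete (Python with boto3). The table name item_categories, the GSI name category_index and the attribute names are just placeholders I have made up for illustration:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("item_categories")  # hypothetical connection table

def add_item_to_category(item_id, category_id, item_name, category_name):
    # Write one connection row, duplicating the display names so that
    # simple listings can be served from this table alone.
    table.put_item(Item={
        "item_id": item_id,
        "category_id": category_id,
        "item_name": item_name,
        "category_name": category_name,
    })

def categories_of_item(item_id):
    # All categories an item belongs to: a single Query on the base table.
    return table.query(
        KeyConditionExpression=Key("item_id").eq(item_id)
    )["Items"]

def items_in_category(category_id):
    # All items in a category: a single Query on the reverse GSI.
    return table.query(
        IndexName="category_index",  # hypothetical GSI: category_id / item_id
        KeyConditionExpression=Key("category_id").eq(category_id),
    )["Items"]

As long as the duplicated attributes are enough for a listing, both directions are single Query operations; the extra N get operations only come into play when the full item or category records are needed.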

The dilemma I have is that the format of the data itself suits a document database perfectly, while the relations I need fit an SQL database. If possible I'd like to stay with DynamoDB, but obviously not at any cost...

You are already looking in the right direction!

In order to make an informed decision you will need to also consider the cardinality of your data:

Will you be expecting to have just a few (fewer than ten?) categories, or quite a lot (i.e. hundreds, thousands, tens of thousands, etc.)?

How about items per category: do you expect to have many categories with a few items in each, or lots of items in a few categories?

Then, you need to consider the cardinality of the total data set and the frequency of the various types of queries. Will you most often need to retrieve only the items in a single category? Or will you mostly be querying to retrieve items individually, and you just need statistics such as the number of items per category?
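
If a count of items per category is the only statistic you need, one way to keep it inside DynamoDB is an atomic counter on the category record, adjusted whenever an item is added to or removed from a category. A minimal sketch (Python with boto3), assuming a hypothetical categories table keyed on category_id and an illustrative item_count attribute:

import boto3

categories = boto3.resource("dynamodb").Table("categories")  # hypothetical table

def adjust_item_count(category_id, delta):
    # ADD creates the counter on first use and updates it atomically afterwards.
    categories.update_item(
        Key={"category_id": category_id},
        UpdateExpression="ADD item_count :d",
        ExpressionAttributeValues={":d": delta},
    )

Anything richer than simple counters is where a separate store for statistics (see below) starts to make sense.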

Finally, consider the expected growth of your dataset over time. DynamoDB will generally outperform an RDBMS at scale as long as your queries partition well.

Also consider the acceptable latency for each type of query you expect to perform, especially at scale. For instance, if you expect to have hundreds of categories with hundreds of thousands of items each, what does it mean to retrieve all items in a category? Surely you wouldn't be displaying them all to the user at once.
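
In practice "all items in a category" usually means one page at a time, which DynamoDB supports through Limit together with LastEvaluatedKey / ExclusiveStartKey. A minimal sketch (Python with boto3), reusing the hypothetical connection table and reverse GSI sketched in the question; the names are illustrative:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("item_categories")  # hypothetical table

def page_of_items(category_id, page_size=25, start_key=None):
    # Fetch one page of items for a category; pass the returned
    # LastEvaluatedKey back in as start_key to get the next page.
    kwargs = {
        "IndexName": "category_index",  # hypothetical reverse GSI
        "KeyConditionExpression": Key("category_id").eq(category_id),
        "Limit": page_size,
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = table.query(**kwargs)
    return resp["Items"], resp.get("LastEvaluatedKey")

Each call reads at most one page worth of data, so both latency and consumed capacity stay bounded no matter how large the category grows.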

I encourage you to also consider another type of data store to accompany DynamoDB if you need statistics for your data, such as ElasticSearch or a Redis cluster.

In the end, if aggregate queries or joins are essential to your use case, or if the dataset at scale can generally be processed comfortably on a single RDBMS instance, don't try to fit a square peg in a round hole. A managed RDBMS solution like Aurora might be a better fit.
