
Correct way of structuring data in MongoDB

I am having trouble understanding the right way to store data in MongoDB. I have read a lot of links but could not arrive at a solid conclusion. I am used to the RDBMS style, and what I have in hand is relational data with MongoDB as the database. On to the problem: say I have a book collection that could hold around 2 million books. There is also something called a subscription (e.g. premium, standard, etc.). Each of the 2 million books is part of at least one subscription and could be part of several. There can be up to 200 subscriptions in the system.

This is the point that concerns me: how do I frame my collections here? I tried the following:

1. Create a collection named subscription_book_association where one document corresponds to a subscription, and store all the book ids for that subscription as JSON within the document. The problem here is that if a subscription has more than 0.4 million books, I have to store all of their ids in the same document, and I end up exceeding the 16 MB limit for a document.

2. Create a collection named book_subscription_association where one document corresponds to a book, and store all the subscription ids for each book (as an array) inside the document. In this case, whenever I do a write operation on my data (e.g. assign/unassign a few books to a subscription), I have to update the subscription array using the $push/$pull operators. This seems to take too long (say 3-4 minutes).

Eg:

Subscription

{
        "_id" : "Standard",
        "description" : "Standard Subscription",
        "status" : "Active"
}

Book

{
        "_id" : "",
        "name" : "Java for beginners",
        "code" : "TECH",
        "vendor" : "XX Publications",
        "Author" : "AAA",
        "Year" : "2010"
}

book_subscription_association

{
        "_id" : "",        
        "code" : "TECH",        
        "displayName" : "TECH/Java for beginners",
        "name" : "Java for beginners",
        "permission" : [
                "Standard:R",
                "Guest:R",
                "Premium:RW"
        ],
        "roles" : [
                "Standard",
                "Premium",
                "Guest"
        ]
}

Query to update

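// Arguments: filter, update, upsert = false, multi = true.
// The empty filter { } matches every document in the collection.
db.book_subscription_association.update( { }, { $pull: { roles: "Guest" } }, false, true)
db.book_subscription_association.update( { }, { $push: { roles: "Guest" } }, false, true)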

3. Create a collection named book_subscription_mapping (like a mapping table in an RDBMS) where I store one document per book per applicable subscription. In this case the number of documents in the collection is huge: in the worst case (2 million x 200) = 400 million documents. This takes up a lot of storage, and the update/read queries are not very efficient either.

The approach you take should be based on the types of queries that you expect to have more frequently.

For example, if you expect more queries asking what are the available books in a subscription, you should include in your subscription document a list containing the details that you expect to show the user (id, title, etc).

If on the other hand you expect more queries asking what subscriptions include a certain book, then you should include all details needed for the subscriptions in that book document.

Practically, in your case, the choice between approach 1 or approach 2 is strictly based on how you expect your queries to take place.
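To make the two directions concrete, here is a minimal sketch in plain JavaScript (no database needed). The embedded field names `books` and `subscriptions` and the id `book-1` are assumptions made up for this illustration; they are not part of the schema in the question.

```javascript
// Option A: the subscription document embeds the book details it has to display.
// `books` is a hypothetical field name.
const subscriptionDoc = {
  _id: "Standard",
  description: "Standard Subscription",
  status: "Active",
  books: [{ _id: "book-1", name: "Java for beginners" }],
};

// Option B: the book document embeds the subscription details it needs.
// `subscriptions` is a hypothetical field name.
const bookDoc = {
  _id: "book-1",
  name: "Java for beginners",
  code: "TECH",
  subscriptions: [
    { _id: "Standard", status: "Active" },
    { _id: "Premium", status: "Active" },
  ],
};

// "Which books are in this subscription?" reads one document under option A.
function booksInSubscription(subscription) {
  return subscription.books.map((book) => book.name);
}

// "Which subscriptions include this book?" reads one document under option B.
function subscriptionsOfBook(book) {
  return book.subscriptions.map((subscription) => subscription._id);
}
```

Whichever question is asked more often decides which of the two shapes is worth paying for on writes.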

Regarding your concern about storing ids in approach 1, you can invert the approach when the list of books for a subscription gets very large: store, in a separate field, only the ids of the books that are NOT included in that subscription. Depending on your expected subscription coverage, this might actually be effective as a general pattern.
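A rough sketch of that inverted storage, in plain JavaScript so it can be run without a database. The field names `mode` and `bookIds` and both helper functions are hypothetical, chosen only for this example.

```javascript
// Build the association document for one subscription, storing whichever
// id list is shorter: the included books or the excluded ones.
// `mode` and `bookIds` are hypothetical field names.
function buildAssociation(subscriptionId, includedIds, allIds) {
  const included = new Set(includedIds);
  const excludedIds = allIds.filter((id) => !included.has(id));
  if (excludedIds.length < included.size) {
    return { _id: subscriptionId, mode: "exclude", bookIds: excludedIds };
  }
  return { _id: subscriptionId, mode: "include", bookIds: [...included] };
}

// Membership check that works for both storage modes.
function subscriptionHasBook(association, bookId) {
  const listed = association.bookIds.includes(bookId);
  return association.mode === "include" ? listed : !listed;
}
```

Two caveats apply. In "exclude" mode, any book added to the catalogue later is implicitly part of the subscription until its id is added to the list. And taking the question's figure of roughly 0.4 million ids per document, this only helps for subscriptions that cover either fewer than about 0.4 million or more than about 1.6 million of the 2 million books; anything in between still does not fit.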

If this reverse approach does not work (you still have too many books in each subscription), then your best course of action is to follow approach 2 and index the array holding the list of subscriptions. The update commands that you showed in the post use an empty filter, so they touch the whole collection (2 million documents); it is natural that they take a while.
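As a sketch of what a narrower update could look like, the helpers below build filter and update documents that touch only the books being assigned or unassigned. The helper names are hypothetical, and `$addToSet` is swapped in for the `$push` used in the question so that a repeated assignment does not create duplicate array entries.

```javascript
// Build the filter/update pair that assigns a subscription to specific books.
// $addToSet (used here instead of $push) skips books that already have it.
function assignOp(bookIds, subscription) {
  return {
    filter: { _id: { $in: bookIds } },
    update: { $addToSet: { roles: subscription } },
  };
}

// Build the filter/update pair that unassigns a subscription from specific books.
// Filtering on `roles` as well lets an index on that array narrow the match.
function unassignOp(bookIds, subscription) {
  return {
    filter: { _id: { $in: bookIds }, roles: subscription },
    update: { $pull: { roles: subscription } },
  };
}

// In the mongo shell these would be used roughly like this:
//   db.book_subscription_association.createIndex({ roles: 1 })  // multikey index on the array
//   const op = assignOp(["id1", "id2"], "Guest")
//   db.book_subscription_association.updateMany(op.filter, op.update)
```

With a handful of ids in the filter, the server only has to rewrite those documents instead of all 2 million.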

For more info on how to denormalize tables, MongoDB has a nice series of blog posts on the topic.

Denormalization is the first thing to keep in mind when you model your collection documents. You can keep both the book data and the subscription data in a single collection. Keeping all the data that a query (or a sequence of queries) needs together in one place generally performs better, because the query does not have to look anything up elsewhere.

Refer to the link below for effective model design.

Ref: Updating large number of records in a collection
