
MongoDB Schema Design for language database

I need some advice on MongoDB schema design for a natural language database.

For each language, I need to store texts and words, like this:

lang: {
    _id: "English",
    texts : [
        {   text : "This is a first text",
            date : ISODate("2011-09-19T04:00:10.112Z"),
            tag : "test1"
        },
        {   text : "Second One",
            date : ISODate("2011-09-19T04:00:10.112Z"),
            tag : "test2"
        }
    ],
    words : [
        { word : "This" },
        { word : "is" },
        { word : "a" },
        { word : "first" },
        { word : "text" },
        { word : "second" },
        { word : "one" }
    ]
}

I then need to know which words and texts each user has associated. The number of words and texts tends to be huge, and I need to list all words for a language as well as all words a user has associated for that language.

From my perspective, storing the user_ids associated with a given word in an array on that word may be a good approach, like:

lang: {
    _id: "English",
    texts : [
        ...
    ],
    words : [
        {
            word : "This",
            users : ["user1", "user2", "user3"]
        },
        {
            word : "is",
            users : ["user1", "user2"]
        },
        ...
    ]
}

Keeping in mind that a word can be associated with hundreds of thousands of users, that the document size limit (as I read) is 4MB, and that I need to:

  1. List all words for a given user and language

Is this a good approach? Or can you think of a better one?

I hope this question is clear enough and that someone can help me with this ;)

Thank you all!

I don't think this is a good approach, for just the reason you mention: the document size limit. It looks like with your approach, you are definitely going to run up against the limit. I would go for a flatter approach (which should also make your collection easier to query). Something like this:

[
    {
        user: "user1",
        word: "This",
        lang: "en"
    },
    {
        user: "user1",
        word: "is",
        lang: "en"
    },
    // et cetera...
]

In other words, grow vertically by adding documents rather than horizontally by adding more data to one document. You can query the words for a given user with db.collection.find( { user: "user1", lang: "en" } ), where collection stands for whatever you name this collection.
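As a rough illustration of the flat layout, here is a minimal in-memory sketch in plain JavaScript. The array `userWords` and the helpers `wordsForUser` and `wordsForLanguage` are hypothetical stand-ins for the collection and for its `find` and `distinct` queries; they are not part of any MongoDB driver. On a real server, a compound index on `{ user: 1, lang: 1 }` (created with `createIndex`) would probably be what keeps the per-user query cheap.

```javascript
// In-memory stand-in for the flat collection: one document per (user, word, lang).
const userWords = [
  { user: "user1", word: "This", lang: "en" },
  { user: "user1", word: "is", lang: "en" },
  { user: "user2", word: "This", lang: "en" },
  { user: "user1", word: "Hola", lang: "es" },
];

// Mirrors find({ user, lang }) followed by picking out the word field.
function wordsForUser(docs, user, lang) {
  return docs
    .filter((d) => d.user === user && d.lang === lang)
    .map((d) => d.word);
}

// Mirrors distinct("word", { lang }): every word known for a language,
// regardless of which user it is associated with.
function wordsForLanguage(docs, lang) {
  const words = docs.filter((d) => d.lang === lang).map((d) => d.word);
  return [...new Set(words)];
}

console.log(wordsForUser(userWords, "user1", "en")); // [ 'This', 'is' ]
console.log(wordsForLanguage(userWords, "en"));      // [ 'This', 'is' ]
```

Both requirements from the question (all words for a language, all words for a user in a language) become simple filters on the same collection, and no single document grows as more users are added.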

This approach isn't "normalized", of course, so if you're concerned about space then you might want to create separate collections for users, words, and languages and reference them in the main collection by ID. But since a plain find() cannot join collections in MongoDB, every reference costs an extra query, and you have to weigh query performance against space efficiency.
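To make that trade-off concrete, here is a small in-memory sketch of the referenced variant. The collections `users`, `words`, and `userWordLinks`, their field names, and the numeric ids are all assumptions made up for illustration. The point is only that the application has to do the "join" itself, in several steps.

```javascript
// Hypothetical reference collections, keyed by small ids.
const users = [
  { _id: 1, name: "user1" },
  { _id: 2, name: "user2" },
];
const words = [
  { _id: 10, word: "This", lang: "en" },
  { _id: 11, word: "is", lang: "en" },
];
// The main collection stores only ids, so each row is small.
const userWordLinks = [
  { userId: 1, wordId: 10 },
  { userId: 1, wordId: 11 },
  { userId: 2, wordId: 10 },
];

function wordsForUserNormalized(userName, lang) {
  // Step 1: resolve the user name to an id (one query on a real server).
  const user = users.find((u) => u.name === userName);
  if (!user) return [];

  // Step 2: fetch the ids of the words linked to that user (second query).
  const wordIds = new Set(
    userWordLinks.filter((l) => l.userId === user._id).map((l) => l.wordId)
  );

  // Step 3: fetch the word documents themselves, similar to a query
  // with { _id: { $in: [...] }, lang } (third query).
  return words
    .filter((w) => wordIds.has(w._id) && w.lang === lang)
    .map((w) => w.word);
}

console.log(wordsForUserNormalized("user1", "en")); // [ 'This', 'is' ]
```

Three round trips replace the single query of the flat design, which is the performance cost being weighed against the saved space.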

dbaseman is correct (and upvoted), but a couple of other points:

First, the document limit is now 16MB (see Max Document Size), as of this writing, assuming you are running a recent version of MongoDB.
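For a sense of how quickly an embedded users array approaches that limit, here is a back-of-the-envelope sketch. It assumes each user reference is stored as a 12-byte ObjectId (the question does not say what type the user ids are) and uses the BSON array encoding, where every element costs one type byte plus its index written as a null-terminated string key. The function names are made up for this example.

```javascript
const MAX_DOC_BYTES = 16 * 1024 * 1024; // the 16MB document limit

// Size in bytes of a BSON array holding n ObjectId values (assumed id type).
function bsonObjectIdArraySize(n) {
  let size = 4 + 1; // int32 length prefix + trailing 0x00 terminator
  for (let i = 0; i < n; i++) {
    const keyBytes = String(i).length + 1; // index as a string key + null byte
    size += 1 + keyBytes + 12;             // type byte + key + 12-byte ObjectId
  }
  return size;
}

// Largest n for which the array alone still fits in limitBytes.
function maxObjectIdsWithin(limitBytes) {
  let size = 5;
  let n = 0;
  while (true) {
    const next = size + 1 + String(n).length + 1 + 12;
    if (next > limitBytes) return n;
    size = next;
    n++;
  }
}

console.log(bsonObjectIdArraySize(1000000) > MAX_DOC_BYTES); // true
console.log(maxObjectIdsWithin(MAX_DOC_BYTES));
```

Under these assumptions a single word with a million users already exceeds the limit on its own, before counting the texts and the other words stored in the same language document.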

Second, unbounded growth is generally a bad idea in MongoDB. This type of document size expansion can force MongoDB to move the document if it exceeds the space currently allocated to it. You can read more about this in the Padding Factor section of the documentation.

Those types of moves are relatively expensive, especially if they happen frequently. Therefore, if you do go with this type of design, limit the size of the embedded array in your main collection (essentially bounding that growth), for example to the most recent X or most popular X entries. You could perhaps even pre-populate that field (essentially manual padding) to beyond its average size. Both measures will reduce the moves caused by additions and changes.

This is the reason why tip #6 in the O'Reilly book 50 Tips and Tricks for MongoDB Developers is:

Tip #6: Do not embed fields that have unbound growth
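One way the "most recent X" bounding mentioned above might be expressed is MongoDB's `$push` update with the `$each` and `$slice` modifiers, where a negative `$slice` keeps only the last N elements. The sketch below builds such an update document and simulates its effect on a plain array. The helper names and the `recentUsers` field are hypothetical, and the full user list would still need to live somewhere else, such as the flat collection.

```javascript
// Builds an update document that appends values and keeps only the last maxLen.
function cappedPushUpdate(field, values, maxLen) {
  if (!Number.isInteger(maxLen) || maxLen < 1) {
    throw new RangeError("maxLen must be a positive integer");
  }
  return { $push: { [field]: { $each: values, $slice: -maxLen } } };
}

// In-memory simulation of what that update does to the array field.
function applyCappedPush(existing, values, maxLen) {
  if (!Number.isInteger(maxLen) || maxLen < 1) {
    throw new RangeError("maxLen must be a positive integer");
  }
  return existing.concat(values).slice(-maxLen);
}

// Hypothetical field: the three most recent users of a word.
console.log(applyCappedPush(["u1", "u2"], ["u3", "u4"], 3)); // [ 'u2', 'u3', 'u4' ]
console.log(JSON.stringify(cappedPushUpdate("recentUsers", ["u5"], 3)));
```

Because the array can never exceed `maxLen` entries, the document size stays bounded no matter how many users are associated with the word.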
