
Mongodb: big data structure

I'm rebuilding my website, which is a search engine for nicknames from the most active forum in France: you search for a nickname and you get all of its messages.

My current database contains more than 60 GB of data, stored in a MySQL database. I'm now rewriting it into a MongoDB database, and after migrating 1 million messages (1 message = 1 document), find() started to take a while.

The structure of a document is as follows:

{
  "_id" : ObjectId(),
  "message": "<p>Hai guys</p>",
  "pseudo" : "mahnickname", //from a nickname (*pseudo* in my db)
  "ancre" : "774497928", //its id in the forum
  "datepost" : "30/11/2015 20:57:44"
}

I set the ancre field as unique, so I don't get the same entry twice.
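For reference, a minimal sketch of how that unique constraint could look in a Mongoose schema; the schema and model names here are assumptions, not taken from the original code:

var mongoose = require('mongoose');

// Hypothetical schema mirroring the document structure above.
var messageSchema = new mongoose.Schema({
    message:  String,
    pseudo:   String,
    ancre:    { type: String, unique: true }, // unique index: no duplicate forum ids
    datepost: String
});

var Model = mongoose.model('Message', messageSchema);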

Then the user enters a nickname and the site finds all documents that have that nickname.

Here is the query:

Model.find({pseudo: "danickname"}).sort('-datepost').skip((r_page - 1) * 20).limit(20)
     .exec(function(err, bears) { /* handle results */ });

Should I structure it differently? Instead of having one document per message, should I have one document per nickname and update that document each time I get a new message from that nickname?

I was using the first approach with MySQL and it wasn't taking this long.

Edit: Or maybe should I just index the nicknames (pseudo)?

Thanks!

Here are some recommendations for your big-data problem:

  1. The ObjectId already contains a timestamp, and you can sort on it. You could save some disk space by removing the datepost field.
  2. Do you absolutely need the ancre field? The ObjectId is already unique and indexed. If you absolutely need it, and need to keep datepost separate too, you could replace the _id field with your ancre value.
  3. As many have mentioned, you should add an index on pseudo. This will make the "get all messages where the pseudo is mahnickname" search much faster (see the shell sketch after this list).
  4. If the number of messages per user is low, you could store all of them inside a single document per user (a sketch of this alternative also follows the list). This would avoid having to skip to a specific page, which can be slow. However, be aware of the 16 MB document size limit. I would personally still keep them in multiple documents.
  5. To keep queries fast, ensure that all your indexed fields fit in RAM. You can see the RAM consumption of the indexes by running db.collection.stats() and looking at the indexSizes sub-document.
  6. Would there be a way for you not to skip documents, but instead to use the time a message was written to the database as your pages? If so, use the datepost field or the timestamp in _id for your paging strategy. If you decide on datepost, create a compound index on pseudo and datepost (the paging sketch below shows this approach).
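To make points 3 and 6 concrete, here is a minimal mongo-shell sketch; the collection name messages and the page size of 20 are assumptions:

// Index for "all messages of one pseudo" lookups (point 3).
db.messages.createIndex({ pseudo: 1 })

// Compound index supporting "per pseudo, sorted by date" (point 6).
// Note: datepost is stored as a "DD/MM/YYYY HH:MM:SS" string, which does
// not sort chronologically; a real Date type would be needed for that.
db.messages.createIndex({ pseudo: 1, datepost: 1 })

// Range-based paging instead of skip(): remember the last _id of the
// previous page and continue from there. The leading bytes of an ObjectId
// are its creation timestamp, so sorting on _id also sorts by insertion
// time (point 1).
// lastSeenId is a hypothetical variable holding the _id of the last
// message shown on the previous page.
db.messages.find({ pseudo: "mahnickname", _id: { $lt: lastSeenId } })
           .sort({ _id: -1 })
           .limit(20)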
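And a sketch of the one-document-per-user alternative from point 4, appending each new message with an upsert; the users collection name and the embedded field layout are assumptions:

// One document per pseudo; each new message is pushed onto an array.
// A single document cannot grow past the 16 MB limit mentioned above.
db.users.update(
    { pseudo: "mahnickname" },
    { $push: { messages: { message: "<p>Hai guys</p>",
                           ancre: "774497928",
                           datepost: "30/11/2015 20:57:44" } } },
    { upsert: true }
)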

As for your benchmarks, you can closely monitor MongoDB using mongotop and mongostat.
