How do I efficiently store and query a raw JSON stream in MongoDB?

I would like to store a raw JSON stream (from either Twitter or the NYTimes) efficiently in MongoDB, so that I can later index the data (NYTimes articles, or Tweets/usernames) with either Lucene or Hadoop. What's the smartest way of storing data in Mongo? Should I just pipe in the JSON, or is there something better? I am only using a single machine for MongoDB, with 3 replica sets.

Is there an efficient (smart) way of writing queries, or of storing my data, to better optimize the search queries?

"Is there an efficient (smart) way of writing queries, or storing my data to better-optimize the search-queries?"

This depends entirely on what kind of queries you need to make and what the usage pattern of your application will be. It would be pretty simple to store each tweet in a Mongo document containing sender, timestamp, text, etc. Depending on what queries you need to make, you will need to create indexes on these fields (more info: http://www.mongodb.org/display/DOCS/Indexes ).
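As a rough sketch of the document-per-tweet idea, the Python below flattens one raw JSON line into a document with sender, timestamp and text, and keeps the untouched payload alongside for later Lucene or Hadoop jobs. The input field names (user.screen_name, created_at, text) and the timestamp format are assumptions modeled on Twitter's classic JSON layout, and store_tweets is a hypothetical helper that expects a pymongo-style collection object; adjust both to whatever your stream actually delivers.

```python
import json
from datetime import datetime

# Assumed timestamp layout, e.g. "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweet_to_document(raw_line):
    """Turn one raw JSON line into a document with queryable top-level fields."""
    tweet = json.loads(raw_line)
    return {
        "sender": tweet["user"]["screen_name"],  # assumed field name
        "timestamp": datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT),
        "text": tweet["text"],
        "raw": tweet,  # original payload, kept for later reprocessing
    }


def store_tweets(collection, raw_lines):
    """Insert parsed tweets and index the fields we expect to query on.

    `collection` is assumed to behave like a pymongo Collection.
    """
    docs = [tweet_to_document(line) for line in raw_lines if line.strip()]
    if docs:
        collection.insert_many(docs)
    # Compound index for "tweets by sender, newest first"; this is only an
    # example, so pick the keys that match your real queries.
    collection.create_index([("sender", 1), ("timestamp", -1)])
    return len(docs)


# Usage, assuming pymongo is installed and a server is running locally
# (the database and collection names are made up):
#   from pymongo import MongoClient
#   tweets = MongoClient()["stream"]["tweets"]
#   with open("tweets.jsonl") as stream:
#       store_tweets(tweets, stream)
```

Promoting the fields you query on to the top level keeps the indexes simple, while the raw copy means nothing is lost if you later decide you need another field.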

For full-text search, you could tokenize/parse/stem the text of the tweets and store an array of tokens with each tweet, which you can then index to make queries on it fast. If you need more powerful full-text search features, you could also index the tweets with Lucene and store the objectId in each Lucene document, but this introduces the complexity of essentially having 2 data stores.
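To make the token-array idea concrete, here is a minimal sketch. The tokenizer is deliberately naive (lowercasing, a regex split, and a tiny made-up stop-word list, with no real stemming), and the names tokenize, with_tokens and token_query are hypothetical. An index on an array field in MongoDB covers each element, so a query using $all on the tokens field can be served by that index.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "in", "of", "to"})
# Keep letters, digits, and the characters used in #hashtags and @mentions.
TOKEN_RE = re.compile(r"[a-z0-9#@_]+")


def tokenize(text):
    """Lowercase, split, drop stop words, and de-duplicate while keeping order."""
    tokens = []
    for token in TOKEN_RE.findall(text.lower()):
        if token not in STOP_WORDS and token not in tokens:
            tokens.append(token)
    return tokens


def with_tokens(doc):
    """Return a copy of a tweet document with a `tokens` array added."""
    return {**doc, "tokens": tokenize(doc["text"])}


def token_query(search_text):
    """Build a filter matching documents that contain every search token."""
    return {"tokens": {"$all": tokenize(search_text)}}


# Usage with a pymongo-style collection (assumed, not shown here):
#   collection.create_index("tokens")
#   collection.insert_one(with_tokens({"text": "Storing the JSON stream"}))
#   collection.find(token_query("json stream"))
```

Running the search text through the same tokenize function as the stored text matters: if the two sides are normalized differently, matching tweets will be missed. A proper stemmer could be swapped into tokenize without changing the rest.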

Again, there's really no right answer here without knowing the details of the use case.
