
How to handle a large MongoDB collection

We have a collection that is potentially going to be very large. This collection is used to store bill-related data, so it is often used for reporting/analytics purposes.

Please let me know the best approach to handle this large collection:

1) Can I split off and archive the old data (say, older than 12 months)? But the old data is still required for analytics reports: I want to query it to show a sales comparison for the past 2 years.
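A minimal sketch of how option 1 could be expressed as a query filter, assuming a `billDate` field (the field name and the 365-day cutoff are my own assumptions, not from the question):

```python
from datetime import datetime, timedelta, timezone

# Anything older than roughly 12 months becomes an archiving candidate.
cutoff = datetime.now(timezone.utc) - timedelta(days=365)

# Filter selecting archivable documents; 'billDate' is an assumed field name.
archive_filter = {"billDate": {"$lt": cutoff}}

# With pymongo, this same filter could drive a copy-then-delete pass, roughly:
#   for doc in db.bills.find(archive_filter):
#       db.bills_archive.insert_one(doc)
#   db.bills.delete_many(archive_filter)
print(archive_filter)
```

The key point is that the archive boundary is just a date filter, so the same filter can be reused both to move documents out and to pull them back in for a two-year comparison report.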

2) Can I move the old data (12 months' worth) into a new collection? That means every 12 months I'd have to create a new collection, and for report generation I'd have to query across all of these collections. Will this cause performance problems?
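One way to sketch option 2 is a naming convention plus a small helper that fans a report's date range out to the collections involved (collection names like `bills_2023` are an assumption for illustration):

```python
from datetime import date

def collection_for(d: date) -> str:
    """Map a bill date to its per-year collection name, e.g. 'bills_2023'."""
    return f"bills_{d.year}"

def collections_for_range(start: date, end: date) -> list[str]:
    """All yearly collections a report over [start, end] needs to touch."""
    return [f"bills_{year}" for year in range(start.year, end.year + 1)]

print(collection_for(date(2023, 5, 1)))   # bills_2023
print(collections_for_range(date(2022, 1, 1), date(2024, 3, 31)))
# ['bills_2022', 'bills_2023', 'bills_2024']
```

The cost of this layout shows up exactly here: a multi-year report must issue one query per collection and merge the results in the application, whereas a single collection (or pre-built summaries) answers it with one query.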

3) Can I go for sharding?

There are many variables to account for, the clearest being what hardware you use, how the data is structured, and how it is queried. A distributed network ought to be able to chew through your data faster than a single machine, but before diving into that solution I recommend generating an absurd amount of mock data, comparable to what you are expecting, and then testing various approaches. Seriously. Create a bunch of data, and try to break things. It's fun! Soon enough you'll know more about what your problem requires than any website could tell you.

As for direct responses:

  1. Perhaps, before archiving the data, appropriate stats summaries can be generated (or updated). Those summaries/simplifications can be used for sale comparisons without reloading all of the archived data they represent.
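To make the summary idea concrete, here is a pure-Python sketch of a per-month rollup over mock bill documents; the field names (`billDate`, `amount`) are assumptions, and the comment shows the equivalent server-side `$group` stage:

```python
from collections import defaultdict

# Mock bill documents; field names ('billDate', 'amount') are assumptions.
bills = [
    {"billDate": "2023-01-15", "amount": 100.0},
    {"billDate": "2023-01-20", "amount": 50.0},
    {"billDate": "2023-02-02", "amount": 75.0},
]

# Roll the raw bills up into per-month totals before archiving them.
summary = defaultdict(lambda: {"total": 0.0, "count": 0})
for bill in bills:
    month = bill["billDate"][:7]  # e.g. '2023-01'
    summary[month]["total"] += bill["amount"]
    summary[month]["count"] += 1

print(dict(summary))
# {'2023-01': {'total': 150.0, 'count': 2}, '2023-02': {'total': 75.0, 'count': 1}}

# The same rollup done server-side would be a $group aggregation, roughly:
# db.bills.aggregate([{"$group": {
#     "_id": {"$substr": ["$billDate", 0, 7]},
#     "total": {"$sum": "$amount"},
#     "count": {"$sum": 1}}}])
```

A two-year sales comparison then reads at most 24 small summary documents instead of millions of raw bills.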

  2. This strikes me as sensible. By splitting up the sales data, you have more control over how much data needs to be accessed. After all, a user won't always wish to see 3 years of data; they may only wish to see last week's.

  3. Move to sharding when you actually need it. As is stated on the MongoDB site:

    Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.

You'll know it's time when your memory-map approaches the server's RAM limit. MongoDB supports reading and writing to databases too large to keep in memory, but I'm sure you already know that is SLOW.
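When that point does arrive, the switch is mostly a matter of choosing a shard key and issuing one admin command. A hedged sketch of the command document (the `billing.bills` namespace and the hashed `customerId` shard key are assumptions, not recommendations from the answer):

```python
# The command document that would be sent to a mongos to shard the
# 'billing.bills' collection; namespace and shard key are assumptions.
shard_command = {
    "shardCollection": "billing.bills",
    "key": {"customerId": "hashed"},
}

# Against a live cluster, with pymongo, this would be roughly:
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://mongos-host:27017")
#   client.admin.command(shard_command)
print(shard_command)
```

The shard key deserves the same mock-data testing as everything else above: a poor key (e.g. a monotonically increasing date) funnels all inserts to one shard, which defeats the purpose.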

