How to efficiently query MongoDB for documents when I know that 95% are not used

Original post: 2020-07-26 08:54:13 | Tags: mongodb, query-performance

I have a collection of ~500M documents. Every time I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever the document is returned from a query. After a few months of running the system in production, I discover that the counter is greater than 0 (zero) for only 5% of the documents. In other words, 95% of the documents are never used.

My question is: is there an efficient way to arrange these documents to speed up query execution, given that 95% of the documents are not used?

What is the best practice in this case?

If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?

1 Answer

About "~500M documents": that is quite a solid figure, good job if that's true. Here is how I see the solution to the problem:

If you are willing to rewrite/refactor the app and rebuild its database, you could use the versioning pattern.

What does it look like?

Imagine you have two collections (or even two databases, if you are using a microservice architecture):

Relevant docs / Irrelevant docs.

Basically, you run find only on the Relevant docs collection (which stores the 5% of documents that are actually used), and if nothing is found there, you fall back to Irrelevant.find(). This pattern also lets you keep old/historical data and manage it via a TTL index or a capped collection.
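The two-collection lookup described above could be sketched roughly as follows. This is only an illustration, not the answer author's code: InMemoryCollection is a hypothetical stand-in so the example runs without a server, find_with_fallback is an invented helper name, and the promote option (moving a document into the small collection once it is used) is an assumption that the original answer does not spell out. A real PyMongo collection offers find_one, insert_one and delete_one with the same basic shape.

```python
from typing import Optional


class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection (no server needed)."""

    def __init__(self, docs=None):
        self.docs = list(docs or [])

    def find_one(self, query: dict) -> Optional[dict]:
        # Equality match on every key in the query, like a simple filter.
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return doc
        return None

    def insert_one(self, doc: dict) -> None:
        self.docs.append(doc)

    def delete_one(self, query: dict) -> None:
        doc = self.find_one(query)
        if doc is not None:
            self.docs.remove(doc)


def find_with_fallback(relevant, irrelevant, query: dict, promote: bool = False):
    """Look in the small 'relevant' collection first, then in the large one."""
    doc = relevant.find_one(query)
    if doc is not None:
        return doc
    doc = irrelevant.find_one(query)
    if doc is not None and promote:
        # Assumption: move the document so the next lookup hits the small collection.
        relevant.insert_one(doc)
        irrelevant.delete_one({"_id": doc["_id"]})
    return doc


relevant = InMemoryCollection([{"_id": 1, "name": "hot"}])
irrelevant = InMemoryCollection([{"_id": 2, "name": "cold"}])

# First lookup misses the small collection, falls back, and promotes the document.
hit = find_with_fallback(relevant, irrelevant, {"name": "cold"}, promote=True)
```

The point of the design is that most queries only touch the small collection and its much smaller indexes; the fallback costs a second lookup only for the rare miss.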

You could also add some Redis magic to it (which uses precisely the same logic). Take a look:

This article can also be helpful (as can many others, like this SO question).

But don't try to replace Mongo with Redis; team them up instead.
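One common way to team the two up is a cache-aside lookup: check the cache first and query MongoDB only on a miss. The sketch below is an assumption about what that could look like, not something taken from the linked material. DictCache and CountingCollection are hypothetical stand-ins so the code runs without any server, and cached_find and the "doc:" key prefix are invented names. The get and set(..., ex=seconds) calls mirror the shape of the redis-py client.

```python
import json
from typing import Optional


class DictCache:
    """Hypothetical stand-in exposing get/set, so no Redis server is needed."""

    def __init__(self):
        self.store = {}

    def get(self, key: str) -> Optional[str]:
        return self.store.get(key)

    def set(self, key: str, value: str, ex: Optional[int] = None) -> None:
        # 'ex' (expiry in seconds) is accepted but ignored by this stand-in.
        self.store[key] = value


class CountingCollection:
    """Hypothetical stand-in collection that counts database hits."""

    def __init__(self, docs):
        self.docs = {doc["_id"]: doc for doc in docs}
        self.calls = 0

    def find_one(self, query: dict) -> Optional[dict]:
        self.calls += 1
        return self.docs.get(query["_id"])


def cached_find(cache, collection, doc_id, ttl_seconds: int = 3600):
    """Cache-aside: serve from the cache, fall back to the database on a miss."""
    key = f"doc:{doc_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    doc = collection.find_one({"_id": doc_id})
    if doc is not None:
        # Only documents that are actually requested ever enter the cache.
        cache.set(key, json.dumps(doc), ex=ttl_seconds)
    return doc


cache = DictCache()
collection = CountingCollection([{"_id": 7, "title": "popular"}])

first = cached_find(cache, collection, 7)   # miss: reads the database
second = cached_find(cache, collection, 7)  # hit: served from the cache
```

Note that documents containing ObjectId or datetime values are not directly JSON-serializable, so a real setup would need a different encoding step.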

Using indexes and .explain()

If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?

Yes, it will deal with your problem. To see this, download MongoDB Compass, create the boolean field in your schema (don't forget to add a default value), index the field, and then run some queries through the Explain module. But don't forget about compound indexes! If you create an index on a single field, measure the performance by querying only that one field.

The result should look like this:

If your index is used (and actually speeds things up), Compass will show it.
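The same check can be made on raw explain output. The sketch below walks the winning plan and reports whether any stage is an index scan (IXSCAN) rather than a full collection scan (COLLSCAN). The helper names plan_stages and uses_index are invented for illustration, and the two sample dictionaries are hand-written in the shape of explain output, not captured from a real server. With PyMongo, such a dictionary would typically come from something like collection.find({"consumed": True}).explain().

```python
def plan_stages(plan):
    """Collect every 'stage' name found anywhere in an explain plan."""
    stages = []
    if isinstance(plan, dict):
        if isinstance(plan.get("stage"), str):
            stages.append(plan["stage"])
        # Recurse into nested stages (inputStage, inputStages, and so on).
        for value in plan.values():
            stages.extend(plan_stages(value))
    elif isinstance(plan, list):
        for item in plan:
            stages.extend(plan_stages(item))
    return stages


def uses_index(explain_output):
    """True if the winning plan contains an index scan stage."""
    winning = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    return "IXSCAN" in plan_stages(winning)


# Hand-written samples shaped like explain output (assumed, not real results).
indexed = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "consumed_1"},
        }
    }
}
unindexed = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}

index_is_used = uses_index(indexed)
index_is_missing = not uses_index(unindexed)
```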

To measure the performance of the queries (with and without indexing), use the Explain tab.

Actually, all of this can be done without Compass, via .explain() and index commands such as createIndex(). But Compass gives a better visual view of the process, so it's better to use it, especially since it is now completely free for everyone.
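For the route without Compass, a before/after comparison could be reduced to a few numbers from the executionStats section of the explain output. The following is a minimal sketch under stated assumptions: summarize and examined_per_returned are invented helper names, and the two sample dictionaries contain made-up numbers purely to show the calculation. In PyMongo the index itself would be created with something like collection.create_index([("consumed", 1)]).

```python
def summarize(explain_output):
    """Pull the headline numbers out of the executionStats section."""
    stats = explain_output.get("executionStats", {})
    return {
        "returned": stats.get("nReturned"),
        "docs_examined": stats.get("totalDocsExamined"),
        "keys_examined": stats.get("totalKeysExamined"),
        "millis": stats.get("executionTimeMillis"),
    }


def examined_per_returned(explain_output):
    """Documents scanned per document returned; close to 1 is ideal."""
    summary = summarize(explain_output)
    if not summary["returned"]:
        return None  # nothing returned (or no stats), ratio is undefined
    return summary["docs_examined"] / summary["returned"]


# Made-up numbers for illustration only, not measurements.
without_index = {
    "executionStats": {
        "nReturned": 5,
        "totalDocsExamined": 1000,
        "totalKeysExamined": 0,
        "executionTimeMillis": 40,
    }
}
with_index = {
    "executionStats": {
        "nReturned": 5,
        "totalDocsExamined": 5,
        "totalKeysExamined": 5,
        "executionTimeMillis": 1,
    }
}

ratio_before = examined_per_returned(without_index)
ratio_after = examined_per_returned(with_index)
```

A large drop in documents examined per document returned is usually a more stable signal than the raw millisecond timing, which varies with caching and server load.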