
MongoDB: does document size affect query performance?

Assume a mobile game that is backed by a MongoDB database containing a User collection with several million documents.

Now assume several dozen properties that must be associated with the user - e.g. an array of _id values of Friend documents, their username, photo, an array of _id values of Game documents, last_login date, count of in-game currency, and so on.

My concern is whether creating and updating large, growing arrays on many millions of User documents will add any 'weight' to each User document and/or slow down the overall system.

We will likely never eclipse 16 MB per document, but we can safely say our documents will be 10-20x larger if we store these growing lists directly.

Question: is this even a problem in MongoDB? Does document size even matter if your queries are properly managed using projection, indexes, etc.? Should we be actively pruning document size, e.g. with references to external lists vs. embedding lists of _id values directly?

In other words: if I want a user's last_login value, will a query that projects/selects only the last_login field be any different if my User documents are 100 KB vs. 5 MB?

Or: if I want to find all users with a specific last_login value, will document size affect that sort of query?
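
For concreteness, the two kinds of query being asked about would look roughly like this (the users collection name, userId and someDate are placeholders, and the index is an assumption):

// project only last_login for a single user, excluding everything else
db.users.find({ _id: userId }, { last_login: 1, _id: 0 });

// find all users with a specific last_login value;
// assumes an index like db.users.createIndex({ last_login: 1 })
db.users.find({ last_login: someDate });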

One way to rephrase the question is to ask whether a query over 1 million documents takes longer if the documents are 16 MB vs 16 KB each.

Correct me if I'm wrong, but in my own experience, the smaller the document size, the faster the query.

I've done queries on 500k documents vs 25k documents, and the 25k query was noticeably faster - anywhere from a few milliseconds to 1-3 seconds faster. In production the time difference is about 2x-10x more.

The one aspect where document size really comes into play is query sorting, in which case document size will affect whether the query runs at all. I've reached this limit numerous times trying to sort as few as 2k documents.

More references, with some solutions, here: https://docs.mongodb.org/manual/reference/limits/#operations and https://docs.mongodb.org/manual/reference/operator/aggregation/sort/#sort-memory-limit
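
As a rough illustration (the collection and field names are made up), the in-memory sort limit can usually be sidestepped by letting an index provide the order, or by letting an aggregation spill to disk:

// let an index provide the sort order instead of sorting in memory
db.events.createIndex({ created_at: -1 });
db.events.find({ type: "purchase" }).sort({ created_at: -1 }).limit(100);

// for aggregation pipelines, allow the sort stage to spill to disk
db.events.aggregate(
  [ { $match: { type: "purchase" } }, { $sort: { created_at: -1 } } ],
  { allowDiskUse: true }
);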

At the end of the day, it's the end user that suffers.

When I attempt to remedy large queries causing unacceptably slow performance, I usually find myself creating a new collection with a subset of the data, and using a lot of query conditions along with a sort and a limit.
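
A rough sketch of that kind of workaround (collection and field names are hypothetical; $out requires MongoDB 2.6+):

// materialize a smaller, query-friendly subset into its own collection
db.users.aggregate([
  { $match: { last_login: { $gte: ISODate("2020-01-01") } } },
  { $project: { last_login: 1, username: 1 } },
  { $out: "active_users" }
]);

// then query the subset with tight conditions, a sort and a limit
db.active_users.find({ last_login: { $gte: ISODate("2020-06-01") } })
  .sort({ last_login: -1 })
  .limit(50);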

Hope this helps!

First of all, you should spend a little time reading up on how MongoDB stores documents, with reference to padding factors and powerof2sizes allocation:

http://docs.mongodb.org/manual/core/storage/ http://docs.mongodb.org/manual/reference/command/collStats/#collStats.paddingFactor

Put simply, MongoDB tries to allocate some additional space when storing your original document to allow for growth. Powerof2sizes allocation became the default approach in version 2.6, where it grows the allocated record size in powers of 2.

Overall, performance will be much better if all updates fit within the original size allocation. The reason is that if they don't, the entire document needs to be moved somewhere else with enough space, causing more reads and writes and in effect fragmenting your storage.
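
A quick way to compare how much space the documents actually use against what has been allocated (and, on the old MMAPv1 engine this answer describes, the padding factor) is collStats; the collection name here is just an example:

// report sizes in KB; look at avgObjSize, size and storageSize, and
// on MMAPv1 the paddingFactor field
db.runCommand({ collStats: "users", scale: 1024 });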

If your documents are really going to grow in size by a factor of 10x to 20x over time, that could mean multiple moves per document, which, depending on your insert, update and read frequency, could cause issues. If that is the case there are a couple of approaches you can consider:

1) Allocate enough space on initial insertion to cover most (let's say 90%) of normal documents' lifetime growth. While this is inefficient in space usage at the beginning, efficiency increases with time as the documents grow, without any performance reduction. In effect you pay ahead of time for storage that you will eventually use, in order to get good performance over time. (Both options are sketched after this list.)

2) Create "overflow" documents - let's say a typical 80-20 rule applies and 80% of your documents will fit in a certain size. Allocate for that amount and add an overflow collection that your document can point to if they have, for example, more than 100 friends or 100 Game documents. The overflow field points to a document in this new collection, and your app only looks in the new collection if the overflow field exists. This allows normal document processing for 80% of the users, and avoids wasting a lot of storage on the 80% of user documents that won't need it, at the expense of additional application complexity.
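
A minimal sketch of both ideas, with made-up field and collection names (the padding trick applies to the MMAPv1 era this answer describes):

// option 1: pre-allocate room for growth by padding the document at
// insert time, then removing the filler field
db.users.insert({ _id: userId, friends: [], padding: "x".repeat(50 * 1024) });
db.users.update({ _id: userId }, { $unset: { padding: "" } });

// option 2: keep the first 100 friend ids embedded and spill the rest
// into an overflow collection the app checks only when has_overflow is set
db.users.update(
  { _id: userId, "friends.99": { $exists: true } },  // already at the cap
  { $set: { has_overflow: true } }
);
db.userOverflow.update(
  { user_id: userId },
  { $push: { friends: newFriendId } },
  { upsert: true }
);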

In either case I'd consider using covered queries by building the appropriate indexes:

A covered query is a query in which:

 all the fields in the query are part of an index, and all the fields returned in the results are in the same index.

Because the index “covers” the query, MongoDB can both match the query conditions and return the results using only the index; MongoDB does not need to look at the documents, only the index, to fulfill the query.

Querying only the index can be much faster than querying documents outside of the index. Index keys are typically smaller than the documents they catalog, and indexes are typically available in RAM or located sequentially on disk.

More on that approach here: http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/
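
Applied to the last_login example from the question, a covered query could be set up roughly like this (the compound index and the users collection name are assumptions):

// index containing every field the query filters on and returns
db.users.createIndex({ last_login: 1, username: 1 });

// covered: the filter and projection use only indexed fields, and _id is
// excluded, so MongoDB never has to fetch the documents themselves
db.users.find(
  { last_login: { $gte: ISODate("2020-01-01") } },
  { last_login: 1, username: 1, _id: 0 }
);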

Just wanted to share my experience when dealing with large documents in MongoDB... don't do it!

We made the mistake of allowing users to include files encoded in base64 (normally images and screenshots) in documents. We ended up with a collection of ~500k documents ranging from 2 MB to 10 MB each.

Doing a simple aggregate on this collection would bring down the cluster!

Aggregation queries can be very heavy in MongoDB, especially with large documents like these. Indexes can only be used by aggregations under certain conditions, and since we needed to $group, indexes were not being used and MongoDB had to scan all the documents.
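
For illustration only (the collection and field names are invented, not the real schema), the problematic shape was an aggregation along these lines, where $group cannot use an index and every multi-megabyte document has to be read:

// groups the whole collection; huge documents make this scan very expensive
db.reports.aggregate([
  { $group: { _id: "$customer_id", total: { $sum: 1 } } }
]);

// an indexed $match before the $group at least limits how many large
// documents get scanned
db.reports.aggregate([
  { $match: { created_at: { $gte: ISODate("2021-01-01") } } },
  { $group: { _id: "$customer_id", total: { $sum: 1 } } }
]);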

The exact same query on a collection with smaller documents executed very quickly, and the resource consumption was not very high.

Hence, querying large documents in MongoDB can have a big impact on performance, especially with aggregations.

Also, if you know that the document will continue to grow after it is created (e.g. when including log events in a given entity/document), consider creating a separate collection for these child items, because their size can also become a problem in the future.
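
A small sketch of that idea, with hypothetical names: keep the parent document a fixed size and append the growing items to a child collection keyed by the parent's _id:

// growing log events live in their own collection instead of an
// ever-growing array inside the parent document
db.entityEvents.createIndex({ entity_id: 1, at: 1 });
db.entityEvents.insert({ entity_id: entityId, at: new Date(), msg: "created" });

// fetch a page of events for one entity without touching the parent
db.entityEvents.find({ entity_id: entityId }).sort({ at: 1 }).limit(100);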

Bruno.

Short answer: yes.

Long answer: how it will affect the queries depends on many factors, like the nature of the queries, the memory available and the sizes of the indexes.

The best thing you can do is test.

The code below will generate two collections named smallDocuments and bigDocuments, with 1024 documents each, differing only by a field 'c' containing a big string, and by _id. The bigDocuments collection will take up about 2 GB, so be careful running it.

const numberOfDocuments = 1024;

// 2MB string x 1024 ~ 2GB collection
const bigString = 'a'.repeat(2 * 1024 * 1024);

// generate and insert documents in two collections: smallDocuments and
// bigDocuments;
for (let i = 0; i < numberOfDocuments; i++) {
  let doc = {};
  // field a: integer between 0 and 9, equal in both collections;
  doc.a = ~~(Math.random() * 10);

  // field b: single character from 'a' to 'j', equal in both collections;
  doc.b = String.fromCharCode(97 + ~~(Math.random() * 10));

  //insert in smallDocuments collection
  db.smallDocuments.insert(doc);

  // field c: big string, present only in bigDocuments collection;
  doc.c = bigString;

  //insert in bigDocuments collection
  db.bigDocuments.insert(doc);
}

You can put this code in a file (e.g. create-test-data.js) and run it directly with the mongo shell, using this command:

mongo testDb < create-test-data.js

It will take a while. After that you can execute some test queries, like these:

const numbersToQuery = [];

// generate 100 random numbers to query documents using field 'a':
for (let i = 0; i < 100; i++) {
  numbersToQuery.push(~~(Math.random() * 10));
}

const smallStart = Date.now();
numbersToQuery.forEach(number => {
  // query using inequality conditions: slower than equality
  const docs = db.smallDocuments
    .find({ a: { $ne: number } }, { a: 1, b: 1 })
    .toArray();
});
print('Small: ' + (Date.now() - smallStart) + ' ms');

const bigStart = Date.now();
numbersToQuery.forEach(number => {
  // repeat the same queries on the bigDocuments collection; note that the
  // big field 'c' is omitted from the projection
  const docs = db.bigDocuments
    .find({ a: { $ne: number } }, { a: 1, b: 1 })
    .toArray();
});
print('Big: ' + (Date.now() - bigStart) + ' ms');

Here I got the following results:

Without an index:

Small: 1976 ms
Big: 19835 ms

After indexing field 'a' in both collections with .createIndex({ a: 1 }):

Small: 2258 ms
Big: 4761 ms

This demonstrates that queries on big documents are slower. Even with the index, the query time on bigDocuments is still more than double that on smallDocuments.

My suggestions are:

  1. Use equality conditions in queries ( https://docs.mongodb.com/manual/core/query-optimization/index.html#query-selectivity );
  2. Use covered queries ( https://docs.mongodb.com/manual/core/query-optimization/index.html#covered-query );
  3. Use indexes that fit in memory ( https://docs.mongodb.com/manual/tutorial/ensure-indexes-fit-ram/ );
  4. Keep documents small;
  5. If you need phrase queries using text indexes, make sure the entire collection fits in memory ( https://docs.mongodb.com/manual/core/index-text/#storage-requirements-and-performance-costs , last bullet);
  6. Generate test data and make test queries, simulating your app's use case; use random string generators if needed.

I had problems with text queries on big documents using MongoDB; see: Autocomplete and text search memory issues in apostrophe-cms: need ideas

Here is some code I wrote to generate sample data in ApostropheCMS, along with some test results: https://github.com/souzabrs/misc/tree/master/big-pieces

This is more a database design issue than a MongoDB internal one. I think MongoDB was made to behave this way. But it would help a lot to have a more obvious explanation of it in the documentation.
