Azure Cosmos DB aggregation and indexes

I'm trying to use Cosmos DB and I'm having some trouble making a simple count in a collection.

My collection schema is below, and I have 80,000 documents in this collection.

{
    "_id" : ObjectId("5aca8ea670ed86102488d39d"),
    "UserID" : "5ac161d742092040783a4ee1",
    "ReferenceID" : 87396,
    "ReferenceDate" : ISODate("2018-04-08T21:50:30.167Z"),
    "ElapsedTime" : 1694,
    "CreatedDate" : ISODate("2018-04-08T21:50:30.168Z")
}

If I run the command below to count all documents in the collection, I get the result very quickly:

db.Tests.count()

But when I run the same command for a specific user, I get the message "Request rate is large":

db.Tests.find({UserID:"5ac161d742092040783a4ee1"}).count()

In the Cosmos DB documentation I found this scenario, and the suggestion is to increase RUs. Currently I have 400 RU/s; when I increase it to 10,000 RU/s I can run the command without errors, but it takes about 5 seconds.

I already tried to create an index explicitly, but it seems Cosmos DB doesn't use the index for the count.

I don't think it is reasonable to have to pay for 10,000 RU/s just to run a simple count over a collection of roughly 100,000 documents, especially when it still takes about 5 seconds.

Count-by-filter queries DO use indexes if they are available.

If you try a count-by-filter on a column that is not indexed, the query will not time out; it will fail. Try it. You should get an error along the lines of:

{"Errors":["An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request."]} {"Errors":["一个无效的查询被指定为针对从索引中排除的路径的过滤器。考虑在请求中添加允许扫描标头。"]}

So definitely add a suitable index on UserID.
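
For reference, a minimal sketch of what that can look like from the Mongo shell, assuming the account uses Cosmos DB's API for MongoDB at a server version (3.6+) where createIndex on an arbitrary field is supported:

// single-field ascending index on UserID
db.Tests.createIndex({ UserID: 1 })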

If you don't have index coverage but don't get the above error, then you have probably set the enableScanInQuery flag. That is almost always a bad idea, and a full scan will not scale: it will consume increasingly large amounts of RU as your dataset grows. So make sure it is off and rely on an index instead.

When you DO have an index on the selected column, your query should run. You can verify that the index is actually being used by sending the x-ms-documentdb-populatequerymetrics header, which should return confirmation in the indexLookupTimeInMs and indexUtilizationRatio fields. Example output:

"totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14; indexLookupTimeInMs=0.11 ;documentLoadTimeInMs=0.00;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;outputDocumentSize=0;writeOutputTimeInMs=0.01; indexUtilizationRatio=0.00 " “totalExecutionTimeInMs = 8.44; queryCompileTimeInMs = 8.01; queryLogicalPlanBuildTimeInMs = 0.04; queryPhysicalPlanBuildTimeInMs = 0.06; queryOptimizationTimeInMs = 0.00; VMExecutionTimeInMs = 0.14; indexLookupTimeInMs = 0.11; documentLoadTimeInMs = 0.00; systemFunctionExecuteTimeInMs = 0.00; userFunctionExecuteTimeInMs = 0.00; retrievedDocumentCount = 0; retrievedDocumentSize = 0; outputDocumentCount =1;outputDocumentSize=0;writeOutputTimeInMs=0.01; indexUtilizationRatio=0.00 "

It also gives you some insight into where the effort went if you feel the RU charge is too large.
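
For what it's worth, here is a rough sketch of requesting those metrics through the @azure/cosmos Node SDK against the SQL (Core) API. The option and property names used here (populateQueryMetrics, requestCharge, queryMetrics) are assumptions based on that SDK and may differ between versions, and this path does not apply if you are talking to the account over the MongoDB wire protocol:

const { CosmosClient } = require("@azure/cosmos");

async function countForUser(endpoint, key, userId) {
  // "MyDatabase" and "Tests" are hypothetical names for this sketch
  const container = new CosmosClient({ endpoint, key })
    .database("MyDatabase")
    .container("Tests");

  const iterator = container.items.query(
    {
      query: "SELECT VALUE COUNT(1) FROM c WHERE c.UserID = @userId",
      parameters: [{ name: "@userId", value: userId }]
    },
    { populateQueryMetrics: true } // ask the service to return per-query metrics
  );

  const response = await iterator.fetchAll();
  console.log("count:", response.resources[0]);
  console.log("request charge (RU):", response.requestCharge);
  console.log("query metrics:", response.queryMetrics);
}

Whichever SDK you use, look for its equivalent of a feed/request options bag; that is also where a scan flag such as enableScanInQuery tends to live, so it is easy to check that it is not switched on.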

If the index lookup time itself is too high, consider whether your index is selective enough and whether the index settings are suitable. Look at your UserID values and their distribution and adjust the index accordingly.


Another wild guess worth checking is whether the API you are using defers executing find(..) until it knows that count() is really what you are after. It is unclear which API you are using. If it turns out that it fetches all matching documents to the client side before doing the counting, that would explain the unexpectedly high RU cost, especially if there is a large number of matching documents or the documents are large. Check the API documentation.
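
A quick way to see the difference from the Mongo shell (just a sketch; the actual RU charge depends on your data): the first form asks the server for the count and only the number comes back, while the second drags every matching document to the client and counts there.

// server-side count: only the number is returned
db.Tests.find({ UserID: "5ac161d742092040783a4ee1" }).count()

// client-side count: every matching document is fetched first
db.Tests.find({ UserID: "5ac161d742092040783a4ee1" }).toArray().length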

I also suggest executing the same query directly in the Azure Portal to compare the RU cost and verify whether or not the issue is client-related.

I think it just doesn't work.

The index seems to be used when selecting the documents to be counted, but the count itself is then done by reading each document, which effectively consumes a lot of RU.

This query is cheap and fast:

db.Tests.count({ UserID: { '$eq': '5ac161d742092040783a4ee1' }})

but this one is slow and expensive:

db.Tests.count({ ReferenceID: { '$gt': 10 }})

even though this query is fast:

db.Tests.find({ ReferenceID: { '$gt': 10 }}).sort({ ReferenceID: 1 })
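
For completeness, here is the same range count expressed as an aggregation pipeline (a sketch that assumes the aggregation pipeline and the $count stage are enabled on the account). I have not verified whether it is charged any differently, so treat it as something to measure rather than as a fix:

db.Tests.aggregate([
  { $match: { ReferenceID: { $gt: 10 } } }, // the filter can still use the index
  { $count: "total" }                       // emits { "total": <n> }
])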

I also found this: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/36142468-make-count-aware-of-indexes. Note the status: "We have started work on this feature. Will update here when this becomes generally available."

Pretty disappointing to be honest, especially since this limitation hasn't been addressed for almost 2 years. Note - I am not an expert in this matter and I'd love to be proven wrong, since I also need this feature.

BTW: I noticed that simple indexes seem to be created automatically for each individual field, so no need to create them manually.
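
If you want to confirm which indexes exist, listing them from the shell should show what Cosmos DB created (assuming getIndexes() is supported by the account's API for MongoDB version):

// list the indexes defined on the collection
db.Tests.getIndexes()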
