简体   繁体   中英

retrieving a document by id is slow across partitions in cosmos db

I have a scenario where I need to retrieve a single document based on its id property from azure cosmos db. The only problem is I don't know the partition key and thus cannot use the document URI to access it.

From my understanding writing a simple query like

SELECT * from c WHERE c.id = "id here"

should be the way to go but I'm experiencing severe performance issues with this query. Most queries take 30s to 60s to complete and seem to consume insane amounts of RU/s. When executing 10 concurrent queries the max RU/s per partition went as high as 30.000. (10.00 per partition was provisioned) Resulting in throttling and even slower responses.

The collection comprises 10 partitions with around 3 Mb per partition, so 30 Mb in total and around 1,00,000 documents. My indexing policy looks like this:

{
    "indexingMode": "lazy",
    "automatic": true,
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                {
                    "kind": "Range",
                    "dataType": "Number",
                    "precision": -1
                },
                {
                    "kind": "Hash",
                    "dataType": "String",
                    "precision": 3
                }
            ]
        }
    ],
    "excludedPaths": []
}

And the consistency is set to EVENTUAL since I don't really care about read/write order. The collection is under some write pressure with about 30 writes per minute and there's a TTL of 1 year for each document, yet this doesn't seem to produce a measurable impact on the RU/s. I experience this sort of problem only when querying documents.

Has anyone had similar problems and can offer a fix/mitigation? Am I doing something wrong with my query or indexing policy? I don't know why my query is consuming that much resources.

I got similar problem even. My database is 16 GB with 2 partitions and has 10,000 RU per partition.

By gathering query metrics, I found that query by id could be doing table scan and not looking up from index.

Here is the metrics of query by id:

SELECT * FROM c where c.id = 'id-here'
--Read 1 record in 1497.00 ms, 339173.109 RU
--QueryPreparationTime(ms): CompileTime = 2, LogicalBuildTime = 0, 
     PhysicalPlanBuildTime = 0, OptimizationTime = 0
--QueryEngineTime(ms): DocumentLoadTime = 1126, IndexLookupTime = 0, 
     RuntimeExecutionTimes = 356, WriteOutputTime = 0

Notice the time spent mostly in DocumentLoadTime and IndexLookupTime = 0 .

While query by indexed field is pretty fast.

SELECT * FROM c WHERE c.indexedField = 'value'
--Read 4 records in 2.00 ms, 7.56 RU
--QueryPreparationTime(ms): CompileTime = 0, LogicalBuildTime = 0, 
       PhysicalPlanBuildTime = 0, OptimizationTime = 0
--QueryEngineTime(ms): DocumentLoadTime = 0, IndexLookupTime = 1, 
       RuntimeExecutionTimes = 0, WriteOutputTime = 0

Contrast to the query by id, this doesn't consumed DocumentLoadTime as the index was used, IndexLookupTime is 1 ms.

The problem is id should be the primary key and should be indexed by default but it looks like it doesn't. You couldn't even add custom indexing policy for it.

I'm currently logged a ticket to Microsoft support and waiting for clarifications.

Update:

Microsoft support responded and they've resolved the issue. They've added IndexVersion 2 for the collection. Unfortunately, it is not yet available from the portal and newly created accounts/collection are still not using the new version. You'll have to contact Microsoft Support to made changes to your accounts.

Here are the new results from a collection with index version 2 and there's a massive improvement.

SELECT * FROM c where c.id = 'uniqueValue'
-- Index Version 1: Request Charge: 344,940.79 RUs
-- Index Version 2: Request Charge: 3.31 RUs

SELECT * FROM c WHERE c.indexedField = 'value' AND c.id = 'uniqueValue'
-- Index Version 1: Request Charge: 150,666.22 RUs 
-- Index Version 2: Request Charge: 5.65 RUs

My test DB about 300k record When i try to select with ID only like this

SELECT * FROM c where c.id = 'xxx'

It take me alot of time and RU

But when i try with partition key in that

SELECT * FROM c where c.id = 'xxx' AND c.partitionField = 'yyy'

It's very fast

So I think you must recontruct your db, and thinking which field to make a partition

The key to Cosmos is to Rethink the Partition Key . I don't know what you are using, but make it very available.

Recently I have been adding a 'Table' property to all of my documents, but you could very easily use the Table name as the Partition Key! It's really almost like having a bunch SQL tables just kind of floating around in the pudding that is a CosmosDB collection.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM