
Store “extended” metadata on entities stored in Azure Cosmos DB as JSON documents

We are building a REST API in .NET, deployed to Azure App Service / Azure API App. Through this API, clients can create and query "Products". The product entity has a set of common fields that every client must provide when creating a product, such as the fields below (example):

{
"id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
"name": "Product One",
"description": "The number one product"
}

We store these products currently as self-contained documents in Azure Cosmos DB.

Question 1: Partitioning. The collection will not store a huge number of documents: at most around 2,500,000 documents of 1-5 KB each (estimates). We have currently chosen the id field (our system-generated id, not the internal Cosmos DB document id) as the partition key, which means 2,500,000 logical partitions with one document each. The documents will be used in some low-latency workloads, but those workloads will query by id (the partition key). Clients will also query by other fields, e.g. name, which results in a fan-out query, but those queries are not latency-critical. In the portal you can no longer create a single-partition collection, but you can do it from the SDK or by using a fixed partition key value. If we keep all these documents in one single partition (we are talking about data far below 10 GB here), we will never get fan-out queries and will instead rely on the index within the one logical partition. So the question: even though we don't have huge amounts of data, is it still wise to partition the way we currently do?
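As a quick sanity check on the size estimate, the numbers above can be multiplied out. This is only a rough sketch: it uses decimal units, counts raw document payload only (no index or system overhead), and treats the 10 GB figure as the single-partition limit mentioned in the question.

```python
# Back-of-the-envelope size estimate using the figures from the question.
DOC_COUNT = 2_500_000   # maximum number of documents (estimate)
MIN_DOC_KB = 1          # smallest expected document size
MAX_DOC_KB = 5          # largest expected document size
LIMIT_GB = 10           # single-partition figure quoted in the question


def total_gb(doc_count: int, doc_kb: float) -> float:
    """Raw payload size in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return doc_count * doc_kb / 1_000_000


low = total_gb(DOC_COUNT, MIN_DOC_KB)
high = total_gb(DOC_COUNT, MAX_DOC_KB)

print(f"estimated data size: {low:.1f} GB to {high:.1f} GB")
print(f"upper estimate fits under {LIMIT_GB} GB: {high <= LIMIT_GB}")
```

With these estimates the lower bound is comfortable, but the upper bound (every document at 5 KB) would not fit under 10 GB, so the single-partition option depends on the real average document size.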

Question 2: Extended metadata. We will face clients that want to write client/application/customer-specific metadata beyond the basic common fields. What is the best way to do this?

Some brainstorming from me below.

1: Just dump everything in one self-contained document.

One option is to let clients add a nested "extendedMetadata" field with key-value pairs when creating a product through the API. Cosmos DB is schema agnostic, so in theory this should work fine. Some products can have zero extended metadata, while others can have a lot. For the clients, we can promise the basic common fields, but for the extended metadata field we cannot promise anything in terms of number of fields, naming etc. The document size will therefore vary. As mentioned, these products will still be used in latency-critical workloads that query by "id" (the partition key). The extended metadata will never be used in any latency-critical workload. How, and how much, does document size affect performance / throughput in general? For the latency-critical read scenario, will the query optimizer go straight to the right partition and then use the index to quickly retrieve only the document fields of interest? Or will the whole document always be loaded and processed, regardless of which fields you query?

{
"id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
"name": "Product One",
"description": "The number one product",
"extendedMetadata": {
    "prop1": "prop1",
    "prop2": "prop2",
    "propN": "propN"
}
}

The extended metadata is only useful to retrieve from the same API in certain situations. We can then do something like:

  • api.org.com/products/{id} -- will always return a product with the basic common fields
  • api.org.com/products/{id}/extended -- will return the full document (basic + extended metadata)

2: Split the document

Another option might be some kind of splitting. If a client creates a product through the API and extendedMetadata contains data, we can implement some logic that splits the document. The split can be done in many ways; see the brainstorming below. The main objective of splitting the documents (which requires more work on write operations) is to get better throughput in case document size plays a significant role here (in most cases, the clients will be fine with the basic common fields).

  • One basic document that only contains the basic common fields, and one extended document (with the same id) that contains the basic common fields + extended metadata (duplicating the basic common fields). We can add a "type" field that differentiates between the basic and the extended document. If a client asks for extended, we will only query documents of type "extended".
  • One basic document that only contains the basic common fields + a reference to an extended document that only contains the extended metadata. This means that a read operation where the client asks for a product with extended metadata requires reading two documents.
  • Look into splitting it across different collections: one collection holds the basic documents, with throughput dedicated to low-latency read scenarios, and one collection holds the extended metadata.
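To make the second variant (basic document plus a reference to an extended document) a bit more concrete, here is a minimal sketch of the split on write, using plain dictionaries. The field names "type", "productId" and "extendedRef", and the "-extended" id suffix, are made up for illustration and are not part of any Cosmos DB convention.

```python
from copy import deepcopy


def split_product(product: dict) -> list[dict]:
    """Split an incoming product into a basic document and, only when
    extendedMetadata holds data, a separate extended document."""
    # Basic document: everything except the extended metadata.
    basic = {k: deepcopy(v) for k, v in product.items() if k != "extendedMetadata"}
    basic["type"] = "basic"  # hypothetical discriminator field
    docs = [basic]

    extended = product.get("extendedMetadata")
    if extended:  # skip when the field is missing or empty
        extended_doc = {
            "id": f"{product['id']}-extended",  # hypothetical id scheme
            "productId": product["id"],         # back-reference to the basic document
            "type": "extended",
            "extendedMetadata": deepcopy(extended),
        }
        basic["extendedRef"] = extended_doc["id"]  # hypothetical reference field
        docs.append(extended_doc)

    return docs
```

A product without extended metadata yields a single document, so only products that actually carry extra data pay the cost of the second write.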

Sorry for the long post. I hope this was understandable, and I'm looking forward to your feedback!

Answer 1:

If you can guarantee that the total size of the documents will never exceed 10 GB, then creating a fixed collection is the way to go, because there is no need for cross-partition queries. I'm not saying it will be lightning fast without partitioning, but because you are only interacting with a single physical partition, it will be faster than going into every physical partition looking for data.

(Keep in mind however that every time people think that they can guarantee things like max size of something, it usually doesn't work out.)

The /id partitioning strategy is only efficient if you can ALWAYS provide the id. Fetching a document by its id and partition key is called a read. If you need to search by any other property, you are performing a query. This is where the system wouldn't do so well.

Ideally you should design your Cosmos DB collection so that you never do a cross-partition query as part of your everyday workload. Maybe once in a blue moon for reporting reasons.
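The difference between a read and a query might look roughly like this with the azure-cosmos Python SDK (the question uses .NET, so this is only an illustration). The sketch assumes `container` is a `ContainerProxy` for a collection partitioned on /id and that the document id is the same value as the partition key; the function names are made up.

```python
def read_product(container, product_id: str) -> dict:
    """Point read: both the document id and the partition key value are known,
    so the request goes straight to a single partition."""
    return container.read_item(item=product_id, partition_key=product_id)


def find_products_by_name(container, name: str) -> list[dict]:
    """Query on a non-key property: with /id as the partition key this
    has to fan out across partitions."""
    items = container.query_items(
        query="SELECT * FROM c WHERE c.name = @name",
        parameters=[{"name": "@name", "value": name}],
        enable_cross_partition_query=True,
    )
    return list(items)
```

The first function is the case the /id strategy is good at; the second is the fan-out case that should stay out of latency-critical paths.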

Answer 2:

Cosmos DB is a schema-less NoSQL database for a reason. The second approach in your brainstorming would fit a traditional RDBMS, but we don't have that here. You can simply go with your first approach and either put all the extended metadata under a single property or keep it at the top level of the document.

Remember that you can map the response to any object you want, so you can simply have 2 DTOs, a slim version and an extended version, and map to one or the other depending on the endpoint.
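As a rough illustration of the two-DTO idea (in Python rather than .NET, with made-up class and function names), the same stored document can be mapped to either shape:

```python
from dataclasses import dataclass, field


@dataclass
class ProductDto:
    """Slim shape: only the basic common fields."""
    id: str
    name: str
    description: str


@dataclass
class ExtendedProductDto(ProductDto):
    """Extended shape: basic fields plus whatever metadata the client stored."""
    extended_metadata: dict = field(default_factory=dict)


def to_slim(doc: dict) -> ProductDto:
    # Ignores extendedMetadata even if it is present in the document.
    return ProductDto(id=doc["id"], name=doc["name"], description=doc["description"])


def to_extended(doc: dict) -> ExtendedProductDto:
    return ExtendedProductDto(
        id=doc["id"],
        name=doc["name"],
        description=doc["description"],
        extended_metadata=dict(doc.get("extendedMetadata", {})),
    )
```

Here /products/{id} would return `to_slim(doc)` and /products/{id}/extended would return `to_extended(doc)`, with both endpoints reading the same single document.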

Hope this helps.
