Azure Cosmos DB partition key - is primary key acceptable?

Original post 2018-06-27 21:01:22 7 4 azure/ azure-cosmosdb/ database-partitioning

Our Azure Cosmos DB collection has gotten large enough to require a partition key. In doing some reading about this, I get the impression that the best partition key is one that provides for even distribution and higher cardinality. This article from Microsoft discusses it.

Using a primary key as a partition key provides for even distribution, but a cardinality of only 1. If this is my only option, is this a bad thing? The aforementioned article gives a few examples and seems to indicate that the primary key should be used as a partition key in those instances. In the case of Azure Cosmos DB, the partitions are logical, not physical. So it wouldn't lead to having each document on its own disk, but it seems like it could lead to a bloated index.

Is using a primary key as a partition key a common practice? Are there any downsides to it?

4 Solutions

Actually, the choice of partition key deserves careful and repeated consideration. Since using the primary key as the partition key is your only option, I will just discuss some of the possible downsides for your reference.

In terms of performance, if a query filters on a field other than the partition key, it has to run across partitions, which slows it down. Arguably, if the amount of data is small, the effect will be minor.
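The fan-out described above can be illustrated with a toy in-memory store. This is only a sketch: `PartitionedStore` and the `city` field are made up for illustration, and the toy counts logical partitions, whereas the real service fans a cross-partition query out over its physical partitions, so the actual penalty is smaller than this model suggests.

```python
from collections import defaultdict


class PartitionedStore:
    """Toy in-memory store that groups documents by partition key value."""

    def __init__(self, partition_key_field):
        self.partition_key_field = partition_key_field
        self.partitions = defaultdict(list)

    def insert(self, doc):
        self.partitions[doc[self.partition_key_field]].append(doc)

    def query(self, field, value):
        """Return (matches, number of partitions scanned)."""
        if field == self.partition_key_field:
            # Filter is on the partition key: only one partition is touched.
            candidates = [self.partitions.get(value, [])]
        else:
            # Any other field: every partition has to be checked.
            candidates = list(self.partitions.values())
        matches = [d for part in candidates for d in part if d[field] == value]
        return matches, len(candidates)


# Hypothetical data: the primary key ("id") doubles as the partition key.
store = PartitionedStore(partition_key_field="id")
for i in range(1000):
    store.insert({"id": f"doc-{i}", "city": "Oslo" if i % 2 else "Lima"})

by_id, scanned_by_id = store.query("id", "doc-42")
by_city, scanned_by_city = store.query("city", "Oslo")
print(scanned_by_id, scanned_by_city)
```

The lookup by `id` touches one partition, while the lookup by `city` has to visit all of them.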

In terms of cost, Cosmos DB is charged primarily by storage space and RU consumption. As you said, choosing the primary key as the partition key will lead to more index storage. If most queries are cross-partition, it also leads to higher RU consumption.

As for stored procedures, triggers and UDFs, you can't run cross-partition transactions through stored procedures and triggers. They are scoped to a partition, so you need to specify a partition key value when you use them, and here each value covers only one document (cardinality is only 1).

Just note that once a partition key is created, it cannot be deleted or modified later. So consider it carefully before you choose, and make sure you back up your data.

For more details, refer to the official doc.

No, there is no downside to it. Strive to have a partition key with high cardinality. Don't worry about indexes or physical partitions etc.

You can have millions of partition keys and 10 physical partitions. Physical partitions are created behind the scenes by Cosmos DB. You should never worry about physical partitions.
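As a rough illustration of how many logical partition keys can share a handful of physical partitions, here is a minimal sketch that hashes key values into 10 buckets. SHA-256 modulo N is a stand-in chosen for this example; the service uses its own internal hashing and range assignment, and `physical_partition_for` is a hypothetical name.

```python
import hashlib
from collections import Counter

PHYSICAL_PARTITIONS = 10  # assumed count, matching the figure in the answer


def physical_partition_for(partition_key: str) -> int:
    """Map a partition key value to a physical partition by hashing it.

    SHA-256 modulo N is a stand-in; the service's own scheme differs.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PHYSICAL_PARTITIONS


# 100,000 distinct partition key values, e.g. one per document id.
load = Counter(physical_partition_for(f"doc-{i}") for i in range(100_000))
for partition, count in sorted(load.items()):
    print(partition, count)
```

With high-cardinality keys the documents spread close to evenly over the buckets, which is why the number of distinct key values is not something to limit.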

You could say that the primary key is the safest and probably the most appropriate choice for a partition key.

It guarantees uniqueness of the value, which, other than unique keys, is the only way to achieve it. The distribution will be even, and because the primary key will be your partition key, you will be able to retrieve a document by reading it directly instead of querying, which reduces latency and cost.
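The difference between reading and querying might look like the sketch below. The `read_item` and `query_items` calls follow the container client of the `azure-cosmos` Python SDK as I understand it; the `email` field, the helper names, and `FakeContainer` (a stand-in so the snippet runs without the SDK or an Azure account) are assumptions made for illustration.

```python
def read_by_id(container, doc_id):
    """Point read: works when the partition key value equals the document id."""
    return container.read_item(item=doc_id, partition_key=doc_id)


def find_by_email(container, email):
    """Query on a non-key field: has to fan out across partitions."""
    return list(
        container.query_items(
            query="SELECT * FROM c WHERE c.email = @email",
            parameters=[{"name": "@email", "value": email}],
            enable_cross_partition_query=True,
        )
    )


class FakeContainer:
    """Stand-in for a real container client; it ignores the query text."""

    def __init__(self, docs):
        self._docs = {d["id"]: d for d in docs}

    def read_item(self, item, partition_key):
        # Direct lookup by id, mimicking a point read.
        return self._docs[item]

    def query_items(self, query, parameters, enable_cross_partition_query):
        # Scan everything and filter on the single parameter value.
        wanted = parameters[0]["value"]
        return (d for d in self._docs.values() if d["email"] == wanted)


container = FakeContainer(
    [
        {"id": "u1", "email": "a@example.com"},
        {"id": "u2", "email": "b@example.com"},
    ]
)
user = read_by_id(container, "u1")
matches = find_by_email(container, "b@example.com")
print(user["email"], [m["id"] for m in matches])
```

Against a real container, the first helper needs only the id, while the second has to evaluate the filter in every partition.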

I think that MS does not do a great job of describing how to best determine a partition key for Cosmos DB - especially if folks are generally suggesting to use the Primary Key of the database as the partition key (which may be perfectly acceptable sometimes, but I can't see how it would be the norm).

In a recent project, this is how we decided to identify a partition key and item id for the objects in our system. I think this would apply to many systems that have natural composite primary key candidates on their objects.

In our system, every object is restricted to a state (StateCode) and vendor (VendorId). From there, we have multiple entities like Sales Orders, Customers, Widgets, ... In our SQL Server implementation, every table had an obvious natural composite primary key of StateCode, VendorId, EntityId. In the Cosmos DB scenario, we chose the Partition Key to be StateCode-Vendor-EntityType with an Item Id of EntityId. This allows all the entities of a specific type to be queried within a partition (saving RUs) while still allowing very simple querying within that partition (e.g., homogeneous entities). You end up utilizing all parts of the composite natural key in this way, but allow for actual partitioning of entities.
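A minimal sketch of that key scheme follows. The class, the `partitionKey` property name, and the sample values are hypothetical; only the StateCode-Vendor-EntityType layout and the use of EntityId as the item id come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityKey:
    state_code: str
    vendor_id: str
    entity_type: str
    entity_id: str

    @property
    def partition_key(self) -> str:
        # StateCode-Vendor-EntityType: one partition per entity type
        # for each state and vendor combination.
        return f"{self.state_code}-{self.vendor_id}-{self.entity_type}"

    @property
    def item_id(self) -> str:
        # The remaining part of the natural composite key.
        return self.entity_id


def to_document(key: EntityKey, body: dict) -> dict:
    """Attach the item id and partition key to a document body."""
    return {**body, "id": key.item_id, "partitionKey": key.partition_key}


# Hypothetical sample values.
key = EntityKey("WA", "V123", "SalesOrder", "SO-1001")
doc = to_document(key, {"total": 250})
print(doc)
```

All sales orders for one state and vendor then share a partition key value, so listing them is a single-partition query, and a single order can still be fetched by partition key plus id.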

In more complicated scenarios, where we want to query across entities for a given vendor, we can remove EntityType from the partition key and either move it into the item id or use it to filter the objects being searched. This allows cross-entity querying within a partition, but the query itself is slightly more complicated because of heterogeneous entities.

If the entire ID of the entity is in the Partition Key, then you pretty much always have to look up the item individually, or search every partition when not looking up by ID - at which point, who cares how evenly your data is distributed across partitions if you have to search them all anyway?

Perhaps the OP can describe more about the entities - do they have natural composite key candidates (regardless of whether they're being used or not in SQL implementation)? If not, what does the current persistence layer look like in terms of identifying items in the system by some id?