
Choosing a partition key for a Cassandra table: how many is too many partitions?

Original post: 2015-06-04 15:50:16. Tags: cassandra, data-modeling

I have an application where the 'natural' partition key for a Cassandra table seems like it would be 'customer'. This is the primary way we want to query the data, and we would get good data distribution, etc.

But if there were well over 1 million customers, would that be too many different partitions?

Should I choose a partition key that results in a smaller number of partition keys?

I've looked at a number of the related questions on this topic but none seem to address this particular point.

3 Answers

But if there were well over 1 million customers, would that be too many different partitions?

No. The Murmur3Partitioner can handle something like 2^64 partitions (tokens range from -2^63 to +2^63-1). Cassandra is designed to be very good at storing large amounts of data and retrieving it by partition key. There is a limit on the number of cells (columns) within a single partition (2 billion), but for the total number of partitions I think you'll be fine with what you have.
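To put those numbers in perspective, here is a quick back-of-the-envelope sketch in Python. The figure of 1,000,000 customers comes from the question; the rest is plain arithmetic on a signed 64-bit token range, not a measurement from any real cluster.

```python
# Murmur3Partitioner tokens are signed 64-bit integers.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1
TOKEN_SPACE = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 possible token values

customers = 1_000_000  # figure taken from the question

# Fraction of the token space that one million distinct keys could occupy.
fraction_used = customers / TOKEN_SPACE

print(f"token space: {TOKEN_SPACE:,}")
print(f"fraction used by {customers:,} customers: {fraction_used:.2e}")
```

Even tens of millions of customers would occupy a vanishingly small share of the token space.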

Should I choose a partition key that results in a smaller number of partition keys?

Definitely not. That could cause your partitions to grow too big, and/or develop "hot spots" in your cluster.
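A rough way to see the hot-spot effect is to simulate it. The sketch below is only an illustration under loud assumptions: it uses a truncated MD5 as a stand-in hash (not Cassandra's Murmur3), places each key on a node with a simple modulo (not real token-range ownership), and the `customer-N` ids and the five region names are hypothetical.

```python
import hashlib
from collections import Counter

def stand_in_token(key: str) -> int:
    """Stand-in hash (MD5 truncated to 64 bits); NOT Cassandra's Murmur3."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def rows_per_node(partition_keys, node_count):
    """Count rows per node, assuming node = token % node_count (a simplification)."""
    return Counter(stand_in_token(k) % node_count for k in partition_keys)

NODES = 12
ROWS = 60_000

# Option A: one partition per customer (hypothetical ids, one row each).
by_customer = rows_per_node((f"customer-{i}" for i in range(ROWS)), NODES)

# Option B: a coarse partition key with only 5 distinct values (hypothetical regions).
regions = ["na", "eu", "apac", "latam", "mea"]
by_region = rows_per_node((regions[i % 5] for i in range(ROWS)), NODES)

print("nodes holding data, by customer:", len(by_customer))
print("nodes holding data, by region:  ", len(by_region))
print("largest node share, by customer:", max(by_customer.values()))
print("largest node share, by region:  ", max(by_region.values()))
```

With only five distinct keys, at most five of the twelve nodes can hold any data, and each of those carries a large share of the rows. With one key per customer the rows spread across every node.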

The main task in picking a good partition key is to find one that both offers good data distribution in the cluster and matches your query patterns. And from what I'm reading, it sounds like you have done exactly that.

I think you misunderstand how the partition key is used. The recommended partitioner takes your partition key values and computes a hash from them (Murmur3Partitioner derives a 64-bit token from a 128-bit MurmurHash3 hash). That hash is called the token of the record, and it is the token value that determines where your record is stored. Each Cassandra node has a set of token ranges associated with it. If the token of a record falls within one of a node's ranges, the record is stored on that node. The number of partitions is not determined by your choice of partition key: it is the number of token ranges in your cluster. That is roughly equal to the total number of vnodes you selected when you configured your data store nodes.
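The token-range lookup described above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the hash is a truncated MD5 standing in for Murmur3, vnode tokens are derived from made-up node names rather than assigned by the cluster, replication is ignored, and `build_ring` and `owner` are hypothetical helper names.

```python
import bisect
import hashlib

def stand_in_token(key: str) -> int:
    """Signed 64-bit token from MD5; a stand-in for Murmur3, not the real hash."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big", signed=True)

def build_ring(nodes, vnodes_per_node):
    """Give each node several tokens; return parallel sorted lists (tokens, owners)."""
    pairs = sorted(
        (stand_in_token(f"{node}#{v}"), node)
        for node in nodes
        for v in range(vnodes_per_node)
    )
    tokens = [t for t, _ in pairs]
    owners = [n for _, n in pairs]
    return tokens, owners

def owner(ring, key):
    """Find the node whose range (previous token, token] contains the key's token."""
    tokens, owners = ring
    i = bisect.bisect_left(tokens, stand_in_token(key))
    return owners[i % len(owners)]  # wrap around past the largest token

nodes = ["node-a", "node-b", "node-c"]
ring = build_ring(nodes, vnodes_per_node=8)
print(owner(ring, "customer-42"))
```

The number of token ranges here is fixed by the ring (3 nodes times 8 vnodes), regardless of how many distinct customer keys are hashed onto it.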

You are good to go with your current partition key. There is no need to use a composite partition key to create more partitions. Are you doing any time series data modelling, where a partition keeps gaining more columns every second? If not, your current partition key will work for many millions of customers.
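If you are unsure whether a time-series pattern applies, a rough estimate helps. The sketch below assumes the 2 billion cell per-partition limit cited in the first answer and some hypothetical write rates per customer; it only shows how long a single partition could grow before hitting that hard limit, which is an upper bound rather than a sizing recommendation.

```python
CELL_LIMIT = 2_000_000_000  # per-partition cell limit cited in the first answer
SECONDS_PER_DAY = 86_400

def days_until_limit(cells_per_second: float) -> float:
    """Days of continuous writes before one partition reaches CELL_LIMIT."""
    return CELL_LIMIT / cells_per_second / SECONDS_PER_DAY

# Hypothetical sustained write rates for a single customer partition.
for rate in (1, 10, 100):
    print(f"{rate:>3} cells/s -> {days_until_limit(rate):,.1f} days")
```

If the estimate for your heaviest customer comes out uncomfortably short, that is the case where adding a time bucket to the partition key is worth considering.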