
How to choose the number of partitions for a Kafka topic?

We have a 3-node ZooKeeper cluster and 7 brokers. Now we have to create a topic and decide how many partitions to create for it.

But I did not find any formula to decide how many partitions I should create for this topic. The producer rate is 5k messages/sec and each message is 130 bytes.

Thanks in advance.

I can't give you a definitive answer; there are many patterns and constraints that can affect it. But here are some of the things you might want to take into account:

  • The unit of parallelism is the partition, so if you know the average processing time per message, you can calculate the number of partitions required to keep up. For example, if each message takes 100 ms to process, one consumer handles 10 messages per second, so at 5k messages a second you'll need at least 500 partitions. Add a percentage more than that to cope with peaks and variable infrastructure performance. Queuing theory can give you the math to calculate your parallelism needs.

  • How bursty is your traffic, and what latency constraints do you have? Following on from the last point, if you also have latency requirements then you may need to scale out your partitions to cope with your peak rate of traffic.

  • If you use any data-locality patterns or require ordering of messages, you need to consider future traffic growth. For example, say you deal with customer data, use the customer id as the partition key, and depend on each customer always being routed to the same partition, perhaps for event sourcing, or simply to ensure each change is applied in the right order. If you later add new partitions to cope with a higher rate of messages, each customer will likely be routed to a different partition. This can introduce headaches regarding guaranteed message ordering, as a customer now exists on two partitions. So you want to create enough partitions for future growth. Just remember that it is easy to scale consumers out and in, but partitions need some planning, so stay on the safe side and be future-proof.

  • Having thousands of partitions can increase overall latency.
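The arithmetic in the first bullet can be sketched as follows. This is only an illustration: the function name is made up, the numbers come from the question (5k msgs/sec) and the bullet's example (100 ms per message), and the headroom factor is an arbitrary assumption:

```python
import math

def partitions_for_processing(msg_rate_per_sec, processing_time_sec, headroom=1.25):
    """Rough minimum partition count so a consumer group can keep up.

    Each partition is consumed by at most one consumer in a group, and one
    consumer sustains 1 / processing_time_sec messages per second.
    """
    per_consumer_rate = 1.0 / processing_time_sec   # msgs/sec one consumer handles
    base = msg_rate_per_sec / per_consumer_rate     # partitions at 100% utilization
    return math.ceil(base * headroom)               # headroom for bursts and variance

# 5k msgs/sec at 100 ms each: 500 partitions at full load, 625 with 25% headroom
print(partitions_for_processing(5000, 0.100))
```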

This old benchmark by a Kafka co-founder is pretty useful for understanding the magnitudes of scale: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

The immediate conclusion from this, as Vanlightly said above, is that consumer handling time is the most important factor in deciding the number of partitions (since you are nowhere near challenging the producer throughput).

The maximal concurrency for consuming is the number of partitions, so you want to make sure that:

(processing time for one message in seconds × number of messages per second) / number of partitions << 1

If it equals 1, you cannot read faster than you write, and that is without mentioning bursts of messages and failures/downtime of consumers. So you need it to be significantly lower than 1; how much lower depends on the latency your system can endure.
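The criterion above is easy to check directly. This sketch just restates the answer's formula (the function name is invented for illustration; the example values reuse the question's 5k msgs/sec rate and the 100 ms processing time from earlier):

```python
def consumer_utilization(processing_time_sec, msgs_per_sec, num_partitions):
    """(processing time per message × message rate) / partitions.

    Must stay well below 1, or consumers cannot keep up with producers.
    """
    return (processing_time_sec * msgs_per_sec) / num_partitions

# 100 ms per message, 5k msgs/sec:
print(consumer_utilization(0.100, 5000, 500))    # ~1.0: consumers exactly saturated
print(consumer_utilization(0.100, 5000, 1000))   # ~0.5: headroom for bursts
```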

It depends on your required throughput, cluster size, and hardware specifications:

There is a clear blog post about this written by Jun Rao from Confluent: How to choose the number of topics/partitions in a Kafka cluster?

This might also be helpful for insight: Apache Kafka Supports 200K Partitions Per Cluster

Partitions = max(NP, NC)

where:

NP is the number of required producers, determined by calculating TT/TP

NC is the number of required consumers, determined by calculating TT/TC

TT is the total expected throughput for our system

TP is the max throughput of a single producer to a single partition

TC is the max throughput of a single consumer from a single partition

You could choose the number of partitions equal to the maximum of {throughput/#producers; throughput/#consumers}. Throughput is calculated as message volume per second. Here you have: throughput = 5k × 130 bytes = 650 KB/s.

For example, if you want to be able to read 1000 MB/s, but your consumer is only able to process 50 MB/s, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and one producer can only write at 100 MB/s, you need 10 partitions. In this case, with 20 partitions you can sustain 1 GB/s for both producing and consuming. You should adjust the exact number of partitions to the number of consumers or producers, so that each consumer and producer achieves its target throughput.
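The worked example above can be written out directly from the max(NP, NC) formula. The function name is illustrative; the numbers (1000 MB/s target, 100 MB/s per producer, 50 MB/s per consumer) come from the answer itself:

```python
import math

def required_partitions(tt, tp, tc):
    """#Partitions = max(NP, NC), where NP = TT/TP and NC = TT/TC.

    tt: total expected throughput for the system
    tp: max throughput of a single producer to a single partition
    tc: max throughput of a single consumer from a single partition
    """
    np_producers = math.ceil(tt / tp)   # producers needed to hit the write target
    nc_consumers = math.ceil(tt / tc)   # consumers needed to hit the read target
    return max(np_producers, nc_consumers)

# TT = 1000 MB/s, TP = 100 MB/s, TC = 50 MB/s -> max(10, 20) = 20 partitions
print(required_partitions(1000, 100, 50))
```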

So a simple formula could be:

#Partitions = max(NP, NC), where:

NP is the number of required producers, determined by calculating TT/TP

NC is the number of required consumers, determined by calculating TT/TC

TT is the total expected throughput for our system

TP is the max throughput of a single producer to a single partition

TC is the max throughput of a single consumer from a single partition

Source: https://docs.cloudera.com/runtime/7.2.10/kafka-performance-tuning/topics/kafka-tune-sizing-partition-number.html
