
maxOffsetsPerTrigger vs number of cores in the Spark cluster

For example, my Spark Structured Streaming application has Kafka as the message source, and below are the details of the different configurations.

Kafka setup:

Message source: Kafka

Partitions: 40

Input parameters:

maxOffsetsPerTrigger: 1000

Cluster setup:

Number of workers = 5

Number of cores/worker = 8

Question:

With the above setup, does it read

(1000 * 5 * 8) = 40000 messages every time

or

(1000 * 5) = 5000 messages every time

or

1000 messages, distributed across the 5 worker nodes?

Per the documentation:

Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
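For instance, if one of the 40 partitions had roughly twice the unread volume of the others, it would be assigned roughly twice as many of the 1000 offsets in that batch.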

So it's the last option in your list. With 40 partitions and maxOffsetsPerTrigger = 1000, each partition is assigned at most 1000 / 40 = 25 offsets per trigger; since those 40 partitions map onto the cluster's 40 cores (5 workers * 8 cores), each executor will process at most 8 * 25 = 200 offsets per trigger, split between its individual cores (25 offsets/core). But the batch could be smaller if you don't have enough data collected in the specific trigger period.
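For reference, here's a minimal sketch (Scala) of the source configuration described in the question, with maxOffsetsPerTrigger capping each micro-batch at 1000 offsets. The broker address, topic name, and checkpoint path are placeholders, not taken from the question:

```scala
// Requires the spark-sql-kafka-0-10 connector on the classpath.
import org.apache.spark.sql.SparkSession

object MaxOffsetsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("maxOffsetsPerTrigger-demo")
      .getOrCreate()

    // With 40 topic partitions and maxOffsetsPerTrigger = 1000, Spark plans
    // at most ~25 offsets per partition per micro-batch, regardless of how
    // many workers or cores the cluster has.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "my-topic")                   // placeholder topic
      .option("maxOffsetsPerTrigger", "1000")
      .load()

    df.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/demo-checkpoint") // placeholder path
      .start()
      .awaitTermination()
  }
}
```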

Also, newer versions of Spark (3.2+) have additional options, like minOffsetsPerTrigger, that allow processing bigger batches in case your trigger period didn't have enough data to process.
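A hedged sketch of those newer options, assuming Spark 3.2+ and reusing the `spark` session from the previous sketch. minOffsetsPerTrigger is paired with maxTriggerDelay, which bounds how long Spark waits for the minimum before firing a batch anyway; the threshold values here are illustrative, not recommendations:

```scala
// Continuing with the `spark` session from the sketch above.
val batchedDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker
  .option("subscribe", "my-topic")                    // placeholder topic
  .option("minOffsetsPerTrigger", "10000") // wait until >= 10k offsets are available...
  .option("maxTriggerDelay", "5m")         // ...but fire a batch after 5 minutes regardless
  .option("maxOffsetsPerTrigger", "50000") // still cap how big a single batch can get
  .load()
```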
