
maxOffsetsPerTrigger vs number of cores in the spark cluster

For example, my Spark Structured Streaming application uses Kafka as the message source, with the configuration details below.

Kafka setup:

Message source: Kafka

Partitions: 40

Input parameters:

maxOffsetsPerTrigger: 1000

Cluster setup:

Number of workers = 5

Number of cores/worker = 8

Question:

With the above setup, does it read

(1000 * 5 * 8) = 40000 messages every time

or

(1000 * 5) = 5000 messages every time

or

read 1000 messages and distribute them across the 5 worker nodes?

Per the documentation:

Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.

So it's the last option in your list: Spark reads at most 1000 offsets per trigger in total. With 40 partitions of roughly equal volume, that is 25 offsets per partition, and since each Kafka partition maps to one Spark task, each executor processes at most 200 offsets per trigger (8 cores × 25 offsets/core). The actual count can be smaller if not enough data arrived during the trigger period.
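The arithmetic behind that split can be sketched in plain Python (this is illustrative only, not a Spark API, and it assumes all 40 partitions carry equal volume):

```python
# Numbers from the question: how maxOffsetsPerTrigger is split
# across partitions and mapped onto the cluster's cores.
max_offsets_per_trigger = 1000
partitions = 40
workers = 5
cores_per_worker = 8

# The total is split proportionally; with equal volume that's an even split.
offsets_per_partition = max_offsets_per_trigger // partitions  # 1000 / 40 = 25

# One Kafka partition maps to one Spark task; with 40 cores available,
# all 40 tasks can run in parallel, one partition per core.
total_cores = workers * cores_per_worker                       # 5 * 8 = 40

# Each executor (worker) handles 8 partitions' worth of offsets.
offsets_per_executor = offsets_per_partition * cores_per_worker  # 25 * 8 = 200

print(offsets_per_partition, total_cores, offsets_per_executor)  # → 25 40 200
```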

Also, newer versions of Spark add options such as minOffsetsPerTrigger, which lets Spark accumulate a bigger batch when a single trigger period didn't produce enough data to process.
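For reference, a PySpark sketch of how these rate-limit options are wired into the Kafka source (a config fragment, not runnable without a Kafka cluster; the broker address and topic name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rate-limit-demo").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
      .option("subscribe", "my-topic")                   # placeholder
      .option("maxOffsetsPerTrigger", 1000)  # cap: at most 1000 offsets/trigger
      .option("minOffsetsPerTrigger", 500)   # Spark 3.3+: wait for at least 500
      .option("maxTriggerDelay", "15m")      # ...but never wait longer than this
      .load())
```

With both limits set, a micro-batch fires once minOffsetsPerTrigger offsets are available (or maxTriggerDelay elapses), and never reads more than maxOffsetsPerTrigger.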
