
Spring Cloud Stream Kinesis Binder - Concurrency

I built a Spring Boot Kinesis consumer with the following components:

  • spring boot (version - 2.1.2.RELEASE)
  • spring cloud (version - Greenwich.RELEASE)
  • spring cloud stream kinesis binder (version - 1.1.0.RELEASE)

I consume events from a Kinesis stream with 1 shard. This Spring Boot consumer application runs on the Pivotal Cloud Foundry (PCF) platform.

I tried the scenario locally (with kinesalite) and in PCF (with a Kinesis stream) before posting this question. Can you please confirm whether my understanding is right? I went through the Spring Cloud Stream documentation ( https://docs.spring.io/spring-cloud-stream/docs/current/reference/htmlsingle/ and https://github.com/spring-cloud/spring-cloud-stream-binder-aws-kinesis/blob/master/spring-cloud-stream-binder-kinesis-docs/src/main/asciidoc/overview.adoc ). Though the documentation is extensive, concurrency and high availability are not explained in detail.

Let's say I have 3 instances of the consumer deployed to PCF (by setting the instances attribute to 3 in the manifest.yml file which is used during cf push).

All 3 instances have the following properties:

spring.cloud.stream.bindings..consumer.concurrency=5

spring.cloud.stream.bindings..group=my-consumer-group

spring.cloud.stream.kinesis.binder.checkpoint.table=my-metadata-dynamodb-table

spring.cloud.stream.kinesis.binder.locks.table=my-locks-dynamodb-table

Let's say the events were sent to kinesis by the producer in this order

event5 (most recent event in the stream) - event4 - event3 - event2 - event1 (first event in the stream)

For such a configuration, I have explained my understanding below. Can you confirm whether this is right?

  1. Only one instance of the consumer is active at any given time, and it processes all the events sent to the Kinesis stream (as the stream has only one shard). One of the other 2 instances takes over only when the primary instance goes down. This configuration ensures high availability and retains the ordering of messages.
  2. Since the number of instances is set in the manifest.yml for PCF, I need not worry about setting the spring.cloud.stream.instanceCount or spring.cloud.stream.bindings..consumer.instanceCount properties.
  3. 5 consumer threads are active (since concurrency is set to 5) when the Spring Boot consumer is started. The events are consumed in the order described above. Thread1 picks up event1. While thread1 is still processing event1, another thread picks up and starts processing the next event from the stream (thread2 processes event2, and so on). Though the order in which events are picked up is retained in this case (event1 is always picked up before event2, and so on), there is no guarantee that thread1 will finish processing event1 before thread2 finishes processing event2.
  4. When all 5 threads are busy processing the 5 events in the stream and new events, say event6 and event7, come in, the consumer has to wait for a thread to become available. Say thread3 is done processing event3 while the other threads are still busy: thread3 will pick up event6 and start processing it, but event7 is still not picked up because no thread is available.
  5. By default concurrency is set to 1. If the business requirement is to finish processing one event before picking up the next, then concurrency should be 1. This compromises throughput, since only one event can be consumed at a time. But if throughput is important and more than one event should be processed at a time, then concurrency should be set to the desired value. Increasing the number of shards is also an option, but if you as a consumer cannot request an increase, this is your best bet to achieve parallelism/throughput.

Please see the Javadoc for the concurrency option in the KinesisMessageDrivenChannelAdapter:

/**
 * The maximum number of concurrent {@link ConsumerInvoker}s running.
 * The {@link ShardConsumer}s are evenly distributed between {@link ConsumerInvoker}s.
 * Messages from within the same shard will be processed sequentially.
 * In other words each shard is tied with the particular thread.
 * By default the concurrency is unlimited and shard
 * is processed in the {@link #consumerExecutor} directly.
 * @param concurrency the concurrency maximum number
 */
public void setConcurrency(int concurrency) {

So, since you have only one shard in that one stream, there is going to be only one active thread, which iterates over the ShardIterators on that single shard.
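To make the "evenly distributed" wording of the Javadoc concrete, here is a minimal sketch of how shards might be spread over a bounded number of invokers. It is not the binder's code: the class ShardAssignmentSketch, the method assign, and the round-robin strategy are all assumptions made purely for illustration, and the "unlimited by default" case is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; NOT the actual KinesisMessageDrivenChannelAdapter logic.
class ShardAssignmentSketch {

    // Spreads shard IDs round-robin over at most `concurrency` invokers.
    // The number of invokers is capped at the shard count, because a shard
    // is never split across threads.
    static List<List<String>> assign(List<String> shardIds, int concurrency) {
        if (concurrency < 1) {
            throw new IllegalArgumentException("concurrency must be >= 1 in this sketch");
        }
        int invokers = Math.min(concurrency, shardIds.size());
        List<List<String>> assignment = new ArrayList<>();
        for (int i = 0; i < invokers; i++) {
            assignment.add(new ArrayList<>());
        }
        for (int i = 0; i < shardIds.size(); i++) {
            assignment.get(i % invokers).add(shardIds.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // One shard, concurrency=5: only one invoker has any work.
        System.out.println(assign(Arrays.asList("shard-0"), 5));
        // prints [[shard-0]]

        // Four shards, concurrency=2: each invoker handles two shards.
        System.out.println(assign(Arrays.asList("shard-0", "shard-1", "shard-2", "shard-3"), 2));
        // prints [[shard-0, shard-2], [shard-1, shard-3]]
    }
}
```

Under these assumptions, raising concurrency above the number of shards changes nothing: with a single shard, the four extra "threads" from concurrency=5 never get a shard to read.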

The point is that records from a single shard always have to be processed in a single thread. This way we guarantee proper ordering, and the checkpoint is made for the highest sequence number.
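The single-thread-per-shard rule can be illustrated with plain java.util.concurrent, without any Kinesis dependency. This is only a sketch under stated assumptions: the class SingleShardOrderingSketch and the method processShard are hypothetical names, the records are plain strings, and the single-thread executor merely stands in for the thread a shard is tied to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; it does not use the Kinesis binder or the AWS SDK.
class SingleShardOrderingSketch {

    // Handles all records of one shard on one dedicated thread and returns
    // them in the order they were actually processed.
    static List<String> processShard(List<String> records) {
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        ExecutorService shardThread = Executors.newSingleThreadExecutor();
        for (String record : records) {
            // A single-thread executor runs tasks one at a time, in submission order.
            shardThread.submit(() -> { processed.add(record); });
        }
        shardThread.shutdown();
        try {
            if (!shardThread.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("shard thread did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> shard = Arrays.asList("event1", "event2", "event3", "event4", "event5");
        List<String> processed = processShard(shard);
        System.out.println(processed);
        // prints [event1, event2, event3, event4, event5]

        // Everything before the last processed record is already done,
        // so it is a safe position to checkpoint.
        System.out.println("checkpoint at " + processed.get(processed.size() - 1));
        // prints checkpoint at event5
    }
}
```

If the same records were handed to a pool of five threads instead, event2 could finish before event1, and a checkpoint taken at event2 would skip event1 after a restart. That is the situation the one-thread-per-shard design avoids.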

Please read up more on what AWS Kinesis is and how it works.
