Akka Kafka Consumer processing rate decreases drastically when there is lag in our Kafka partitions
We are facing a scenario where the processing rate of our akka-stream-kafka consumers drops whenever there is lag. When we start them with no lag in the partitions, the processing rate increases suddenly.
MSK cluster - 10 topics - 40 partitions each => 400 total leader partitions
To achieve high throughput and parallelism in the system, we implemented akka-stream-kafka consumers that subscribe to each topic-partition separately, resulting in a 1:1 mapping between consumer and partition.
Here is the consumer setup:
So, in total we are starting 420 consumers spread across different instances. As per the RangeAssignor partition strategy (the default), each partition is assigned to a different consumer, so 400 consumers use the 400 partitions and 20 consumers remain unused. We have verified this allocation and it looks good.
Instance type used: c5.xlarge
MSK config:
Apache Kafka version - 2.4.1.1
Total number of brokers - 9 (spread across 3 AZs)
Broker type: kafka.m5.large
Brokers per zone: 3
auto.create.topics.enable = true
default.replication.factor = 3
min.insync.replicas = 2
num.io.threads = 8
num.network.threads = 5
num.partitions = 40
num.replica.fetchers = 2
replica.lag.time.max.ms = 30000
socket.receive.buffer.bytes = 102400
socket.request.max.bytes = 104857600
socket.send.buffer.bytes = 102400
unclean.leader.election.enable = true
zookeeper.session.timeout.ms = 18000
log.retention.ms = 259200000
This is the configuration we are using for each consumer:
akka.kafka.consumer {
  kafka-clients {
    bootstrap.servers = "localhost:9092"
    client.id = "consumer1"
    group.id = "consumer1"
    auto.offset.reset = "latest"
  }
  aws.glue.registry.name = "Registry1"
  aws.glue.avroRecordType = "GENERIC_RECORD"
  aws.glue.region = "region"
  kafka.value.deserializer.class = "com.amazonaws.services.schemaregistry.deserializers.avro.AWSKafkaAvroDeserializer"
  # Settings for checking the connection to the Kafka broker. Connection checking uses `listTopics` requests with the timeout
  # configured by `consumer.metadata-request-timeout`
  connection-checker {
    # Flag to turn on connection checker
    enable = true
    # Amount of attempts to be performed after a first connection failure occurs
    # Required, non-negative integer
    max-retries = 3
    # Interval for the connection check. Used as the base for exponential retry.
    check-interval = 15s
    # Check interval multiplier for backoff interval
    # Required, positive number
    backoff-factor = 2.0
  }
}
akka.kafka.committer {
  # Maximum number of messages in a single commit batch
  max-batch = 10000
  # Maximum interval between commits
  max-interval = 5s
  # Parallelism for async committing
  parallelism = 1500
  # API may change.
  # Delivery of commits to the internal actor
  # WaitForAck: Expect replies for commits, and backpressure the stream if replies do not arrive.
  # SendAndForget: Send off commits to the internal actor without expecting replies (experimental feature since 1.1)
  delivery = WaitForAck
  # API may change.
  # Controls when a `Committable` message is queued to be committed.
  # OffsetFirstObserved: When the offset of a message has been successfully produced.
  # NextOffsetObserved: When the next offset is observed.
  when = OffsetFirstObserved
}
akka.http {
  client {
    idle-timeout = 10s
  }
  host-connection-pool {
    idle-timeout = 10s
    client {
      idle-timeout = 10s
    }
  }
}
consumer.parallelism = 1500
We are using the code below to materialise the flow from Kafka to an empty sink:
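import akka.actor.ActorSystem
import akka.kafka.{CommitterSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.kafka.scaladsl.Consumer.DrainingControl
import akka.stream.{ActorAttributes, ActorMaterializer, Supervision}
import scala.concurrent.Future

// consumerSettings, committerSettings, conf and f are defined elsewhere in the application.
override implicit val actorSystem = ActorSystem("Consumer1")
override implicit val materializer = ActorMaterializer()
override implicit val ec = actorSystem.dispatcher
val topicsName = "Set of Topic Names"
val parallelism = conf.getInt("consumer.parallelism")
// Resume on any failure so that a failing element never stops the stream.
val supervisionDecider: Supervision.Decider = {
  case _ => Supervision.Resume
}
val committer = committerSettings.getOrElse(CommitterSettings(actorSystem))
val supervisionStrategy = ActorAttributes.supervisionStrategy(supervisionDecider)
Consumer
  .committableSource(consumerSettings, Subscriptions.topics(topicsName))
  .mapAsync(parallelism) { msg =>
    f(msg.record.key(), msg.record.value())
      .map(_ => msg.committableOffset)
      // Commit the offset even when processing the record failed.
      .recoverWith {
        case _ => Future.successful(msg.committableOffset)
      }
  }
  .toMat(Committer.sink(committer).withAttributes(supervisionStrategy))(DrainingControl.apply)
  .withAttributes(supervisionStrategy)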
Library versions in code:
"com.typesafe.akka" %% "akka-http" % "10.1.11",
"com.typesafe.akka" %% "akka-stream-kafka" % "2.0.3",
"com.typesafe.akka" %% "akka-stream" % "2.5.30"
The observations are as follows:
We want all the consumers to run in parallel and process the messages in real time. This lag of 3 days in processing is causing major downtime for us. I tried following https://github.com/akka/alpakka-kafka/issues/549, but we are already on a version that includes that fix.
Can anyone help us identify what we are missing in the consumer configuration, or point out some other issue?
That lag graph seems to me to indicate that your overall system isn't capable of handling all the load, and it almost looks like only one partition at a time is actually making progress.
That phenomenon indicates to me that the processing being done in f is ultimately gating on the rate at which some queue can be cleared, and that the parallelism in the mapAsync stage is too high, effectively racing the partitions against each other. Since the Kafka consumer batches records (by default in batches of 500, assuming that the consumer's lag is more than 500 records), if the parallelism is higher than that, all of those records enter the queue at basically the same time as a block. It looks like the parallelism in the mapAsync is 1500; given the apparent use of the Kafka default batch size of 500, this seems way too high: there's no reason for it to be greater than the Kafka batch size, and if you want a more even consumption rate between partitions, it should be a lot less than that batch size.
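For reference, the batch size discussed here is governed by the Kafka consumer property `max.poll.records`, whose default is 500. A minimal sketch of where it would sit in the question's configuration, assuming that properties under `kafka-clients` are passed through to the underlying Kafka consumer; the value is just the Kafka default written out explicitly, not a recommendation:

```hocon
akka.kafka.consumer {
  kafka-clients {
    # Upper bound on the number of records returned by a single poll (Kafka default: 500).
    max.poll.records = 500
  }
}
```

With the value written out, `consumer.parallelism` (the question's own key, which feeds `mapAsync`) can be compared against it directly; keeping it well below `max.poll.records` is what should even out consumption across partitions.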
Without details on what happens in f, it's hard to say what that queue is and how much parallelism should be reduced. But there are some general guidelines I can share:
Since you're including akka-http, it's possible that the processing of messages in f involves sending an HTTP request to some other service. In that case, it's important to remember that Akka HTTP maintains a queue per targeted host; it's also likely that there's a queue on the target side which governs throughput there. This is somewhat a special case of the second (I/O bound) situation.
The I/O bound/blocking situation will be evidenced by very low CPU utilization on your instances. If you're filling the queue per targeted host, you'll see log messages about "Exceeded configured max-open-requests value".
Another thing worth noting is that because the Kafka consumer is inherently blocking, the Alpakka Kafka consumer actors run in their own dispatcher, whose size is 16 by default, meaning that per host at most 16 consumers or producers can be working at a time. Setting akka.kafka.default-dispatcher.thread-pool-executor.fixed-pool-size to at least the number of consumers your app starts up (42 in your configuration of 6 consumers for each of 7 topics) is probably a good idea. Thread starvation in the Alpakka Kafka dispatcher can cause consumer rebalances, which will disrupt consumption.
Without making any other changes, I would suggest, for a more even consumption rate across partitions, setting:
akka.kafka.default-dispatcher.thread-pool-executor.fixed-pool-size = 42
consumer.parallelism = 50