
Tracking down high latency in Kafka

I have Kafka set up in an admittedly slow configuration, but I'm still not expecting the numbers I'm seeing.

I have the cluster set to LogAppendTime, so I'm measuring the time between when the event is written to Kafka (as decided by the broker) and when it is received by the service. Both the brokers and the application are "co-located", so the ping time between the servers is low and the clocks should be synced, or close to it.
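For reference, the measurement on the consumer side is roughly the following (a minimal Java sketch, assuming a KafkaConsumer<String, String> named consumer and that the topic uses LogAppendTime timestamps; it is not the exact code from my service):

    import java.time.Duration;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;

    // With log.message.timestamp.type=LogAppendTime, record.timestamp() is the
    // broker append time, so this measures broker-to-consumer latency
    // (assuming the consumer host's clock is in sync with the brokers).
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        long latencyMs = System.currentTimeMillis() - record.timestamp();
        System.out.printf("partition=%d offset=%d latency=%dms%n",
                record.partition(), record.offset(), latencyMs);
    }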

I am seeing latencies between 2ms and 600ms, and a lot of them are 250ms+. The massive difference makes me think something is up with my setup. It also varies between consumer groups.

Kafka v2.7.0 x 4 brokers

Key broker properties:

default.replication.factor = 4
min.insync.replicas = 2
num.partitions = 50
offsets.topic.num.partitions = 50
offsets.topic.replication.factor = 4
transaction.state.log.min.isr = 2
transaction.state.log.num.partitions = 50
transaction.state.log.replication.factor = 4

Key consumer properties:

fetch.max.wait.ms = 500
fetch.min.bytes = 1
isolation.level = read_committed

Key producer properties:

enable.idempotence = true
linger.ms = 0
transactional.id = <id>
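For concreteness, this is roughly how those client properties are applied in the Java clients (a sketch only: the bootstrap servers, group id, transactional id and String serdes are placeholders, not taken from my actual setup; enable.auto.commit is set to false here because offsets are committed through the transaction instead):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    Properties consumerProps = new Properties();
    consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
    consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // placeholder
    consumerProps.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");
    consumerProps.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1");
    consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
    consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // offsets go through the transaction
    consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);

    Properties producerProps = new Properties();
    producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
    producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
    producerProps.put(ProducerConfig.LINGER_MS_CONFIG, "0");
    producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-tx-id");   // placeholder for <id>
    producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);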

I'm using a transactional producer to commit the offsets with producer.sendOffsetsToTransaction().

There are many consumer groups, all of which are transactional and operate in the same way: read events, then commit new events along with the new offsets in a single transaction.
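The read-then-transactionally-write loop looks roughly like this (a sketch using the Java client, with placeholder topic names and the processing step elided; error handling is reduced to aborting the transaction):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.TopicPartition;

    producer.initTransactions();
    consumer.subscribe(Collections.singletonList("input-topic"));   // placeholder topic

    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
        if (records.isEmpty()) {
            continue;
        }
        producer.beginTransaction();
        try {
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                // ... process the event and produce the resulting event(s) ...
                producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
            }
            // Commit the consumed offsets atomically with the produced events.
            producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
            producer.commitTransaction();
        } catch (KafkaException e) {
            producer.abortTransaction();
        }
    }

Note that with isolation.level = read_committed, downstream consumer groups only see the produced events once commitTransaction() completes, so the observed end-to-end latency includes the transaction commit itself.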

Are there any settings I'm missing? I know Kafka isn't geared towards low latency, but I want to set it up to achieve the lowest latency I can... hopefully < 20ms.

EDIT

I've also tried with these settings:

default.replication.factor = 2
min.insync.replicas = 1
num.partitions = 50
offsets.topic.num.partitions = 50
offsets.topic.replication.factor = 2
transaction.state.log.min.isr = 1
transaction.state.log.num.partitions = 50
transaction.state.log.replication.factor = 2

This was without using transactions, and with

enable.idempotence = false

You have num.partitions = 50 and offsets.topic.num.partitions = 50, and the number of brokers is 4. I believe that is causing an issue: your cluster is spending a lot of time replicating and fetching replicas, so time is spent in network communication. The I/O threads would also be limited, causing issues.

Reduce the number of partitions if possible. Is there a reason to have 50 partitions on a 4-node cluster?

Suggest you play with the following settings:

  • num.network.threads
  • queued.max.requests
  • num.io.threads
  • num.replica.fetchers
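For reference, the Kafka 2.7 defaults for those broker settings are the following; treat them as a starting point to benchmark against rather than recommended values:

num.network.threads = 3
num.io.threads = 8
queued.max.requests = 500
num.replica.fetchers = 1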
