Kafka Producer NetworkException and Timeout Exceptions

Question

We are getting random NetworkExceptions and TimeoutExceptions in our production environment:

Brokers: 3
Zookeepers: 3
Servers: 3
Kafka: 0.10.0.1
Zookeeper: 3.4.3

We are occasionally getting this exception in our producer logs:

Expiring 10 record(s) for TOPIC:XXXXXX: 5608 ms has passed since batch creation plus linger time.

The number of milliseconds in these error messages keeps changing. Sometimes it is ~5 seconds, other times it is up to ~13 seconds.

And very rarely we get:

NetworkException: Server disconnected before response received.

The cluster consists of 3 brokers and 3 ZooKeeper nodes. The producer server and the Kafka cluster are in the same network.

I am making synchronous calls. Multiple user requests call a web service to send their data. That web service has a single Producer object which does all the sending. The producer's request timeout was initially 1000 ms and has been changed to 15000 ms (15 seconds). Even after increasing the timeout, TimeoutExceptions still show up in the error logs.

What could be the reason?

Answer 1

It is a bit tricky to find the root cause, so I'll share my experience; hopefully someone finds it useful. In general it can be a network issue, or too much network traffic in combination with acks=all. The diagram in Kafka KIP-91 explains the TimeoutException (at the time of writing it still applies up to 1.1.0).

Excluding network configuration issues or errors, these are the properties you can adjust, depending on your scenario, to mitigate or solve the problem:

buffer.memory controls the total memory available to the producer for buffering. If records are sent faster than they can be transmitted to Kafka, the buffer fills up; additional send calls then block for up to max.block.ms, after which the producer throws a TimeoutException.
max.block.ms already has a high default value and I do not suggest increasing it further. buffer.memory defaults to 32MB; depending on your message size you may want to increase it, and if necessary increase the JVM heap size as well.
retries defines how many times the producer tries to resend a record after an error before giving up. If you are using zero retries, you can try to mitigate the problem by increasing this value. Beware that record order is no longer guaranteed unless you set max.in.flight.requests.per.connection to 1.
Records are sent as soon as the batch size is reached or the linger time has passed, whichever comes first. If batch.size (default 16kb) is smaller than the maximum request size, perhaps you should use a higher value. In addition, change linger.ms to a higher value such as 10, 50 or 100 to make better use of batching. This puts fewer requests on the network and improves compression if you are using it.
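
To make the settings above concrete, here is a minimal sketch of how they could be set for a Java producer. It only builds a java.util.Properties object with the config keys written as plain strings, so it runs without the Kafka client on the classpath. The class name, the method name, the broker addresses and every value are illustrative assumptions, not recommendations; in a real application the result would be passed to the KafkaProducer constructor.

```java
import java.util.Properties;

public class ProducerTuningSketch {

    // Builds producer settings for the properties discussed above.
    // All values are hypothetical starting points for experimentation.
    static Properties tunedProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");
        // Double the 32MB default buffer; the JVM heap must be large enough for this.
        props.put("buffer.memory", String.valueOf(64L * 1024 * 1024));
        // Retry transient failures instead of giving up on the first error.
        props.put("retries", "3");
        // Keep record order intact while retries are enabled.
        props.put("max.in.flight.requests.per.connection", "1");
        // Larger batches plus a short linger time mean fewer, fuller requests.
        props.put("batch.size", String.valueOf(64 * 1024));
        props.put("linger.ms", "50");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker addresses.
        Properties props = tunedProducerProps("broker1:9092,broker2:9092,broker3:9092");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Setting max.in.flight.requests.per.connection to 1 trades some throughput for ordering, so it is only worth it if your consumers depend on record order.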

There is no exact answer to this kind of issue, since it also depends on the implementation. In my case, experimenting with the values above helped.

Answer 2

We faced a similar problem: many NetworkExceptions in the logs and, from time to time, a TimeoutException.

Cause

Once we gathered TCP logs from production, it turned out that some of the TCP connections to the Kafka brokers (we have 3 broker nodes) were dropped without notifying the clients after about 5 minutes of being idle (no FIN flags on the TCP layer). When a client tried to reuse such a connection after that time, an RST flag was returned. We could easily match those connection resets in the TCP logs with the NetworkExceptions in the application logs.

As for the TimeoutException, we could not do the same matching, because by the time we found the cause this type of error was no longer occurring. However, we confirmed in a separate test that dropping a TCP connection can also result in a TimeoutException. I guess this is because the Java Kafka client uses a Java NIO SocketChannel under the hood. All messages are buffered and then dispatched once the connection is ready. If the connection is not ready within the timeout (30 seconds), the messages expire, resulting in a TimeoutException.

Solution

For us the fix was to reduce connections.max.idle.ms on our clients to 4 minutes. Once we applied it, NetworkExceptions were gone from our logs.
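
As a rough sketch of that change, the snippet below sets connections.max.idle.ms to 4 minutes on a set of client properties and checks that the client would close an idle connection before the roughly 5 minute cutoff we observed. The class and method names are made up for illustration, and the 300000 ms cutoff is just our observation expressed in milliseconds; measure your own network before picking a value.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

public class IdleConnectionSketch {

    // Copies the base settings and makes the client close connections
    // that have been idle for the given number of minutes.
    static Properties withMaxIdle(Properties base, long idleMinutes) {
        Properties props = new Properties();
        props.putAll(base);
        props.put("connections.max.idle.ms",
                String.valueOf(TimeUnit.MINUTES.toMillis(idleMinutes)));
        return props;
    }

    // True when the client gives up an idle connection before the network
    // would silently drop it after cutoffMs of inactivity.
    static boolean closesBeforeCutoff(Properties props, long cutoffMs) {
        long idleMs = Long.parseLong(props.getProperty("connections.max.idle.ms"));
        return idleMs < cutoffMs;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.put("bootstrap.servers", "broker1:9092"); // hypothetical address

        Properties props = withMaxIdle(base, 4);
        long observedCutoffMs = TimeUnit.MINUTES.toMillis(5); // assumption from our TCP logs

        System.out.println("connections.max.idle.ms=" + props.getProperty("connections.max.idle.ms"));
        System.out.println("closes before cutoff: " + closesBeforeCutoff(props, observedCutoffMs));
    }
}
```

The point of the check is that the client, not the network, should be the side that closes an idle connection, so the next send opens a fresh one instead of hitting a reset.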

We are still investigating what is dropping the connections.

Edit

The cause of the problem was the AWS NAT Gateway, which was dropping idle outgoing connections after 350 seconds.

https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-troubleshooting.html#nat-gateway-troubleshooting-timeout

Answer 3

Solution 1

Modify

listeners=PLAINTEXT://hostname:9092

property in the server.properties file to

listeners=PLAINTEXT://0.0.0.0:9092

Solution 2

Change the broker.id to a value like 1001. You can change the broker ID by setting the environment variable KAFKA_BROKER_ID.

You'll have to set the environment variable KAFKA_RESERVED_BROKER_MAX_ID to something like 1001 to be allowed to set the broker ID to 1001.

I hope it can help

Answer 4

Increase request.timeout.ms and the producer's retries.
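
A minimal sketch of what that could look like in Java, using only java.util.Properties with the config keys as plain strings. The class name, method name, broker address and the values 30000 and 5 are assumptions chosen for illustration.

```java
import java.util.Properties;

public class TimeoutAndRetriesSketch {

    // Copies the base settings and sets request.timeout.ms and retries on top of them.
    static Properties withTimeoutAndRetries(Properties base, int requestTimeoutMs, int retries) {
        Properties props = new Properties();
        props.putAll(base);
        props.put("request.timeout.ms", Integer.toString(requestTimeoutMs));
        props.put("retries", Integer.toString(retries));
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.put("bootstrap.servers", "broker1:9092"); // hypothetical address

        // Hypothetical values: 30 second request timeout, 5 retries.
        Properties props = withTimeoutAndRetries(base, 30000, 5);
        System.out.println("request.timeout.ms=" + props.getProperty("request.timeout.ms"));
        System.out.println("retries=" + props.getProperty("retries"));
    }
}
```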

Answer 5

We tried everything, but had no luck.

Decreased the producer batch size and increased request.timeout.ms.
Restarted the target Kafka cluster; still no luck.
Checked replication on the target Kafka cluster; that was working fine as well.
Added retries and retry.backoff.ms to the producer properties.
Added linger.ms to the Kafka producer properties as well.

Finally, in our case the issue was with the Kafka cluster itself: from 2 of the servers we were intermittently unable to fetch metadata.

When we changed the target Kafka cluster to our dev box, it worked fine.

solution1
20 2018-04-30 09:06:23

solution2
8 2019-08-19 08:49:51

solution3
0 2017-11-10 16:58:14

solution4
0 2018-06-13 15:44:26

solution5
0 2019-11-27 11:06:46