
Kafka Producer NetworkException and Timeout Exceptions

We are getting random NetworkExceptions and TimeoutExceptions in our production environment:

Brokers: 3
Zookeepers: 3
Servers: 3
Kafka: 0.10.0.1
Zookeeper: 3.4.3

We are occasionally getting this exception in our producer logs:

Expiring 10 record(s) for TOPIC:XXXXXX: 5608 ms has passed since batch creation plus linger time.

The number of milliseconds in such error messages keeps changing. Sometimes it's ~5 seconds, other times it's up to ~13 seconds!

And very rarely we get:

NetworkException: Server disconnected before response received. 

The cluster consists of 3 brokers and 3 zookeepers. The producer server and the Kafka cluster are on the same network.

I am making synchronous calls. There is a web service that multiple user requests call to send their data. The Kafka web service has one Producer object which does all the sending. The producer's request timeout was initially 1000 ms and has since been changed to 15000 ms (15 seconds). Even after increasing the timeout period, TimeoutExceptions still show up in the error logs.
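For illustration, here is a minimal sketch of the setup described above (one shared producer, synchronous sends, request.timeout.ms raised to 15000 ms). The class name, bootstrap servers, serializers and String key/value types are assumptions, not details from the original question:

    import java.util.Properties;
    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class KafkaSender {
        // A single producer instance shared by all web-service requests, as described above.
        private static final Producer<String, String> PRODUCER = buildProducer();

        private static Producer<String, String> buildProducer() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // placeholder hosts
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("request.timeout.ms", 15000); // raised from the initial 1000 ms
            return new KafkaProducer<>(props);
        }

        // Synchronous send: block until the broker acknowledges or the request fails/times out.
        public static RecordMetadata sendSync(String topic, String key, String value)
                throws ExecutionException, InterruptedException {
            return PRODUCER.send(new ProducerRecord<>(topic, key, value)).get();
        }
    }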

What can be the reason?

It is a bit tricky to find the root cause, so I'll share my experience, hopefully someone finds it useful. In general it can be a network issue, or too much network flood in combination with acks=all. Here is a diagram from Kafka KIP-91 that explains the TimeoutException at the time of writing (still applicable up to 1.1.0):

[Diagram from KIP-91 showing where the producer TimeoutException can occur]

Excluding network configuration issues or errors, these are the properties you can adjust, depending on your scenario, in order to mitigate or solve the problem:

  • The buffer.memory controls the total memory available to the producer for buffering. If records are produced faster than they can be transmitted to Kafka, this buffer is exhausted, additional send calls block for up to max.block.ms, and after that the producer throws a TimeoutException.

  • The max.block.ms already has a high value and I do not suggest increasing it further. buffer.memory has a default value of 32 MB; depending on your message size you may want to increase it, and if necessary increase the JVM heap space.

  • Retries define how many attempts are made to resend the record on error before giving up. If you are using zero retries, you can try to mitigate the problem by increasing this value; beware that record order is no longer guaranteed unless you set max.in.flight.requests.per.connection to 1.

  • Records are sent as soon as the batch size is reached or the linger time has passed, whichever comes first. If batch.size (default 16 KB) is smaller than the maximum request size, perhaps you should use a higher value. In addition, change linger.ms to a higher value such as 10, 50 or 100 to optimize the use of batching and compression. This causes less flood on the network and, if you are using compression, makes it more effective. A combined configuration sketch follows this list.
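As a hedged illustration of the knobs above, the sketch below sets them together on a plain Java producer. The concrete values and broker addresses are examples only, not recommendations from this answer, and need tuning for your own message sizes and traffic:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;

    public class TunedProducerFactory {
        public static Producer<String, String> create() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // placeholder hosts
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");                              // strongest durability, more network round trips
            props.put("buffer.memory", 67108864L);                 // 64 MB instead of the 32 MB default
            props.put("max.block.ms", 60000);                      // keep the default rather than raising it further
            props.put("retries", 3);                               // retry transient errors instead of failing immediately
            props.put("max.in.flight.requests.per.connection", 1); // preserve record order when retries > 0
            props.put("batch.size", 65536);                        // 64 KB batches instead of the 16 KB default
            props.put("linger.ms", 50);                            // wait up to 50 ms so batches can fill up
            return new KafkaProducer<>(props);
        }
    }

A higher linger.ms trades a little latency for fewer, larger requests on the wire.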

There is no exact answer to this kind of issue, since it also depends on the implementation; in my case, experimenting with the values above helped.

We have faced a similar problem: many NetworkExceptions in the logs and, from time to time, a TimeoutException.

Cause

Once we gathered TCP logs from production, it turned out that some of the TCP connections to the Kafka brokers (we have 3 broker nodes) were being dropped without notifying clients after about 5 minutes of being idle (no FIN flags on the TCP layer). When a client tried to reuse such a connection after that time, an RST flag was returned. We could easily match those connection resets in the TCP logs with NetworkExceptions in the application logs.

As for the TimeoutException, we could not do the same matching because, by the time we found the cause, this type of error was no longer occurring. However, we confirmed in a separate test that dropping a TCP connection could also result in a TimeoutException. I guess this is because the Java Kafka client uses a Java NIO SocketChannel under the hood: all messages are buffered and then dispatched once the connection is ready. If the connection is not ready within the timeout (30 seconds), the messages expire, resulting in a TimeoutException.
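For reference, a hedged sketch of how these two failure modes surface to a synchronous caller, which is what we matched against the TCP logs; the helper class and logging are illustrative only, not our production code:

    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.NetworkException;
    import org.apache.kafka.common.errors.TimeoutException;

    public class SendDiagnostics {
        // With send(...).get(), both errors arrive wrapped in an ExecutionException.
        public static void sendOrExplain(Producer<String, String> producer,
                                         ProducerRecord<String, String> record) throws InterruptedException {
            try {
                producer.send(record).get();
            } catch (ExecutionException e) {
                Throwable cause = e.getCause();
                if (cause instanceof NetworkException) {
                    // Connection reset (e.g. RST on a silently dropped idle connection) before a response arrived.
                    System.err.println("NetworkException: " + cause.getMessage());
                } else if (cause instanceof TimeoutException) {
                    // Batch expired: the connection never became usable within the timeout.
                    System.err.println("TimeoutException: " + cause.getMessage());
                } else {
                    System.err.println("Send failed: " + cause);
                }
            }
        }
    }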

Solution

For us the fix was to reduce connections.max.idle.ms on our clients to 4 minutes. Once we applied it, the NetworkExceptions were gone from our logs.
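A minimal sketch of that change, assuming the same plain Java producer properties as elsewhere in this thread; 240000 ms is 4 minutes, below the idle cutoff we observed in the TCP logs:

    import java.util.Properties;

    public class IdleTimeoutConfig {
        // Keep the client-side idle timeout below the point at which the network silently
        // drops idle connections, so the producer closes and re-opens them itself.
        public static Properties withShortIdleTimeout(Properties props) {
            props.put("connections.max.idle.ms", 240000); // 4 minutes (the client default is 9 minutes)
            return props;
        }
    }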

We are still investigating what is dropping the connections.我们仍在调查断开连接的原因。

Edit

The cause of the problem was the AWS NAT Gateway, which was dropping idle outgoing connections after 350 seconds.

https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-troubleshooting.html#nat-gateway-troubleshooting-timeout

Solution 1

Modify the

listeners=PLAINTEXT://hostname:9092

property in the server.properties file to

listeners=PLAINTEXT://0.0.0.0:9092

Solution 2

Change the broker.id to a value like 1001, for example by setting the environment variable KAFKA_BROKER_ID.

You'll have to set the environment variable KAFKA_RESERVED_BROKER_MAX_ID to something like 1001 to be allowed to set the broker id to 1001.

I hope it helps.

Increase request.timeout.ms and the producer's retries.

We tried everything, but no luck.

  1. Decreased the producer batch size and increased request.timeout.ms.
  2. Restarted the target Kafka cluster; still no luck.
  3. Checked replication on the target Kafka cluster; that was working fine as well.
  4. Added retries and retry.backoff.ms to the producer properties.
  5. Added linger.ms to the producer properties as well.

Finally, in our case the issue was with the Kafka cluster itself: we were unable to fetch metadata between 2 of the servers.

When we changed the target Kafka cluster to our dev box, it worked fine.
