
Netty failing to read bytes from server when write fails due to socket close by server

Original post: 2013-10-14 20:03:57. Tags: java, sockets, netty

Netty version: 4.0.10.Final

I've written a client and a server using Netty. Here is what each of them does.

Server:

1. Wait for a connection from the client.
2. Receive messages from the client.
3. If a message is bad, write an error message (6 bytes), flush it, close the socket, and do not read any unread messages left in the socket. Otherwise continue reading messages. Do nothing with good messages.

Client:

1. Connect to the server.
2. After writing N good messages, write one bad message and continue writing M good messages. This happens in a separate thread, which is started after the channel becomes active.
3. If there is any response from the server, log it and close the socket. (Note that the server responds only when there is an error.)

I've straced both the client and the server. The server closes the connection after writing the error message; the close is done in a listener, so it happens only after the write operation completes. The client starts seeing broken pipe errors when it writes good messages after the bad message, because by then the server has detected the bad message, responded with the error message, and closed the socket.

The client does not always read the error message from the server. Earlier, step (2) of the client ran in the I/O thread, and over K experiments the client received the error message in fewer than 10% of runs. After I moved step (2) to a separate thread, that figure rose to about 70%. Either way it is not reliable. Does Netty trigger a channel read if the write fails due to a broken pipe?

Update 1: I'm answering the questions and clarifications asked in the comments here, so that everybody can find them in one place.

"You're writing a bad message that will cause a reset, followed by good messages that you already know won't get through, and trying to read a response that may have been thrown away. It doesn't make any sense to me whatsoever" - from EJP

-- In the real world, the server could treat a message as bad for a reason the client cannot know in advance. To simplify, I said that the client intentionally sends a bad message that causes a reset from the server. I would like to send all the good messages even if there are bad messages among them.

What I'm doing is similar to the protocol implemented by the Apple Push Notification Service.

2 answers

If a message is bad, write error message (6 bytes), flush it, close the socket and do not read any unread messages in the socket. Otherwise continue reading messages.

That will cause a connection reset, which the client will see as a broken pipe on Unix, Linux, etc.

After writing N good messages, write one bad message and continue writing M good messages.

That will encounter the broken pipe error just mentioned.

This process happens in a separate thread.

Why? The whole point of NIO, and therefore of Netty, is that you don't need extra threads.

I've found that server is closing connection after writing the error message.

Well, that's what you said it does, so it does it.

Client began seeing broken pipe errors when writing good messages after the bad message.

As I said.

This is because server detected bad message and responded with error message and closed socket.

Correct.

Client is not reading error message from server always.

That is due to the connection reset. Delivery of pending data ceases after a reset.

Does netty trigger channel read if the write fails due to broken pipe?

No, it triggers a read when data or EOS (end of stream) arrives.

However, your bizarre system design/protocol is making that unpredictable, if not impossible. You're writing a bad message that will cause a reset, followed by good messages that you already know won't get through, and trying to read a response that may have been thrown away. It doesn't make any sense to me whatsoever. What are you trying to prove here?

Try a request-response protocol like everybody else.
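To make the request-response suggestion concrete, here is a minimal sketch using plain `java.net` sockets rather than Netty. Everything in it is an assumption made for illustration: the one-byte messages, the `0xFF` marker for a bad message, the `OK`/`BAD` status codes, and the class and method names. The point it shows is that when the client waits for a status after every message, the server never closes a socket with unread data in it, so the error reply is not lost to a reset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class RequestResponseDemo {
    static final int OK = 0;   // hypothetical status: message accepted
    static final int BAD = 1;  // hypothetical status: message rejected

    /** Server side: answers every one-byte message, closes after the first bad one. */
    static void serve(ServerSocket listener) {
        try (Socket socket = listener.accept()) {
            InputStream in = socket.getInputStream();
            OutputStream out = socket.getOutputStream();
            while (true) {
                int message = in.read();
                if (message < 0) return;            // client closed the connection
                boolean good = message != 0xFF;     // assumption: 0xFF marks a bad message
                out.write(good ? OK : BAD);
                out.flush();
                if (!good) return;                  // nothing unread is pending, so the close is orderly
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Client side: returns the index of the first rejected message, or -1 if all were accepted. */
    static int sendAll(int port, int[] messages) {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port)) {
            InputStream in = socket.getInputStream();
            OutputStream out = socket.getOutputStream();
            for (int i = 0; i < messages.length; i++) {
                out.write(messages[i]);
                out.flush();
                if (in.read() != OK) return i;      // wait for the status before sending more
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs one server and one client on the loopback interface. */
    static int run(int[] messages) {
        try (ServerSocket listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread server = new Thread(() -> serve(listener));
            server.start();
            int firstRejected = sendAll(listener.getLocalPort(), messages);
            server.join();
            return firstRejected;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(new int[] {1, 2, 0xFF, 3, 4}));
    }
}
```

The cost of this design is one network round trip per message, which is the trade-off a pipelined protocol is trying to avoid.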

The APN protocol appears to be quite awkward because it does not acknowledge successful receipt of a notification. Instead, it only tells you which notifications it has successfully received when it encounters an error. The protocol works on the assumption that you will generally send well-formed notifications.

I would suggest that you need some sort of expiring cache (a LinkedHashMap might work here), and that you use the opaque identifier field in the notification as a globally unique, ordered value. A sequence number will work (but you'll need to persist it if your client can be restarted).

Every time you generate an APN:

1. Set its identifier to the next sequence number.
2. Send it.
3. Place it in the LinkedHashMap with a string key made of the sequence number concatenated with the current time (e.g. String key = sequenceNumber + "-" + System.currentTimeMillis()).
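The bookkeeping in these steps could be sketched as follows. The class name `SentTracker` and its methods are hypothetical, the payload is an opaque `byte[]`, and the actual network write (step 2) is left to the caller; only the key format comes from the text above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker for notifications that were sent but could still be rejected. */
class SentTracker {
    // LinkedHashMap keeps insertion order, so iteration follows send order.
    private final Map<String, byte[]> sent = new LinkedHashMap<>();

    // In a client that can be restarted, this counter would have to be persisted.
    private long nextSequence = 1;

    /** Assigns the next sequence number, records the payload, and returns the identifier. */
    synchronized long record(byte[] payload, long nowMillis) {
        long sequenceNumber = nextSequence++;
        String key = sequenceNumber + "-" + nowMillis; // "<sequence>-<send time>"
        sent.put(key, payload);
        return sequenceNumber;
    }

    /** Snapshot of the keys in send order. */
    synchronized List<String> keys() {
        return new ArrayList<>(sent.keySet());
    }
}
```

A caller would pass `System.currentTimeMillis()` as `nowMillis`; taking the time as a parameter just keeps the sketch easy to test.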

If you receive an error, you need to reopen the connection and resend all the APNs in the map that have a sequence number higher than the identifier reported in the error. This is relatively easy. Just iterate through the map, removing any APN with a sequence number at or below the one reported. Then resend the remaining APNs in order, replacing them in the map with the current time (i.e. you remove an APN when you resend it, then re-insert it into the map with the new current time).
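A minimal sketch of that resend step, assuming the map is a `LinkedHashMap<String, byte[]>` keyed by `"<sequence>-<send time>"` as described earlier. `ResendPlanner` and its method names are hypothetical, and reopening the connection and writing the payloads are left out. Survivors are collected first and re-inserted afterwards, because modifying a `LinkedHashMap` while iterating over it throws `ConcurrentModificationException`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ResendPlanner {
    /** Parses the sequence number out of a "<sequence>-<millis>" key. */
    static long sequenceOf(String key) {
        return Long.parseLong(key.substring(0, key.indexOf('-')));
    }

    /**
     * Keeps only entries with a sequence number above reportedId, re-keys them
     * with the new send time, and returns their payloads in send order.
     */
    static List<byte[]> planResend(LinkedHashMap<String, byte[]> sent, long reportedId, long nowMillis) {
        List<Map.Entry<Long, byte[]>> survivors = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : sent.entrySet()) {
            long sequence = sequenceOf(entry.getKey());
            if (sequence > reportedId) {
                survivors.add(new AbstractMap.SimpleEntry<Long, byte[]>(sequence, entry.getValue()));
            }
        }
        // Every entry is either dropped or re-keyed, so the map can be rebuilt from scratch.
        sent.clear();
        List<byte[]> toResend = new ArrayList<>();
        for (Map.Entry<Long, byte[]> survivor : survivors) {
            sent.put(survivor.getKey() + "-" + nowMillis, survivor.getValue());
            toResend.add(survivor.getValue());
        }
        return toResend;
    }
}
```

For example, with sequence numbers 1 to 4 in the map and an error reporting identifier 2, only 3 and 4 remain in the map and are returned for resending.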

You'll need to periodically purge old entries from the map. You need to determine what a reasonable length of time is, based on how long it takes the APN service to return an error when you send a malformed APN. I suspect it'll be a matter of seconds (if not much quicker). If, for example, you're sending 10 APNs per second, and you know that the APN server will definitely respond within 30 seconds, then a 30-second expiry time, purging every second, might be appropriate. Just iterate along the map, removing any element whose key has a time section that is less than System.currentTimeMillis() - 30000 (for a 30-second expiry time). You'll need to synchronize threads appropriately.
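The purge could look roughly like this, again assuming `"<sequence>-<send time>"` keys in a `LinkedHashMap<String, byte[]>`. `ExpiryPurger` is a hypothetical name, and locking on the map itself is only one possible way to do the synchronization; it works only if every other thread that touches the map takes the same lock. With the figures from the example (10 APNs per second, 30-second expiry), the map would hold roughly 300 entries at a time.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

class ExpiryPurger {
    /** Parses the send time out of a "<sequence>-<millis>" key. */
    static long sentAtOf(String key) {
        return Long.parseLong(key.substring(key.indexOf('-') + 1));
    }

    /** Removes entries sent before (nowMillis - expiryMillis); returns how many were removed. */
    static int purge(LinkedHashMap<String, byte[]> sent, long nowMillis, long expiryMillis) {
        long cutoff = nowMillis - expiryMillis;
        int removed = 0;
        synchronized (sent) {
            // Iterator.remove is the safe way to delete while iterating.
            Iterator<String> keys = sent.keySet().iterator();
            while (keys.hasNext()) {
                if (sentAtOf(keys.next()) < cutoff) {
                    keys.remove();
                    removed++;
                }
            }
        }
        return removed;
    }
}
```

A scheduled task could call `ExpiryPurger.purge(sent, System.currentTimeMillis(), 30000)` once per second.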

I would catch any IOExceptions caused by writing, place the APN you were attempting to write in the map, and resend it.

What you cannot cope with is a genuine network error, where you do not know whether the APN service received the notification (or a bunch of notifications). You'll have to decide, based on what your service does, whether to resend the affected APNs immediately, after some time period, or not at all. If you resend after a time period, you'll want to give them new sequence numbers at the point you send them. This will allow you to send new APNs in the meantime.