
Asynchronous server stopping getting data from client with no visible reason

I have a problem with a client-server application. As I've almost run out of sane ideas for solving it, I am asking for help. I've stumbled into the situation described below three or four times now. The data provided here is from the last failure, when I had turned on all possible logging, message dumping and so on.

System description
1) Client. Runs under Windows. Judging from its logs, I assume there is no problem with its work.
2) Server. Runs under Linux (RHEL 5). This is the side where I have the problem.
3) Two connections are maintained between client and server: one for commands and one for sending data. Both work asynchronously. Both connections live in one thread and on one boost::asio::io_service.
4) The data sent from client to server consists of messages delimited by '\0'.
5) The data load is about 50 MB/hour, 24 hours a day.
6) Data is read on the server side using boost::asio::async_read_until with the corresponding delimiter (roughly as in the sketch below).
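
For reference, the read path in 3), 4) and 6) looks roughly like this. It is a minimal sketch, not the actual code; the class and member names (DataConnection, socket_, buffer_) are assumptions.

```cpp
#include <boost/asio.hpp>
#include <iostream>
#include <memory>
#include <string>
#include <utility>

// Minimal sketch of the data-connection read loop: messages are delimited by
// '\0' and each completed read re-arms the next async_read_until. Assumes the
// connection object is owned by a std::shared_ptr.
class DataConnection : public std::enable_shared_from_this<DataConnection> {
public:
    explicit DataConnection(boost::asio::ip::tcp::socket socket)
        : socket_(std::move(socket)) {}

    void start() { read_next(); }

private:
    void read_next() {
        auto self = shared_from_this();
        boost::asio::async_read_until(socket_, buffer_, '\0',
            [this, self](const boost::system::error_code& ec, std::size_t /*bytes*/) {
                if (ec) {
                    std::cerr << "read error: " << ec.message() << "\n";
                    return;  // no further reads are queued after an error
                }
                // Extract one message (up to, but not including, the '\0');
                // any extra data stays in buffer_ for the next read.
                std::istream is(&buffer_);
                std::string message;
                std::getline(is, message, '\0');
                handle_message(message);
                read_next();  // queue the next read
            });
    }

    void handle_message(const std::string& message) {
        // Application-specific processing / logging goes here.
        (void)message;
    }

    boost::asio::ip::tcp::socket socket_;
    boost::asio::streambuf buffer_;
};
```

The point relevant to the problem is that all of these handlers run on the single thread driving the io_service, so anything that blocks that thread also stops the reads.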

Problem
- For two days the system worked as expected.
- On the third day at 18:55 the server read one last message from the client and then stopped reading them. There was no info in the logs about new data.
- From 18:55 to 09:00 (14 hours) the client reported no errors. So it sent data (about 700 MB) successfully and no errors arose.
- At 08:30 I started investigating the problem. The server process was alive, and both connections between server and client were alive too.
- At 09:00 I attached to the server process with gdb. The server was in a sleeping state, waiting for some signal from the system. I believe I accidentally hit Ctrl + C and there may have been some message.
- Later in the logs I found a message saying something like 'system call interrupted'. After that both connections to the client were dropped. The client reconnected and the server started to work normally.
- The first message then processed by the server was timestamped at 18:57 on the client side. So after resuming normal work, the server did not drop the messages sent up to 09:00; they had been stored somewhere and it processed them accordingly.

Things I've tried
- I simulated the scenario above. Since the server dumps all incoming messages, I wrote a small script which presented itself as the client and sent all the messages back to the server again (roughly along the lines of the sketch after this list). The server died with an out of memory error, but, unfortunately, that was because of the high data load (about 3 GB/hour this time), not because of the same error. As it was Friday evening I had no time to repeat the experiment correctly.
- Nevertheless, I ran the server through Valgrind to detect possible memory leaks. Nothing serious was found (apart from the fact that the server died because of the high load), no huge memory leaks.
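
For illustration, the replay part of that experiment could look roughly like the C++ sketch below. The original was a small script; the file name, host and port here are placeholders.

```cpp
#include <boost/asio.hpp>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Rough sketch of a replay client: read a dump of '\0'-separated messages
// and send it back to the server as if it were the real client.
int main() {
    std::ifstream dump("messages.dump", std::ios::binary);
    std::string all((std::istreambuf_iterator<char>(dump)),
                    std::istreambuf_iterator<char>());

    boost::asio::io_service io;
    boost::asio::ip::tcp::resolver resolver(io);
    boost::asio::ip::tcp::resolver::query query("localhost", "5000");
    boost::asio::ip::tcp::socket socket(io);
    boost::asio::connect(socket, resolver.resolve(query));

    // The dump already contains the '\0' delimiters, so it can be replayed
    // as-is. A faithful repeat of the experiment would throttle this to the
    // original ~50 MB/hour instead of pushing it all at once.
    boost::asio::write(socket, boost::asio::buffer(all));
    return 0;
}
```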

Questions
- Where were these 700 MB of data which the client sent and the server didn't get? Why were they persistent and not lost when the server restarted the connection?
- It seems to me that the problem is somehow connected with the server not getting notifications from boost::asio::io_service. The buffer gets filled with data, but no calls to the read handler are made. Could this be a problem on the OS side? Could something be wrong with the asynchronous calls? If so, how could this be checked?
- What can I do to detect the source of the problem? As I said, I've run out of sane ideas and each experiment is very expensive in terms of time (it takes about two or three days to get the system into the described state), so I need to squeeze as many checks as possible into each experiment.

I would be grateful for any ideas I can use to get to the error.

Update: OK, it seems that the error was a synchronous write left in the middle of the asynchronous client-server interaction. As both connections lived in one thread, this synchronous write was blocking the thread for some reason, and all interaction on both the command and data connections stopped. So I changed it to the async version and now it seems to work. A rough sketch of the change is below.
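
Roughly, the change amounts to something like this. It is a sketch only; the function and variable names are not from the real code.

```cpp
#include <boost/asio.hpp>
#include <iostream>
#include <memory>
#include <string>
#include <utility>

// Illustrative only: replacing a blocking write on the shared io_service
// thread with async_write, so the thread servicing both connections is
// never stalled waiting for the peer to drain the socket.
void send_response(boost::asio::ip::tcp::socket& socket, std::string response) {
    // Before the fix, something like this blocked the single io_service
    // thread until the peer accepted all the data:
    //   boost::asio::write(socket, boost::asio::buffer(response));

    // After the fix: hand the buffer to asio and return immediately. The
    // buffer must outlive the operation, hence the shared_ptr captured by
    // the completion handler.
    auto outgoing = std::make_shared<std::string>(std::move(response));
    boost::asio::async_write(socket, boost::asio::buffer(*outgoing),
        [outgoing](const boost::system::error_code& ec, std::size_t /*bytes*/) {
            if (ec) {
                std::cerr << "write error: " << ec.message() << "\n";
            }
        });
}
```

With async_write the io_service thread never waits on the socket, so the command and data connections keep being serviced.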

As I said, I've run out of sane ideas and each experiment is very expensive in terms of time (it takes about two or three days to get the system into the described state)

One way to simplify the investigation of this problem is to run the server inside a virtual machine until it reaches this broken state. Then you can take a snapshot of the whole system and revert to it every time something goes wrong during the investigation. At least you will not have to wait three days to reach this state again.
