How to find the cause of bad TCP connections

We're developing an online game where players communicate with the server over a persistent TCP connection. Persistent in the sense that its lifetime is that of a player's session: if the connection is closed, the player is thrown out of the game (though the client will attempt to reconnect automatically).

Problem

Now, of course everything works fine in our office (connecting to both testing and live servers), but our client reports that some players get disconnected a lot (every few seconds), and that they experience it themselves too (though their offices are in the same building).

Question

How can I find out the cause of these disconnects? Is it because:

  • Players have bad internet connections and it can't be helped.
  • The distance between players and server (Turkey <-> Netherlands) is too long.
  • Something is wrong with the server (a CentOS machine) or the datacenter.
  • The server is overloaded (though it happens under low loads too).
  • There is an error in our software.
  • Or some other reason?

The software is written in Java. It logs when players are disconnected, and if it actively kicks them (e.g. for not sending keep-alive messages) it logs that too.

Known data

  • Whenever a spurious disconnect is reported and I check the logs, most of the time I don't see that player being actively kicked by the server software; I only see that the connection has been closed.
  • There is an internal monitoring service that holds a number of localhost connections to the game server, the same way players do, and it doesn't get disconnected.

Others

There are many other online games like ours. How do they deal with this? (Unless the problem is in the server or datacenter, in which case the solution is obvious.)

  • Do they use UDP? I know action games do, for speed, but I presume TCP is normal for e.g. online poker and other slow games? (Not that it would help us: our client software is made in Flash, which doesn't support UDP.)
  • Is there some TCP tweaking that can be done to make it more lenient?
  • Or do they get these disconnects as well, and just reconnect more transparently?
  • Is there information about this on the web?

I would ask players to let you enable "anonymous usage data" collection, as many apps do, so that debugging information from their sessions is periodically uploaded back to you. This is how you figure out these sorts of situations.

From there, what you'll need when a disconnect happens is a pretty verbose log. When the disconnect happens, catch whatever exception was thrown, and don't forget to also log the cause via a call to .getCause(), making as many calls to .getCause() as necessary until you've logged all the way back to the root cause. Also log any relevant data you need to match up the client log with the server-side logs. Information you'll likely need includes session IDs, game IDs, timestamps, etc. Just think, "What information would I need in order to troubleshoot this, assuming I had insight into both sides of the connection?", because that is what you'll ultimately get by asking users to upload usage and debugging data.
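As a rough illustration of that logging advice, here is a minimal Java sketch that builds one log line from correlation data plus the whole cause chain. The class name `DisconnectLog`, the method `describe`, and the log format are hypothetical choices for this example, not part of the original software, and a real client (e.g. one written in Flash) would need its own equivalent.

```java
import java.io.IOException;
import java.net.SocketException;
import java.time.Instant;

// Hypothetical helper: names and log format are assumptions for illustration.
final class DisconnectLog {

    /** Builds one log line with correlation data and every cause down to the root. */
    static String describe(String sessionId, Instant when, Throwable error) {
        StringBuilder sb = new StringBuilder();
        sb.append("disconnect session=").append(sessionId)
          .append(" time=").append(when);
        int depth = 0;
        // Follow getCause() until it returns null (the root cause has no cause).
        // The depth cap guards against an unusually long or circular chain.
        for (Throwable t = error; t != null && depth < 20; t = t.getCause(), depth++) {
            sb.append(depth == 0 ? " error=" : " caused-by=")
              .append(t.getClass().getName())
              .append(": ").append(t.getMessage());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Throwable root = new SocketException("Connection reset");
        Throwable wrapped = new IOException("read failed", root);
        System.out.println(describe("abc123", Instant.now(), wrapped));
    }
}
```

Writing the same session ID and a timestamp on both sides is what lets a client-side record be lined up with the matching server-side "connection closed" entry.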

From there you should be able to identify at least a few situations that you have control over, that is, where you can change your client/server code to alleviate some of the problems. In other cases, where the problem is a client's configuration or faulty equipment (or maybe a piece of equipment in between that neither of you controls), you'll have to rely on robust re-connectivity.
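One common way to make reconnection more robust is to retry with an exponentially growing, capped delay. The sketch below shows that idea in Java only because the server is written in Java; the class `Reconnector`, its method names, and the delay values are assumptions for illustration, and the actual client would have to implement the equivalent logic in its own language.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: names and parameters are assumptions for illustration.
final class Reconnector {

    /** Delay before retrying after the given 0-based attempt: base doubled per attempt, capped at max. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        // Stop doubling once the cap is reached, which also avoids overflow.
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Calls connect until it succeeds or maxAttempts is used up, sleeping between failures. */
    static <T> T connectWithRetry(Callable<T> connect, int maxAttempts,
                                  long baseMillis, long maxMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e; // remember the failure so it can be logged or rethrown
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }
}
```

With a base of 500 ms and a cap of 8000 ms, for example, the waits would be 500, 1000, 2000, 4000, then 8000 ms for every later attempt. Capping the delay keeps a briefly interrupted player from waiting a long time, while the growth avoids hammering the server during a longer outage.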

You'll never reduce disconnects to zero. But once you have seen enough cases, this information should help you narrow the remaining disconnects down to situations that are entirely outside your control. At that point your power to shape the network ends, and you'll be as close to a "best case scenario" for network reliability as you can get.
