
Delay in multiple TCP connections from Java to the same machine

(See this question on ServerFault.)

I have a Java client that uses Socket to open concurrent connections to the same machine. One request completes extremely fast, but the others see a delay of 100-3000 milliseconds. Packet inspection with Wireshark shows that every SYN packet after the first waits a long time before leaving the client. This happens whether the client is a Windows 2008 or a Linux box. What could be causing this?

Code attached:

import java.util.*;
import java.util.concurrent.*;
import java.net.*;

public class Tester {
    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            usage();
            return;
        }
        final int n = Integer.parseInt(args[0]);
        final String ip = args[1];
        final int port = Integer.parseInt(args[2]);

        ExecutorService executor = Executors.newFixedThreadPool(n);

        ArrayList<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
        for (int i = 0; i < n; ++i)
            tasks.add(new Callable<Long>() {
                public Long call() {
                    // Measure only the time taken to establish the connection.
                    long before = System.currentTimeMillis();
                    // try-with-resources closes the socket once it has connected.
                    try (Socket socket = new Socket()) {
                        socket.connect(new InetSocketAddress(ip, port));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    return System.currentTimeMillis() - before;
                }
            });
        System.out.println("Invoking");
        List<Future<Long>> results = executor.invokeAll(tasks);
        System.out.println("Invoked");
        for (Future<Long> future : results) {
            System.out.println(future.get());
        }
        executor.shutdown();
    }

    private static void usage() {
        System.out.println("Usage: prog <threads> <IP> <port>");
        System.out.println("Example:");
        System.out.println("  prog 10 127.0.0.1 2000");
    }
}

Update: the problem reproduces consistently if I clear the relevant ARP entry before running the test program. I've tried tuning the TCP retransmission timeout, but that didn't help. We also ported this program to .NET, and the problem still happens.

Update 2: 3 seconds is the initial retransmission timeout that RFC 1122 specifies for new connections. I still don't fully understand why there is a retransmission here; I would expect this to be handled by the MAC layer. We also reproduced the problem using netcat, so it has nothing to do with Java.

It looks like you are using a single underlying HTTP connection. If so, the next request can't proceed until you call close() on the InputStream of the HttpURLConnection, i.e. until you have processed the response.

Alternatively, use a pool of HTTP connections.

You are doing the right thing in reducing the size of the problem space. On the surface this is an impossible problem: something that moves between IP stacks, languages and machines, and yet is not arbitrarily reproducible (e.g. I cannot repro it using your code on either Windows or Linux).

Some suggestions, going from the top of the stack to the bottom:

  • Code -- you say this happens on .NET and Java. Are there any language/compiler combinations for which it does not happen? I used your client talking to the SocketTest program from SourceForge and also to "nc", with identical results: no delays. Similarly, JDK 1.5 vs 1.6 made no difference for me.

    -- Suppose you pace the speed at which the client sends requests, say one every 500 ms. Does the problem repro?

  • IP stack -- maybe something is getting stuck in the stack on the way out. I see you've ruled out Nagle, but don't forget silly stuff like firewalls/iptables. I'd find it hard to believe that the TCP stack on both Windows and Linux is that hosed, but you never know.

    -- Loopback interface handling can be freaky. Does it repro when you use the machine's real IP? What about across the network (or better, back-to-back with a crossover cable to another machine)?

  • NIC -- if the packets are making it to the cards, consider features of the cards (TCP offload or other 'special' handling) or quirks in the NICs themselves. Do you get the same results with other brands of NIC?
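The pacing experiment suggested above could be sketched roughly as follows. This is not code from the question: the class name `PacedTester`, its helper methods, and the 500 ms default are illustrative assumptions. It opens the connections one at a time with a fixed pause between attempts and records how long each connect() takes, so the timings can be compared with the concurrent run.

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the original post): opens connections
// sequentially, pausing between attempts instead of firing them all at once.
final class PacedTester {

    // Returns the time in milliseconds that a single connect() took.
    static long timeConnect(String ip, int port) throws Exception {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(ip, port));
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Opens n connections, sleeping pauseMillis between consecutive attempts.
    static List<Long> run(int n, String ip, int port, long pauseMillis) throws Exception {
        List<Long> timings = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            timings.add(timeConnect(ip, port));
            if (i < n - 1) {
                Thread.sleep(pauseMillis);
            }
        }
        return timings;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("Usage: PacedTester <count> <IP> <port>");
            return;
        }
        // 500 ms matches the interval suggested in the text above.
        for (long ms : run(Integer.parseInt(args[0]), args[1], Integer.parseInt(args[2]), 500)) {
            System.out.println(ms);
        }
    }
}
```

If only the first paced connection is slow and the rest are fast, that points at something that happens once per destination rather than once per connection.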

I haven't found a real answer from this discussion. The best theory I've come up with is:

  1. The TCP layer sends a SYN to the MAC layer. This happens from several threads.
  2. The first thread sees that the IP has no match in the ARP table and sends an ARP request.
  3. Subsequent threads see that there is a pending ARP request, so they drop the packet altogether. This behavior is probably implemented in the kernel of several operating systems!
  4. The ARP reply returns, the original SYN from the first thread leaves the machine, and a TCP connection is established.
  5. For the dropped SYNs, the TCP layer waits 3 seconds as stated in RFC 1122, then retries and succeeds.

I've tried tweaking the timeout in Windows 7 but wasn't successful. If anyone can reproduce the problem and provide a workaround, I'd be most grateful. Also, if anyone has more details on why exactly this phenomenon happens only with multiple threads, it would be interesting to hear.

I'll try to accept this answer, as I don't think any of the other answers provided a true explanation (see this discussion on meta).

If either of the machines is a Windows box, I'd take a look at the Max Concurrent Connections setting on both. See: http://www.speedguide.net/read_articles.php?id=1497

I think this is an app-level limit in some cases, so you'll have to follow the guide to raise it.

In addition, if this is what is happening, you should see something in the System Event Log on the offending machine.

Java client that uses HttpURLConnection to open concurrent connections to the same machine.

The same machine? What application accepts the clients? If you wrote that program yourself, maybe you should time how fast your server can accept clients. Maybe it is just a badly written (or slow) server application. I think the server code looks something like this:

ServerSocket ss = ...;
while (acceptingMoreClients)
{
   Socket s = ss.accept();
   // At this point the client is connected to the server, so start timing.
   long start = System.currentTimeMillis();
   ClientHandler handler = new ClientHandler(s);
   handler.start();

   // Once handler.start() returns, the handler thread is running and
   // the server is ready to accept a new client, so stop timing.
   long stop = System.currentTimeMillis();
   System.out.println("Client accepted in " + (stop - start) + " millis");
}

If these results are bad, then you know where the problem is.
I hope this helps you get closer to the solution.


Question:

For the test, do you use the IP you received from the DHCP server, or 127.0.0.1? If you use the one from the DHCP server, everything goes through your company's router/switch/... That can slow down the whole process.

Otherwise:

  • Windows: all TCP traffic from localhost to localhost is redirected in the software layer of the system (not the hardware layer), which is why you cannot see that TCP traffic with Wireshark. Wireshark only sees the traffic that passes through the hardware layer.
  • Linux: Wireshark can only see the traffic at the hardware layer. Linux doesn't redirect in the software layer. That is also the reason why InetAddress.getLocalHost().getAddress() returns 127.0.0.1.
  • So when you use Windows, it is quite normal that you cannot see the SYN packet with Wireshark.

Martijn.

Since the problem isn't reproducible unless you clear the associated ARP cache entry, what does the entire packet trace look like from a timing perspective, from the time the ARP request is issued until after the 3-second delay?

What happens if you open connections to two different IPs? Will the first connection to each succeed? If so, that should rule out any JVM or library issues.

The first SYN can't be sent until the ARP response arrives. Maybe the OS or TCP stack uses a timeout instead of an event for threads beyond the first one that try to open a connection while the associated MAC address isn't known.

Imagine the following scenario:

  1. Thread #1 tries to connect, but the SYN can't be sent because the ARP cache is empty, so it queues the ARP request.
  2. Next, Thread #2 (through #N) tries to connect. It also can't send the SYN packet because the ARP cache is empty. This time, though, instead of sending another ARP request, the thread goes to sleep for 3 seconds, as the RFC says.
  3. Next, the ARP response arrives. Thread #1 wakes up immediately and sends the SYN.
  4. Thread #2 isn't waiting on the ARP request; it has a hard-coded 3-second sleep. So after 3 seconds, it wakes up, finds the ARP entry it needs, and sends the SYN.
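If the scenario above is accurate, one way to test it is to let a single connection populate the ARP cache before the concurrent connections start; the 3-second stragglers should then disappear. The sketch below illustrates that idea only. It was not tried in this thread, and the class name `WarmUpTester` and its methods are hypothetical.

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical variant of the question's Tester: one warm-up connection
// runs alone before the concurrent connections are started.
final class WarmUpTester {

    // Returns the time in milliseconds that a single connect() took.
    static long timeConnect(String ip, int port) throws Exception {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(ip, port));
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Returns the warm-up timing first, followed by the n concurrent timings.
    static List<Long> run(int n, String ip, int port) throws Exception {
        // Assumption: this single connection triggers (and waits for) ARP resolution.
        long warmUp = timeConnect(ip, port);

        ExecutorService executor = Executors.newFixedThreadPool(n);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                tasks.add(() -> timeConnect(ip, port));
            }
            List<Long> timings = new ArrayList<>();
            timings.add(warmUp);
            for (Future<Long> future : executor.invokeAll(tasks)) {
                timings.add(future.get());
            }
            return timings;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("Usage: WarmUpTester <threads> <IP> <port>");
            return;
        }
        for (long ms : run(Integer.parseInt(args[0]), args[1], Integer.parseInt(args[2]))) {
            System.out.println(ms);
        }
    }
}
```

To make the comparison meaningful, clear the ARP entry before each run, as described in the question's update. If the concurrent timings are still slow after the warm-up, the ARP explanation is probably not the whole story.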

The fact that you see this on multiple clients, with different OSes, and with different application environments on (I assume) the same OS is a strong indication that the problem lies with either the network or the server, not the client. This is reinforced by your comment that clearing the ARP table reproduces the problem.

Do you perhaps have two machines on the switch with the same MAC address? (One of them will probably be a router that's spoofing the MAC address.)

Or more likely, if I recall ARP correctly, two machines that have the same hardcoded IP address. When the client sends out "who is IP 123.456.123.456", both will answer, but only one will actually be listening.

Another possibility (I've seen this happen in a corporate environment) is a rogue DHCP server, again giving out the same IP address to two machines.

I have seen similar behavior when I was getting DNS timeouts. To test this, you can either use the IP address directly or enter the IP address in your hosts file.
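One way to check the DNS theory is to time name resolution separately from connect(). The following is a minimal sketch, assuming a hypothetical class `ResolveThenConnect` that is not part of the original code. InetAddress.getByName performs the lookup, or only validates the format when given a literal IP address, and the socket then connects to the already-resolved address.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: separates the time spent resolving the host name
// from the time spent establishing the TCP connection.
final class ResolveThenConnect {

    // Returns { resolveMillis, connectMillis }.
    static long[] time(String host, int port) throws Exception {
        long t0 = System.nanoTime();
        // Name lookup; for a literal IP address no lookup is performed.
        InetAddress address = InetAddress.getByName(host);
        long t1 = System.nanoTime();
        try (Socket socket = new Socket()) {
            // Connect to the resolved address so no second lookup happens here.
            socket.connect(new InetSocketAddress(address, port));
        }
        long t2 = System.nanoTime();
        return new long[] { (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000 };
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Usage: ResolveThenConnect <host> <port>");
            return;
        }
        long[] t = time(args[0], Integer.parseInt(args[1]));
        System.out.println("resolve: " + t[0] + " ms, connect: " + t[1] + " ms");
    }
}
```

Note that the JVM caches successful lookups, so only the first call in a process is likely to show the full resolution cost.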

Does setting socket.setTcpNoDelay(true) help?

Have you tried running your client under strace to see what system calls are made?

It's been very helpful to me in the past when debugging some mysterious networking issues.

What is the listen backlog on the server? How quickly is it accepting connections? If the backlog fills up, the OS ignores connection attempts. 3 seconds later, the client tries again and gets in, now that the backlog has cleared.
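If the server is written in Java, the backlog can be requested explicitly through the two-argument ServerSocket constructor. The sketch below is an illustration only: the class `BacklogServer`, the backlog value of 128, and the handler that just closes the connection are assumptions, and the operating system may cap or adjust the requested backlog.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical server sketch: explicit listen backlog plus an accept loop
// that hands each client to a pool so accept() is called again right away.
final class BacklogServer {

    // The backlog is a hint for the pending-connection queue length.
    static ServerSocket open(int port, int backlog) throws IOException {
        return new ServerSocket(port, backlog);
    }

    // Accepts exactly count connections and returns how many were accepted.
    static int acceptAll(ServerSocket server, int count, ExecutorService pool) throws IOException {
        int accepted = 0;
        while (accepted < count) {
            Socket client = server.accept();
            accepted++;
            // Placeholder handler: a real server would process the request here.
            pool.execute(() -> {
                try {
                    client.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if close fails in this sketch.
                }
            });
        }
        return accepted;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Usage: BacklogServer <port> <connections>");
            return;
        }
        ExecutorService pool = Executors.newCachedThreadPool();
        try (ServerSocket server = open(Integer.parseInt(args[0]), 128)) {
            System.out.println("Accepted " + acceptAll(server, Integer.parseInt(args[1]), pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

Keeping the work inside the accept loop to a minimum is what prevents the queue from filling up; the backlog size only determines how large a burst can be absorbed.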
