简体   繁体   中英

CXF Webservice randomly waits 3 seconds for the Apache Tomcat response

We have a Java 8 application served by an Apache Tomcat 8 behind an Apache server, which is requesting multiple webservices in parallel using CXF. From time to time, there's one of them which lasts exactly 3 more seconds than the rest (which should be about 500ms only).

I've activated CXF debug, and now I have the place inside CXF where the 3 seconds are lost:

14/03/2018 09:20:49.061 [pool-838-thread-1] DEBUG o.a.cxf.transport.http.HTTPConduit - No Trust Decider for Conduit '{http://ws.webapp.com/}QueryWSImplPort.http-conduit'. An affirmative Trust Decision is assumed. 
14/03/2018 09:20:52.077 [pool-838-thread-1] DEBUG o.a.cxf.transport.http.HTTPConduit - Sending POST Message with Headers to http://172.16.56.10:5050/services/quertServices Conduit :{http://ws.webapp.com/}QueryWSImplPort.http-conduit

As you could see, there're three seconds between these two lines. When the request is ok, it usually takes 0ms in between these two lines.

I've been looking into the CXF code, but no clue about the reason of these 3 secs...

The server application (which is also maintained by us), is served from another Apache Tomcat 6.0.49, which is behind another Apache server. The thing is that it seems that the server's Apache receives the request after the 3 seconds.

Anyone could help me?

EDIT: We've monitored the server's send/received packages and it seems that the client's server is sending a negotiating package at the time it should, while the server is replying after 3 seconds. These are the packages we've found:

481153  11:31:32 14/03/2018 2429.8542795    tomcat6.exe SOLTESTV010     SOLTESTV002     TCP TCP:Flags=CE....S., SrcPort=65160, DstPort=5050, PayloadLen=0, Seq=2858646321, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192  {TCP:5513, IPv4:62}
481686  11:31:35 14/03/2018 2432.8608381    tomcat6.exe SOLTESTV002     SOLTESTV010     TCP TCP:Flags=...A..S., SrcPort=5050, DstPort=65160, PayloadLen=0, Seq=436586023, Ack=2858646322, Win=8192 ( Negotiated scale factor 0x8 ) = 2097152    {TCP:5513, IPv4:62}

481687  11:31:35 14/03/2018 2432.8613607    tomcat6.exe SOLTESTV010     SOLTESTV002     TCP TCP:Flags=...A...., SrcPort=65160, DstPort=5050, PayloadLen=0, Seq=2858646322, Ack=436586024, Win=256 (scale factor 0x8) = 65536    {TCP:5513, IPv4:62}
481688  11:31:35 14/03/2018 2432.8628380    tomcat6.exe SOLTESTV010     SOLTESTV002     HTTP    HTTP:Request, POST /services/consultaServices    {HTTP:5524, TCP:5513, IPv4:62}

So, it seems is the server's Tomcat is the one which is blocked with something. Any clue?

EDIT 2:

Although that happened yesterday (the first server waiting 3s for the ack of the second), this is not the most common scenario. What it usually happens is what I described at the beginning (3 seconds between the two CXF's logs and the server receiving any request from the first one after 3 seconds.

There has been some times when the server (the one which receives the request), hangs for 3 seconds. For instance:

  1. Server 1 sends 5 requests at the same time (suppossedly) to server 2.

  2. Server 2 receives 4 of them, in that same second, and start to process them.

  3. Server 2 finish processing 2 of those 4 requests in 30ms and replies to server 1.

  4. More or less at this same second, there's nothing registered in the application logs.

  5. After three seconds, logs are registered again, and the server finish to process the remaining 2 requests. So, although the process itself is about only some milliseconds, the response_time - request_time is 3 seconds and a few ms.

  6. At this same time, the remaining request (the last one from the 5 request which were sent), is registered in the network monitor and is processed by the application in just a few milliseconds. However, the global processing time is anything more than 3s, as it has reached the server 3 seconds after being sent.

So there's like a hang in the middle of the process. 2 requests were successfully processed before this hang and replied in just a fraction of a second. 2 other request lasted a little bit more, the hang happened, and ended with a processing time of 3 seconds. The last one, reached the server just when the hang happened, so it didn't get into the application after the hang.

It sounds like a gc stop the world... but we have analyzed gc.logs and there's nothing wrong with that... could there be any other reason?

Thanks!

EDIT 3:

Looking at the TCP flags like the one I pasted last week, we've noticed that there are lots of packets with the CE flag, that is a notification of TCP congestion. We're not network experts, but have found that this could deal us to a 3 seconds of delay before the packet retransmission... could anyone give us some help about that?

Thanks. Kind regards.

The thing is that it seems that the server's Apache receives the request after the 3 seconds.

How do you figure this out ? If you're looking at Apache logs, you can be misleaded by wrong time stamps.

I first thought that your Tomcat 6 takes 3 seconds to answer instead of 0 to 500ms, but from the question and the comments, it is not the case.

Hypothesis 1 : Garbage Collector

The GC is known for introducing latency. Highlight the GC activity in your logs by using the GC verbosity parameters . If it is too difficult to correlate, you can use the jstat command with the gcutil option and you can compare easily with the Tomcat's log.

Hypothesis 2 : network timeout

Although 3s is a very short time (in comparison with the 21s TCP default timeout on Windows for example), it could be a timeout. To track the timeouts, you can use the netstat command. With netstat -an , look for the SYN_SENT connections, and with netstat -s look for the error counters. Please check if there is any network resource that must be resolved or accessed in this guilty webservice caller.

Finally, it was everything caused by the network congestion we discovered looking at the TCP flags. Our network admins has been looking at the problem, trying to reduce the congestion, reducing the timeout to retransmit.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM