
what can cause large discrepancy between minor GC time and total pause time?

We have a latency-sensitive application, and are experiencing some GC-related pauses we don't fully understand. We occasionally have a minor GC that results in application pause times that are much longer than the reported GC time itself. Here is an example log snippet:

485377.257: [GC 485378.857: [ParNew: 105845K->621K(118016K), 0.0028070 secs] 136492K->31374K(1035520K), secs] [Times: user=0.01 sys=0.00, real=1.61 secs]
Total time for which application threads were stopped: seconds

The total pause time here is orders of magnitude longer than the reported GC time. These are isolated and occasional events: the immediately preceding and succeeding minor GC events do not show this large discrepancy.

The process is running on a dedicated machine, with lots of free memory, 8 cores, running Red Hat Enterprise Linux ES Release 4 Update 8 with kernel 2.6.9-89.0.1EL-smp. We have observed this with (32 bit) JVM versions 1.6.0_13 and 1.6.0_18.

We are running with these flags:

-server -ea -Xms512m -Xmx512m -XX:+UseConcMarkSweepGC -XX:NewSize=128m -XX:MaxNewSize=128m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:-TraceClassUnloading

Can anybody offer some explanation as to what might be going on here, and/or some avenues for further investigation?

You're positive you're not swapping? Typically seeing:

Times: user=0.01 sys=0.00, real=1.61 secs

(from your trace)

suggests that something happened in the process that doesn't take CPU but does take wall-clock time... and that's usually swap or other I/O. A bit of iostat might help shed light on it.
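To follow up on the swap/I/O theory, a few standard Linux diagnostics can be run alongside the application while you wait for one of these pauses; this is a sketch (iostat requires the sysstat package):

```shell
# Watch si/so (swap-in/out, pages/s) and wa (I/O wait %) during a pause.
vmstat 1 5

# Current swap totals; a shrinking SwapFree is worth investigating.
grep -E 'SwapTotal|SwapFree' /proc/meminfo

# Per-device I/O latency (sysstat package); high await implicates disk.
iostat -x 1 5
```

If si/so stay at zero across a long pause, swapping is effectively ruled out and attention can shift to safepoint behavior.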

Are you using a lot of native memory outside the Java heap (possibly via DirectByteBuffer, NIO, etc.)? That may be eating into your "lots of free memory" (much to your surprise). 'top' or vmstat might also show this.
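To illustrate the point about native memory: a direct buffer's storage lives outside the Java heap, so it is invisible to -Xmx and to the GC log, but it still consumes the machine's RAM. A minimal sketch (the 64 MB size is arbitrary):

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // Allocates 64 MB of native memory, outside the Java heap.
        // -Xmx512m does not cap this, so heavy direct-buffer use can
        // quietly consume the "lots of free memory" on the box.
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024 * 1024);
        System.out.println("direct=" + buf.isDirect()
                + " capacity=" + buf.capacity());
    }
}
```

The process's resident size in 'top' will reflect such allocations even though the heap-usage numbers in the GC log do not.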

"Time-to-safepoint" is a wide cause for this sort of thing. Unfortunately, GC logs only the time from when it started doing work (after ALL application thread have been paused at a safepoint) to when if finished (after which the threads will be released from their safepoints). -XX:+PrintGCApplicationStoppedTime (much more correctly) reports the time from telling the first thread to go to a safepoint to the time the last thread was released to run again.

It is unfortunately common to see one thread take a long time to come to a safepoint, and when this happens, all the other nice and polite threads that went to a safepoint and paused there when told will be waiting until the straggler comes in. Examples of such things are long runtime operations: e.g., cloning an object array is done with no internal safepoint opportunities in most JVMs (imagine cloning a 1GB array and happening to need to take a GC pause in the middle). Optimized counted loops in your own code can also end up running very long without internal safepoints.
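The counted-loop case can be sketched in Java. Method names here are illustrative, and whether the JIT actually elides the safepoint poll depends on the JVM version and compilation decisions; the general pattern is that a loop counted by an int may be compiled without a poll in its body, while a long-indexed loop keeps one:

```java
public class SafepointSketch {
    // A hot int-counted loop: HotSpot's JIT may compile this with no
    // safepoint poll inside the loop body, so a safepoint requested
    // mid-loop must wait until the whole loop finishes.
    static long sumCounted(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // A long-indexed loop is not a "counted loop" to the JIT; it keeps
    // a safepoint poll, so the thread can be stopped promptly, at the
    // cost of slightly slower straight-line code.
    static long sumUncounted(int[] a) {
        long s = 0;
        for (long i = 0; i < a.length; i++) {
            s += a[(int) i];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[1_000_000];
        java.util.Arrays.fill(a, 1);
        System.out.println(sumCounted(a) + " " + sumUncounted(a));
    }
}
```

If one such loop runs for hundreds of milliseconds on one thread, every other thread already parked at its safepoint sits idle for that long, which matches the pattern of user=0.01 but real=1.61 in the log above.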

[Zing has a built-in time-to-safepoint profiler, partly to track and beat down this sort of thing].

You say there's "lots of free memory" but your heap size is capped at 512MB. You might be running out of memory more often/earlier than you think.
