I'm running a Java 7 Dropwizard app on a CentOS 6.4 server that basically serves as a layer on top of a data store (Cassandra) and does some additional processing. It also has an interface to Zookeeper using the Curator framework for some other stuff. This all works well and good most of the time, CPU and RAM load is never above 50% and usually about 10% and our response times are good.
My problem is that recently we've discovered that occasionally we get blips of about 1-2 seconds where seemingly all tasks scheduled via thread pools get delayed. We noticed this because of connection timeouts to Cassandra and session timeouts with Zookeeper. What we've done to narrow it down:
/proc/sys/kernel/threads-max
is 190115 and we never get above 1000. Code for #7 (#6 is identical except for using a Socket and PrintWriter in place of the FileWriter):
public void start() throws IOException {
fileWriter = new FileWriter(this.fileName, false);
executor = Executors.newSingleThreadScheduledExecutor();
executor.scheduleAtFixedRate(this, 0, this.delayMillis, TimeUnit.MILLISECONDS);
}
@Override
public synchronized void run() {
try {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
Date now = new Date();
String debugString = "ExecutorService test " + this.content + " : " + sdf.format(now) + "\n";
fileWriter.write(debugString);
fileWriter.flush();
} catch (Exception e) {
logger.error("Error running ExecutorService test: " + e.toString());
}
}
So it seems like the Executor is scheduling the tasks to be run, but they're being delayed in starting (because the timestamps are delayed and there's no way the first two lines of the try
block in the run
method are delaying the task execution). Any ideas on what might cause this or other things we can try? Hopefully we won't get to the point where we start reverting the code until we find what change caused it...
TL;DR: Scheduled tasks are being delayed and we don't know why.
UPDATE 1: We modified the executor task to push timestamps every half-second into a ring buffer instead of straight out to a file, and then dump the buffer every 20 seconds. This removes I/O as a possible cause of blocking task execution but still gives us the same info. From this, we still saw the same pattern of timestamps, from which it appears that the issue is not something in the task occasionally blocking the next execution of the task, but something in the task execution engine itself delaying execution for some reason.
When you use scheduleAtFixedRate
, your expressing a desire that your task should be run as close to that rate as possible. The executor will do its best to keep to it, but sometimes it can't.
Your using Executors.newSingleThreadScheduledExecutor()
, and so the executor only has a single thread to play with. If each execution of the task is taking longer than the period you specified in your schedule, then the executor won't be able to keep up, since the single thread may not have finished executing the previous run before the schedule kicked in the execute the next run. The result would manifest itself as delays in the schedule. This would seem a plausible explanation, since you say your real code is writing to a socket. That can easily block and send your timing off kilter.
You can find out if this is indeed the case by adding more logging at the end of the run
method (ie after the flush
). If the IO is taking too long, you'll see that in the logs.
As a fix, you could consider using scheduleWithFixedDelay
instead, which will add a delay between each execution of the task, so long-running tasks don't run into each other. Failing that, then you need to ensure that the socket write completes on time, allowing each subsequent task execution to start on schedule.
The first step to diagnose a liveness issue is usually taking a thread dump when the system is stalled, and check what the threads were doing. In your case, the executor threads would be of particular interest. Are they processing, or are they waiting for work?
If they are all processing, the executor service has run out of worker threads, and can only schedule new tasks once a current task has been completed. This may be caused by tasks temporarily taking longer to complete. The stack traces of the worker threads may yield a clue just what is taking longer.
If many worker threads are idle, you have found a bug in the JDK. Congratulations!
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.