
ExecutorService task execution intermittently delayed

I'm running a Java 7 Dropwizard app on a CentOS 6.4 server that basically serves as a layer on top of a data store (Cassandra) and does some additional processing. It also has an interface to Zookeeper using the Curator framework for some other stuff. This all works well most of the time: CPU and RAM load is never above 50% (usually about 10%), and our response times are good.

My problem is that recently we've discovered that occasionally we get blips of about 1-2 seconds where seemingly all tasks scheduled via thread pools get delayed. We noticed this because of connection timeouts to Cassandra and session timeouts with Zookeeper. What we've done to narrow it down:

  1. Used Wireshark and Boundary to make sure all network activity from our app was getting stalled, not just a single component. All network activity was stalling at the same time.
  2. Wrote a quick little Python script to send timestamp strings to netcat on one of the servers we were seeing connection timeouts to, to make sure it's not an overall network issue between the boxes. We saw all timestamps come through smoothly during periods where our app had timeouts.
  3. Disabled hyperthreading on the server.
  4. Checked garbage collection timing logs for the timeout periods. They were consistent and well under 1ms through the timeout periods.
  5. Checked our CPU and RAM resources during the timeout periods. Again, consistent, and well under significant load.
  6. Added an additional Dropwizard resource to our app for diagnostics that would send timestamp strings to netcat on another server, just like the Python script. In this case, we did see delays in the timestamps when we saw timeouts in our app. With half-second pings, we would generally see a whole second missing entirely, and then four pings in the next second, the extra two being the delayed pings from the previous second.
  7. To remove the network from the equation, we changed the above to just write to the console and a local file instead of to the network. We saw the same results (delayed pings) with both of those.
  8. Profiled and checked our thread pool settings to see if we were using too many OS threads. /proc/sys/kernel/threads-max is 190115 and we never get above 1000.

Code for #7 (#6 is identical except for using a Socket and PrintWriter in place of the FileWriter):

public void start() throws IOException {
    fileWriter = new FileWriter(this.fileName, false);

    executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleAtFixedRate(this, 0, this.delayMillis, TimeUnit.MILLISECONDS);
}

@Override
public synchronized void run() {
    try {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        Date now = new Date();
        String debugString = "ExecutorService test " + this.content + " : " + sdf.format(now) + "\n";
        fileWriter.write(debugString);
        fileWriter.flush();
    } catch (Exception e) {
        logger.error("Error running ExecutorService test: " + e.toString());
    }
}

So it seems like the Executor is scheduling the tasks to be run, but they're being delayed in starting (because the timestamps are delayed and there's no way the first two lines of the try block in the run method are delaying the task execution). Any ideas on what might cause this or other things we can try? Hopefully we won't get to the point where we start reverting the code until we find what change caused it...

TL;DR: Scheduled tasks are being delayed and we don't know why.

UPDATE 1: We modified the executor task to push timestamps every half-second into a ring buffer instead of straight out to a file, and then dump the buffer every 20 seconds. This removes I/O as a possible cause of blocking task execution but still gives us the same info. From this, we still saw the same pattern of timestamps, from which it appears that the issue is not something in the task occasionally blocking the next execution of the task, but something in the task execution engine itself delaying execution for some reason.
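
For reference, the I/O-free probe from this update can be sketched roughly as below. This is not our exact code: the class name TimestampRing and the helpers maxGapMillis and sample are made up for illustration. The task only stores System.nanoTime() in a fixed-size array, so nothing inside run() can block on a file or socket.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical probe: records when each scheduled run starts, with no I/O in the task. */
class TimestampRing implements Runnable {
    private final long[] stamps; // nanoTime of each run, overwritten circularly
    private long count;          // total number of runs recorded so far

    TimestampRing(int capacity) {
        this.stamps = new long[capacity];
    }

    @Override
    public synchronized void run() {
        stamps[(int) (count % stamps.length)] = System.nanoTime();
        count++;
    }

    synchronized long recorded() {
        return count;
    }

    /** Largest gap, in milliseconds, between consecutive timestamps still in the buffer. */
    synchronized long maxGapMillis() {
        long retained = Math.min(count, stamps.length);
        long max = 0;
        for (long i = count - retained + 1; i < count; i++) {
            long prev = stamps[(int) ((i - 1) % stamps.length)];
            long cur = stamps[(int) (i % stamps.length)];
            max = Math.max(max, TimeUnit.NANOSECONDS.toMillis(cur - prev));
        }
        return max;
    }

    /** Runs the probe at a fixed rate for the given duration and returns the largest gap seen. */
    static long sample(long periodMillis, long durationMillis) {
        TimestampRing ring = new TimestampRing(1024);
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(ring, 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(durationMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        return ring.maxGapMillis();
    }
}
```

Calling TimestampRing.sample(500, 20000) mirrors the half-second pings and 20-second dump interval described above. A returned gap well above 500 ms would mean a run started late even though the task itself does no I/O.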

When you use scheduleAtFixedRate , you're expressing a desire that your task should be run as close to that rate as possible. The executor will do its best to keep to it, but sometimes it can't.

You're using Executors.newSingleThreadScheduledExecutor() , and so the executor only has a single thread to play with. If each execution of the task takes longer than the period you specified in your schedule, then the executor won't be able to keep up, since the single thread may not have finished executing the previous run before the schedule kicked in to execute the next run. The result would manifest itself as delays in the schedule. This would seem a plausible explanation, since you say your real code is writing to a socket. That can easily block and send your timing off kilter.

You can find out if this is indeed the case by adding more logging at the end of the run method (i.e. after the flush ). If the IO is taking too long, you'll see that in the logs.
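
One way to get that timing without touching the task body is to wrap it. The sketch below is only an illustration: TimedTask, overruns and lastElapsedMillis are names I made up, and it prints to System.err rather than assuming anything about your logger.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical wrapper that measures how long each run of a task takes. */
class TimedTask implements Runnable {
    private final Runnable delegate;
    private final long periodMillis;
    private final AtomicInteger overruns = new AtomicInteger();
    private volatile long lastElapsedMillis;

    TimedTask(Runnable delegate, long periodMillis) {
        this.delegate = delegate;
        this.periodMillis = periodMillis;
    }

    @Override
    public void run() {
        long start = System.nanoTime();
        try {
            delegate.run();
        } finally {
            // Measured even if the delegate throws, so a failing run is still timed.
            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            lastElapsedMillis = elapsed;
            if (elapsed > periodMillis) {
                overruns.incrementAndGet();
                System.err.println("Task overran its period: " + elapsed
                        + " ms > " + periodMillis + " ms");
            }
        }
    }

    /** Number of runs that took longer than the scheduling period. */
    int overruns() {
        return overruns.get();
    }

    /** Duration of the most recent run, in milliseconds. */
    long lastElapsedMillis() {
        return lastElapsedMillis;
    }
}
```

With the code from the question, that would mean passing new TimedTask(this, this.delayMillis) to scheduleAtFixedRate in place of this. If the overrun count stays at zero while timestamps still arrive late, the delay is happening before run() is entered rather than inside it.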

As a fix, you could consider using scheduleWithFixedDelay instead, which will add a delay between each execution of the task, so long-running tasks don't run into each other. Failing that, then you need to ensure that the socket write completes on time, allowing each subsequent task execution to start on schedule.
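
To see the difference between the two scheduling modes, here is a small experiment, offered as a sketch rather than a definitive benchmark. The class RateVsDelay and its numbers (100 ms period, one artificial 350 ms stall) are made up for illustration. After the stall, the fixed-rate schedule runs the overdue executions back to back, which looks a lot like the "missing second, then four pings" pattern in the question, while the fixed-delay schedule keeps at least the full delay between runs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical experiment comparing start-time gaps after one stalled execution. */
class RateVsDelay {
    /** Returns the smallest gap (ms) between consecutive task starts over six runs. */
    static long minStartGapMillis(boolean fixedRate) {
        final List<Long> starts = Collections.synchronizedList(new ArrayList<Long>());
        final CountDownLatch done = new CountDownLatch(6);
        Runnable task = new Runnable() {
            @Override
            public void run() {
                boolean first = starts.isEmpty();
                starts.add(System.nanoTime());
                if (first) {
                    sleepQuietly(350); // simulate a single stalled execution
                }
                done.countDown();
            }
        };
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        if (fixedRate) {
            executor.scheduleAtFixedRate(task, 0, 100, TimeUnit.MILLISECONDS);
        } else {
            executor.scheduleWithFixedDelay(task, 0, 100, TimeUnit.MILLISECONDS);
        }
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        long min = Long.MAX_VALUE;
        synchronized (starts) {
            for (int i = 1; i < starts.size(); i++) {
                long gapNanos = starts.get(i) - starts.get(i - 1);
                min = Math.min(min, TimeUnit.NANOSECONDS.toMillis(gapNanos));
            }
        }
        return min;
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

On an otherwise idle machine I would expect minStartGapMillis(true) to come out close to zero (the catch-up runs) and minStartGapMillis(false) to be at least 100. Note that fixed delay only hides the bunching; it does not remove whatever caused the stall in the first place.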

The first step in diagnosing a liveness issue is usually taking a thread dump while the system is stalled and checking what the threads are doing. In your case, the executor threads would be of particular interest. Are they processing, or are they waiting for work?

If they are all processing, the executor service has run out of worker threads, and can only schedule new tasks once a current task has been completed. This may be caused by tasks temporarily taking longer to complete. The stack traces of the worker threads may yield a clue just what is taking longer.

If many worker threads are idle, you have found a bug in the JDK. Congratulations!
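
The usual way to take the dump is jstack with the process id from outside the JVM. Since your stalls only last a second or two, it may be easier to capture one from inside the app, for example from a watchdog that notices a late tick. The following is a minimal sketch using the standard ThreadMXBean API; the class name ThreadDumper is made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper that renders a thread dump of the current JVM as text. */
class ThreadDumper {
    static String dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock ownership details to keep the dump cheap.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

In the output, look at the pool's worker threads. A worker that is WAITING or TIMED_WAITING inside the executor's work queue is idle, while a RUNNABLE or BLOCKED worker with your own classes on its stack is busy with a task.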
