
Java FutureTask<> without using an ExecutorService?

Recently a use case came up where I had to kick off several blocking IO tasks at the same time and then consume them in sequence. I did not want to change the order of operations on the consumption side, and since this was a web app and these were short-lived tasks in the request path, I didn't want to bottleneck on a fixed thread pool. I was looking to mirror the .NET async/await coding style. FutureTask<> seemed ideal for this but requires an ExecutorService. This is an attempt to remove the need for one.

Order of operations:

  1. Kick off tasks
  2. Do some stuff
  3. Consume task 1
  4. Do some other stuff
  5. Consume task 2
  6. Finish up

    ...
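The intended flow can be pictured with plain FutureTask plus one thread per task. This is a minimal sketch of the pattern being described, not the final solution:

```java
import java.util.concurrent.FutureTask;

public class KickOffDemo {
    public static void main(String[] args) throws Exception {
        // 1. Kick off tasks: each FutureTask runs on its own thread
        FutureTask<Integer> task1 = new FutureTask<>(() -> 1);
        FutureTask<Integer> task2 = new FutureTask<>(() -> 2);
        new Thread(task1).start();
        new Thread(task2).start();

        // 2. Do some stuff...

        // 3. Consume task 1 (get() blocks only if it hasn't finished yet)
        int v1 = task1.get();

        // 4. Do some other stuff...

        // 5. Consume task 2
        int v2 = task2.get();

        // 6. Finish up
        System.out.println(v1 + v2);
    }
}
```

The consumption order stays fixed regardless of which task finishes first; an early-finishing task simply returns from get() immediately.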

I wanted to spawn a new thread for each FutureTask<> but simplify the thread management. After run() completes, the calling thread can be joined.

The solution I came up with was:

package com.staples.search.util;

import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

public class FutureWrapper<T> extends FutureTask<T> implements Future<T> {

    private Thread myThread;

    public FutureWrapper(Callable<T> callable) {
        super(callable);
        myThread = new Thread(this);
        myThread.start();
    }

    @Override
    public T get() {
        T val = null;
        try {
            val = super.get();
            myThread.join();
        }
        catch (Exception ex) {
            this.setException(ex);
        }
        return val;
    }
}

Here are a couple of JUnit tests I created to compare FutureWrapper to a CachedThreadPool.

@Test
public void testFutureWrapper() throws InterruptedException, ExecutionException {
    long startTime = System.currentTimeMillis();
    int numThreads = 2000;

    List<FutureWrapper<ValueHolder>> taskList = new ArrayList<FutureWrapper<ValueHolder>>();

    System.out.printf("FutureWrapper: Creating %d tasks\n", numThreads);

    for (int i = 0; i < numThreads; i++) {
        taskList.add(new FutureWrapper<ValueHolder>(new Callable<ValueHolder>() {
            public ValueHolder call() throws InterruptedException {
                int value = 500;
                Thread.sleep(value);
                return new ValueHolder(value);
            }
        }));
    }

    for (int i = 0; i < numThreads; i++) {
        FutureWrapper<ValueHolder> wrapper = taskList.get(i);
        ValueHolder v = wrapper.get();
    }

    System.out.printf("Test took %d ms\n", System.currentTimeMillis() - startTime);

    Assert.assertTrue(true);
}

@Test
public void testCachedThreadPool() throws InterruptedException, ExecutionException {
    long startTime = System.currentTimeMillis();
    int numThreads = 2000;

    List<Future<ValueHolder>> taskList = new ArrayList<Future<ValueHolder>>();
    ExecutorService esvc = Executors.newCachedThreadPool();

    System.out.printf("CachedThreadPool: Creating %d tasks\n", numThreads);

    for (int i = 0; i < numThreads; i++) {
        taskList.add(esvc.submit(new Callable<ValueHolder>() {
            public ValueHolder call() throws InterruptedException {
                int value = 500;
                Thread.sleep(value);
                return new ValueHolder(value);
            }
        }));
    }

    for (int i = 0; i < numThreads; i++) {
        Future<ValueHolder> wrapper = taskList.get(i);
        ValueHolder v = wrapper.get();
    }

    System.out.printf("Test took %d ms\n", System.currentTimeMillis() - startTime);

    Assert.assertTrue(true);
}

class ValueHolder {
    private int value;
    public ValueHolder(int val) { value = val; }
    public int getValue() { return value; }
    public void setValue(int val) { value = val; }
}

Repeated runs put the FutureWrapper at ~925 ms vs. ~935 ms for the CachedThreadPool. Both tests bump into OS thread limits.

Things seem to work, and the thread spawning is pretty fast (10k threads with random sleeps in ~4 s). Does anyone see something wrong with this implementation?

Creating and starting thousands of threads is usually a very bad idea: creating threads is expensive, and having more threads than processors brings no performance gain but causes thread context switches that consume CPU cycles instead. (See the notes at the very bottom.)

So in my opinion, your test code contains a big error in reasoning: you are simulating CPU load by calling Thread.sleep(500). But in fact, this does not really cause the CPU to do anything. It is possible to have many sleeping threads in parallel, no matter how many processors you have, but it is not possible to run more CPU-consuming tasks than processors in (real) parallel.

If you simulate real CPU load, you'll see that more threads just increase the overhead due to thread management but do not decrease the total processing time.


So let's compare different ways to run CPU-consuming tasks in parallel!

First, let's assume we've got some CPU-consuming task that always takes the same amount of time:

public Integer task() throws Exception {
    // do some computations here (e.g. Fibonacci, primes, cipher, ...)
    return 1;
}
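The body is deliberately left open above; one possible concrete filling is deterministic busy work that the JIT cannot eliminate. The exact computation here is an assumption for illustration, not the answerer's code:

```java
public class CpuTask {
    // A hedged, concrete stand-in for task(): burns a fixed number of
    // CPU cycles instead of sleeping, so it models real CPU load.
    static Integer task() {
        long x = 1;
        for (int i = 0; i < 2_000_000; i++) {
            // one step of a 64-bit linear congruential generator
            x = x * 6364136223846793005L + 1442695040888963407L;
        }
        // Depend on x so the loop is not dead-code-eliminated;
        // Long.bitCount is always >= 0, so this still returns 1,
        // matching the original contract.
        return Long.bitCount(x) >= 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(task()); // prints 1
    }
}
```

Unlike Thread.sleep, this keeps a core busy for the whole duration, so running more of these in parallel than you have cores cannot speed things up.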

Our goal is to run this task NUM_TASKS times using different execution strategies. For our tests, we set NUM_TASKS = 2000.

(1) Using a thread-per-task strategy

This strategy is very comparable to your approach, with the difference that it is not necessary to subclass FutureTask and fiddle around with threads. Instead, you can use FutureTask directly, as it is both a Runnable and a Future:

@Test
public void testFutureTask() throws InterruptedException, ExecutionException {
    List<RunnableFuture<Integer>> taskList = new ArrayList<RunnableFuture<Integer>>();

    // run NUM_TASKS FutureTasks in NUM_TASKS threads
    for (int i = 0; i < NUM_TASKS; i++) {
        RunnableFuture<Integer> rf = new FutureTask<Integer>(this::task);
        taskList.add(rf);
        new Thread(rf).start();
    }

    // now wait for all tasks
    int sum = 0;
    for (Future<Integer> future : taskList) {
        sum += future.get();
    }

    Assert.assertEquals(NUM_TASKS, sum);
}

Running this test with JUnitBenchmarks (10 test iterations + 5 warmup iterations) yields the following result:

ThreadPerformanceTest.testFutureTask: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.66 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 66, GC.time: 0.06, time.total: 10.59, time.warmup: 4.02, time.bench: 6.57

So one round (the execution time of method task()) is about 0.66 seconds.

(2) Using a thread-per-cpu strategy

This strategy uses a fixed number of threads to execute all tasks. Therefore, we create an ExecutorService via Executors.newFixedThreadPool(...). The number of threads should be equal to the number of CPUs (Runtime.getRuntime().availableProcessors()), which is 8 in my case.

To be able to track the results, we simply use a CompletionService. It automatically takes care of the results, no matter in which order they arrive.

@Test
public void testFixedThreadPool() throws InterruptedException, ExecutionException {
    ExecutorService exec = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    CompletionService<Integer> ecs = new ExecutorCompletionService<Integer>(exec);

    // submit NUM_TASKS tasks
    for (int i = 0; i < NUM_TASKS; i++) {
        ecs.submit(this::task);
    }

    // now wait for all tasks
    int sum = 0;
    for (int i = 0; i < NUM_TASKS; i++) {
        sum += ecs.take().get();
    }

    Assert.assertEquals(NUM_TASKS, sum);
}

Again we run this test with JUnitBenchmarks using the same settings. The results are:

ThreadPerformanceTest.testFixedThreadPool: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 0.41 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 22, GC.time: 0.04, time.total: 6.59, time.warmup: 2.53, time.bench: 4.05

Now one round takes only 0.41 seconds (almost 40% runtime reduction)! Also note the fewer GC calls.

(3) Sequential execution

For comparison, we should also measure the non-parallelized execution:

@Test
public void testSequential() throws Exception {
    int sum = 0;
    for (int i = 0; i < NUM_TASKS; i++) {
        sum += this.task();
    }

    Assert.assertEquals(NUM_TASKS, sum);
}

The results:

ThreadPerformanceTest.testSequential: [measured 10 out of 15 rounds, threads: 1 (sequential)]
 round: 1.50 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 244, GC.time: 0.15, time.total: 22.81, time.warmup: 7.77, time.bench: 15.04

Note that 1.5 seconds is for 2000 executions, so a single execution of task() takes 0.75 ms.

Interpretation

According to Amdahl's law, the time T(n) to execute an algorithm on n processors is:

T(n) = T(1) * (B + (1 - B) / n)

B is the fraction of the algorithm that cannot be parallelized and must run sequentially. For pure sequential algorithms, B is 1; for pure parallel algorithms it would be 0 (but this is not possible, as there is always some sequential overhead).

T(1) can be taken from our sequential execution: T(1) = 1.5 s.

If we had no overhead (B = 0), on 8 CPUs we'd get: T(8) = 1.5 / 8 = 0.1875 s.

But we do have overhead! So let's compute B for our two strategies:

  • B(thread-per-task) = 0.36
  • B(thread-per-cpu) = 0.17

In other words: the thread-per-task strategy has twice the overhead!

Finally, let's compute the speedup S(n). That's the factor by which an algorithm runs faster on n CPUs compared to sequential execution (S(1) = 1):

S(n) = 1 / (B + (1 - B) / n)

Applied to our two strategies, we get:

  • thread-per-task: S(8) = 2.27
  • thread-per-cpu: S(8) = 3.66
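The numbers above can be reproduced by solving Amdahl's law for B, giving B = (T(n)/T(1) - 1/n) / (1 - 1/n), and plugging in the measured round times (n = 8, T(1) = 1.50 s). A small sketch:

```java
public class AmdahlEstimate {
    // Solve T(n) = T(1) * (B + (1 - B) / n) for the sequential fraction B.
    static double sequentialFraction(double t1, double tn, int n) {
        return (tn / t1 - 1.0 / n) / (1.0 - 1.0 / n);
    }

    // Speedup S(n) = 1 / (B + (1 - B) / n)
    static double speedup(double b, int n) {
        return 1.0 / (b + (1.0 - b) / n);
    }

    public static void main(String[] args) {
        int n = 8;
        double t1 = 1.50; // measured sequential round time in seconds

        double bTask = sequentialFraction(t1, 0.66, n); // thread-per-task round
        double bCpu  = sequentialFraction(t1, 0.41, n); // thread-per-cpu round

        System.out.printf("B(thread-per-task) = %.2f, S(8) = %.2f%n",
                bTask, speedup(bTask, n)); // ~0.36 and ~2.27
        System.out.printf("B(thread-per-cpu)  = %.2f, S(8) = %.2f%n",
                bCpu, speedup(bCpu, n));   // ~0.17 and ~3.66
    }
}
```

Note that S(8) also equals T(1)/T(8) directly (1.5/0.66 = 2.27, 1.5/0.41 = 3.66), which cross-checks the derived B values.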

So the thread-per-cpu strategy yields about 60% more speedup than thread-per-task.

TODO

We should also measure and compare memory consumption.


Note: all of this is only true for CPU-consuming tasks. If instead your tasks perform lots of I/O-related work, you might benefit from having more threads than CPUs, as waiting for I/O puts a thread into idle mode, so the CPU can execute another thread meanwhile. But even in this case, there is a reasonable upper limit, which is usually far below 2000 on a PC.
