
ExecutorService with huge number of tasks

I have a list of files and a list of analyzers that analyze those files. The number of files can be large (200,000) and the number of analyzers can be large as well (1,000), so the total number of operations can be really large (200,000,000). Now, I need to apply multithreading to speed things up. I followed this approach:

// One task per (file, analyzer) pair: with 200,000 files and 1,000 analyzers,
// up to 200,000,000 Runnables accumulate in the pool's unbounded queue.
ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
for (File file : listOfFiles) {
  for (Analyzer analyzer : listOfAnalyzers) {
    executor.execute(() -> {
      if (file.exists()) {
        analyzer.analyze(file);
      }
    });
  }
}
executor.shutdown();
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);

But the problem with this approach is that it uses too much memory, and I guess there is a better way to do it. I'm still a beginner at Java and multithreading.

Where are 200M tasks going to reside? Not in memory, I hope, unless you plan to implement your solution in a distributed fashion. In the meantime, you need to instantiate an ExecutorService that does not accumulate a massive queue. Create the service with a bounded queue and the "caller runs" rejection policy (see here). If you try to put another task in the queue when it's already full, you'll end up executing it yourself, which is probably what you want.
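A minimal sketch of such a service (the queue capacity of 10,000 is illustrative, not a recommendation):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

int nThreads = Runtime.getRuntime().availableProcessors();
// Bounded queue: once it fills up, CallerRunsPolicy makes the submitting
// thread run the task itself, which throttles submission instead of
// letting millions of Runnables pile up in memory.
ExecutorService executor = new ThreadPoolExecutor(
        nThreads, nThreads,
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(10_000),
        new ThreadPoolExecutor.CallerRunsPolicy());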

OTOH, now that I look at your question more conscientiously, why not analyze a single file concurrently? Then the queue is never larger than the number of analyzers. That's what I'd do, frankly, since I'd like a readable log that has a message for each file as I load it, in the correct order.

I apologize for not being more helpful:

List<Future<?>> futures = analyzers.stream()
        .map(analyzer -> executor.submit(() -> analyzer.analyze(file)))
        .collect(Collectors.toList());
for (Future<?> future : futures) {
    future.get(); // wait for every analyzer to finish with this file
}

Basically, create a bunch of futures for a single file, then wait for all of them before you move on.
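Putting that together, a sketch of the whole per-file loop (listOfFiles and listOfAnalyzers are the names from the question; java.util.ArrayList, java.util.List, and java.util.concurrent.Future are assumed imported, and the enclosing method is assumed to declare throws InterruptedException, ExecutionException):

for (File file : listOfFiles) {
  if (!file.exists()) {
    continue; // skip missing files up front
  }
  // At most one pending task per analyzer, so the queue stays small.
  List<Future<?>> futures = new ArrayList<>();
  for (Analyzer analyzer : listOfAnalyzers) {
    futures.add(executor.submit(() -> analyzer.analyze(file)));
  }
  for (Future<?> future : futures) {
    future.get(); // block until every analyzer is done with this file
  }
}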

One idea is to employ the fork/join algorithm and group items (files) into batches in order to process them individually.

My suggestion is the following:

  1. Firstly, filter out all files that do not exist - they occupy resources unnecessarily.
  2. The following pseudo-code demonstrates the algorithm that might help you out:

public static class CustomRecursiveTask extends RecursiveTask<Integer> {
    private final Analyzer[] analyzers;
    private final int threshold;
    private final File[] files;
    private final int start;
    private final int end;

    public CustomRecursiveTask(Analyzer[] analyzers, final int threshold,
                               File[] files, int start, int end) {
        this.analyzers = analyzers;
        this.threshold = threshold;
        this.files = files;
        this.start = start;
        this.end = end;
    }

    @Override
    protected Integer compute() {
        final int filesToProcess = end - start;
        if (filesToProcess <= threshold) {
            return processSequentially();
        } else {
            final int middle = (start + end) / 2;
            // Split [start, end) into [start, middle) and [middle, end)
            // so that no index is skipped.
            final ForkJoinTask<Integer> left =
                    new CustomRecursiveTask(analyzers, threshold, files, start, middle);
            final ForkJoinTask<Integer> right =
                    new CustomRecursiveTask(analyzers, threshold, files, middle, end);
            left.fork();
            right.fork();
            return left.join() + right.join();
        }
    }

    private Integer processSequentially() {
        for (int i = start; i < end; i++) {
            File file = files[i];
            for (Analyzer analyzer : analyzers) {
                analyzer.analyze(file);
            }
        }
        return end - start; // number of files processed in this leaf
    }
}

And the usage looks like this:

public static void main(String[] args) {
    final Analyzer[] analyzers = new Analyzer[]{};
    final File[] files = new File[]{};

    // Guard against a zero threshold, which would make compute() recurse forever.
    final int threshold = Math.max(1, files.length / 5);

    // invoke() blocks until the whole task tree completes and returns the result;
    // execute() would be fire-and-forget, and main could exit before the work is done.
    final int processed = ForkJoinPool.commonPool().invoke(
            new CustomRecursiveTask(
                    analyzers,
                    threshold,
                    files,
                    0,
                    files.length
            )
    );
    System.out.println("Files processed: " + processed);
}

Notice that depending on your constraints you can manipulate the task's constructor arguments so that the algorithm adjusts to the number of files.

You could specify different thresholds depending on, say, the number of files:

final int threshold;
if(files.length > 100_000) {
   threshold = files.length / 4;
} else {
   threshold = files.length / 8;
}

You could also specify the number of worker threads in the ForkJoinPool depending on the input size.
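For instance, a dedicated pool with an explicit parallelism level (the sizing rule below is illustrative, and the analyzers/threshold/files names are reused from the example above):

int cores = Runtime.getRuntime().availableProcessors();
// Hypothetical rule: use all cores for large inputs, fewer for small ones.
int parallelism = files.length > 100_000 ? cores : Math.max(2, cores / 2);
ForkJoinPool pool = new ForkJoinPool(parallelism);
int processed = pool.invoke(new CustomRecursiveTask(analyzers, threshold, files, 0, files.length));
pool.shutdown();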

Measure, adjust, and modify; you will solve the problem eventually.

Hope that helps.

UPDATE:

If the result of the analysis is of no interest, you could replace RecursiveTask with RecursiveAction, since returning an Integer in the pseudo-code above adds auto-boxing overhead.
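A sketch of that variant, with the same fields and constructor as CustomRecursiveTask and a void compute():

public static class CustomRecursiveAction extends RecursiveAction {
    private final Analyzer[] analyzers;
    private final int threshold;
    private final File[] files;
    private final int start;
    private final int end;

    public CustomRecursiveAction(Analyzer[] analyzers, int threshold,
                                 File[] files, int start, int end) {
        this.analyzers = analyzers;
        this.threshold = threshold;
        this.files = files;
        this.start = start;
        this.end = end;
    }

    @Override
    protected void compute() { // void: no Integer result, so no auto-boxing
        if (end - start <= threshold) {
            for (int i = start; i < end; i++) {
                for (Analyzer analyzer : analyzers) {
                    analyzer.analyze(files[i]);
                }
            }
        } else {
            int middle = (start + end) / 2;
            // invokeAll forks both halves and waits for them to complete.
            invokeAll(new CustomRecursiveAction(analyzers, threshold, files, start, middle),
                      new CustomRecursiveAction(analyzers, threshold, files, middle, end));
        }
    }
}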
