
Reading and writing multiple files in parallel

I need to write a program in Java which will read a relatively large number (~50,000) of files in a directory tree, process the data, and output the processed data in a separate (flat) directory.

Currently I have something like this:

private void crawlDirectoryAndProcessFiles(File directory) {
  for (File file : directory.listFiles()) {
    if (file.isDirectory()) {
      crawlDirectoryAndProcessFiles(file);
    } else {
      Data d = readFile(file);
      ProcessedData p = d.process();
      writeFile(p,file.getAbsolutePath(),outputDir);
    }
  }
}

Suffice it to say that each of those methods has been trimmed down for ease of reading, but they all work fine. The whole process works fine, except that it is slow. The processing of data occurs via a remote service and takes between 5 and 15 seconds. Multiply that by 50,000...

I've never done anything multi-threaded before, but I figure I can get some pretty good speed increases if I do. Can anyone give some pointers on how I can effectively parallelise this method?

I would use a ThreadPoolExecutor to manage the threads. You can do something like this:

private class Processor implements Runnable {
    private final File file;

    public Processor(File file) {
        this.file = file;
    }

    @Override
    public void run() {
        Data d = readFile(file);
        ProcessedData p = d.process();
        writeFile(p,file.getAbsolutePath(),outputDir);
    }
}

private void crawlDirectoryAndProcessFiles(File directory, Executor executor) {
    for (File file : directory.listFiles()) {
        if (file.isDirectory()) {
            crawlDirectoryAndProcessFiles(file,executor);
        } else {
            executor.execute(new Processor(file));
        }
    }
}

You would obtain an Executor using:

ExecutorService executor = Executors.newFixedThreadPool(poolSize);

where poolSize is the maximum number of threads you want running at once. (It's important to have a reasonable number here; 50,000 threads isn't exactly a good idea. A reasonable number might be 8.) Note that after you've queued all the files, your main thread can wait until everything is done by calling executor.shutdown() followed by executor.awaitTermination. The shutdown() call is needed first, because awaitTermination only reports completion once a shutdown has been requested.
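As a rough illustration of the queue-then-wait pattern described above, here is a minimal sketch. The class name PoolShutdownSketch and the method runAll are made up for this example, the jobs stand in for the Processor instances, and the one-hour timeout is an arbitrary assumption rather than a recommendation.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the original answer.
class PoolShutdownSketch {

    /** Runs every job on a fixed-size pool, blocks until the pool has drained,
     *  and returns how many jobs ran to completion. */
    static int runAll(List<Runnable> jobs, int poolSize) {
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        AtomicInteger completed = new AtomicInteger();
        for (Runnable job : jobs) {
            executor.execute(() -> {
                job.run();
                completed.incrementAndGet(); // skipped if job.run() throws
            });
        }
        // Stop accepting new tasks; tasks already queued still run.
        executor.shutdown();
        try {
            // Returns true once all tasks have finished, false on timeout.
            // The timeout value here is an assumption for the sketch.
            if (!executor.awaitTermination(1, TimeUnit.HOURS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return completed.get();
    }
}
```

In the crawler above, the equivalent would be to call shutdown() and awaitTermination on the executor once crawlDirectoryAndProcessFiles has returned.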

Assuming you have a single hard disk (i.e. something that only allows a single simultaneous read operation, not an SSD, RAID array, network file system, etc.), then you only want one thread performing IO (reading from/writing to the disk). Also, you only want as many threads doing CPU-bound operations as you have cores, otherwise time will be wasted in context switching.

Given the above restrictions, the code below should work for you. The single-threaded executor ensures that only one IO Runnable executes at any one time. The fixed thread pool ensures no more than NUM_CPUS processing Runnables are executing at any one time.

One thing this does not do is provide feedback on when processing has finished.

private final static int NUM_CPUS = 4;

private final Executor _fileReaderWriter = Executors.newSingleThreadExecutor();
private final Executor _fileProcessor = Executors.newFixedThreadPool(NUM_CPUS);

private final class Data {}
private final class ProcessedData {}

private final class FileReader implements Runnable
{
  private final File _file;
  FileReader(final File file) { _file = file; }
  @Override public void run() 
  { 
    final Data data = readFile(_file);
    _fileProcessor.execute(new FileProcessor(_file, data));
  }

  private Data readFile(File file) { /* ... */ return null; }    
}

private final class FileProcessor implements Runnable
{
  private final File _file;
  private final Data _data;
  FileProcessor(final File file, final Data data) { _file = file; _data = data; }
  @Override public void run() 
  { 
    final ProcessedData processedData = processData(_data);
    _fileReaderWriter.execute(new FileWriter(_file, processedData));
  }

  private ProcessedData processData(final Data data) { /* ... */ return null; }
}

private final class FileWriter implements Runnable
{
  private final File _file;
  private final ProcessedData _data;
  FileWriter(final File file, final ProcessedData data) { _file = file; _data = data; }
  @Override public void run() 
  { 
    writeFile(_file, _data);
  }

  private void writeFile(final File file, final ProcessedData data) { /* ... */ }
}

public void process(final File file)   
{ 
  if (file.isDirectory())
  {
    for (final File subFile : file.listFiles())
      process(subFile);
  }
  else
  {
    _fileReaderWriter.execute(new FileReader(file));
  }
}
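If you do need to know when everything has finished, one possible approach is to build the same read, process, write stages out of CompletableFuture instead of hand-written Runnables. This is a sketch under assumptions, not the code from the answer above: StagedPipelineSketch and runPipeline are hypothetical names, the three function parameters stand in for readFile, processData and writeFile, and the core count is taken from Runtime.availableProcessors() rather than a hard-coded NUM_CPUS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch; the names here do not come from the original answer.
class StagedPipelineSketch {

    /** Reads and writes on a single IO thread, processes on a pool sized to the
     *  number of cores, and returns only after every input has been written. */
    static <I, D, P> void runPipeline(List<I> inputs,
                                      Function<I, D> read,
                                      Function<D, P> process,
                                      Consumer<P> write) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        ExecutorService cpu = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (I input : inputs) {
                pending.add(CompletableFuture
                        .supplyAsync(() -> read.apply(input), io) // stage 1: IO thread
                        .thenApplyAsync(process, cpu)             // stage 2: CPU pool
                        .thenAcceptAsync(write, io));             // stage 3: IO thread
            }
            // Blocks until all chains are done; a failure in any stage
            // surfaces here as a CompletionException.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        } finally {
            io.shutdown();
            cpu.shutdown();
        }
    }
}
```

Since the processing step in the question is a call to a remote service rather than CPU-bound work, a pool larger than the core count may well be appropriate for that stage; that would be something to measure.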

The easiest (and probably one of the most reasonable) ways is to use a thread pool (take a look at the corresponding Executor classes). The main thread is responsible for crawling the directory. When a file is encountered, create a "Job" (which is a Runnable/Callable) and let the Executor handle the job.

(This should be sufficient for you to get started. I prefer not to give too much concrete code, because it should not be difficult to figure out once you have read up on Executor, Callable, etc.)
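For readers who would still like something concrete, here is one way the "main thread crawls, Executor runs Jobs" idea could look with Callable and Future. Everything named here (CrawlJobsSketch, job, crawl) is hypothetical, and the "processing" is just upper-casing the file contents as a placeholder for the real remote call. The sketch also assumes file names are unique across the tree, since the output directory is flat.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the "Job" idea; not code from the original answer.
class CrawlJobsSketch {

    /** One Job: read a file, process it (placeholder: upper-case), write the result. */
    static Callable<Path> job(Path input, Path outputDir) {
        return () -> {
            String data = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
            String processed = data.toUpperCase(); // stand-in for the remote service call
            // Assumes unique file names; a real version needs a collision-free naming scheme.
            Path target = outputDir.resolve(input.getFileName());
            Files.write(target, processed.getBytes(StandardCharsets.UTF_8));
            return target;
        };
    }

    /** The calling thread crawls the tree, submits one Job per file, then waits for all of them. */
    static List<Path> crawl(Path root, Path outputDir, int poolSize)
            throws IOException, InterruptedException, ExecutionException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Path>> futures = new ArrayList<>();
            for (Path file : files) {
                futures.add(executor.submit(job(file, outputDir)));
            }
            List<Path> written = new ArrayList<>();
            for (Future<Path> future : futures) {
                written.add(future.get()); // blocks; a failed Job throws ExecutionException
            }
            return written;
        } finally {
            executor.shutdown();
        }
    }
}
```

Using Callable rather than Runnable lets each Job hand back a result (here, the path it wrote) and lets failures reach the caller through Future.get().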


 