Reading and writing multiple files in parallel

I need to write a program in Java which will read a relatively large number (~50,000) of files in a directory tree, process the data, and output the processed data to a separate (flat) directory.

Currently I have something like this:

private void crawlDirectoryAndProcessFiles(File directory) {
  for (File file : directory.listFiles()) {
    if (file.isDirectory()) {
      crawlDirectoryAndProcessFiles(file);
    } else {
      Data d = readFile(file);
      ProcessedData p = d.process();
      writeFile(p, file.getAbsolutePath(), outputDir);
    }
  }
}

Each of those methods has been trimmed down here for ease of reading, but they all work fine. The whole process works, except that it is slow. The processing of data occurs via a remote service and takes between 5 and 15 seconds per file. Multiply that by 50,000...

I've never done anything multi-threaded before, but I figure I can get some pretty good speed increases if I do. Can anyone give some pointers how I can effectively parallelise this method?

I would use a ThreadPoolExecutor to manage the threads. You can do something like this:

private class Processor implements Runnable {
    private final File file;

    public Processor(File file) {
        this.file = file;
    }

    @Override
    public void run() {
        Data d = readFile(file);
        ProcessedData p = d.process();
        writeFile(p,file.getAbsolutePath(),outputDir);
    }
}

private void crawlDirectoryAndProcessFiles(File directory, Executor executor) {
    for (File file : directory.listFiles()) {
        if (file.isDirectory()) {
            crawlDirectoryAndProcessFiles(file, executor);
        } else {
            // Queue the file; a pool thread picks it up when one is free
            executor.execute(new Processor(file));
        }
    }
}

You would obtain an Executor using:

ExecutorService executor = Executors.newFixedThreadPool(poolSize);

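where poolSize is the maximum number of threads you want running at once. (It's important to pick a reasonable number here; 50,000 threads is not a good idea. A reasonable number might be 8.) Note that after you've queued all the files, your main thread can wait until things are done by calling executor.shutdown() followed by executor.awaitTermination(...). awaitTermination only returns early once a shutdown has been requested, and both methods are declared on ExecutorService rather than on the plain Executor interface.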
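As a rough sketch of that shutdown-then-wait sequence: the class name AwaitDemo, the runAndWait helper, and the one-hour timeout are made up for illustration, and a counter increment stands in for new Processor(file).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class AwaitDemo {
    /** Queues taskCount trivial tasks, waits for all of them, and returns how many ran. */
    static int runAndWait(int taskCount, int poolSize) {
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            // Hypothetical stand-in for executor.execute(new Processor(file))
            executor.execute(completed::incrementAndGet);
        }
        executor.shutdown(); // no new tasks accepted; already queued tasks still run
        try {
            // Assumed timeout; choose one that suits the real workload
            if (!executor.awaitTermination(1, TimeUnit.HOURS)) {
                executor.shutdownNow(); // timed out: interrupt tasks still running
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndWait(100, 8) + " tasks completed");
    }
}
```

Without the shutdown() call, awaitTermination would simply block until its timeout expired, because the pool's idle threads stay alive waiting for more work.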

Assuming you have a single hard disk (i.e. a device that services only one read at a time, as opposed to an SSD, a RAID array, a network file system, etc.), you only want one thread performing I/O (reading from or writing to the disk). Also, you only want as many threads doing CPU-bound operations as you have cores; otherwise time will be wasted on context switching.

Given the above restrictions, the code below should work for you. The single-threaded executor ensures that only one I/O Runnable (a read or a write) executes at any one time. The fixed thread pool ensures that no more than NUM_CPUS processing Runnables execute at any one time.

One thing this does not do is provide feedback on when processing has finished.

private final static int NUM_CPUS = 4;

private final Executor _fileReaderWriter = Executors.newSingleThreadExecutor();
private final Executor _fileProcessor = Executors.newFixedThreadPool(NUM_CPUS);

private final class Data {}
private final class ProcessedData {}

private final class FileReader implements Runnable
{
  private final File _file;
  FileReader(final File file) { _file = file; }
  @Override public void run() 
  { 
    final Data data = readFile(_file);
    _fileProcessor.execute(new FileProcessor(_file, data));
  }

  private Data readFile(File file) { /* ... */ return null; }    
}

private final class FileProcessor implements Runnable
{
  private final File _file;
  private final Data _data;
  FileProcessor(final File file, final Data data) { _file = file; _data = data; }
  @Override public void run() 
  { 
    final ProcessedData processedData = processData(_data);
    _fileReaderWriter.execute(new FileWriter(_file, processedData));
  }

  private ProcessedData processData(final Data data) { /* ... */ return null; }
}

private final class FileWriter implements Runnable
{
  private final File _file;
  private final ProcessedData _data;
  FileWriter(final File file, final ProcessedData data) { _file = file; _data = data; }
  @Override public void run() 
  { 
    writeFile(_file, _data);
  }

  // Writing produces no result, so the method returns void
  private void writeFile(final File file, final ProcessedData data) { /* ... */ }
}

public void process(final File file)   
{ 
  if (file.isDirectory())
  {
    for (final File subFile : file.listFiles())
      process(subFile);
  }
  else
  {
    _fileReaderWriter.execute(new FileReader(file));
  }
}
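The code above gives no signal once the last file has been written. One possible way to get that signal is to swap the hand-written Runnable chain for CompletableFuture, which is a different technique from the one used in this answer. In the sketch below, PipelineDemo and runPipeline are hypothetical names, and simple string operations stand in for readFile, processData and writeFile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class PipelineDemo {
    /** Runs read -> process -> write for each input and returns after every write is done. */
    static List<String> runPipeline(List<String> inputs, int cpuThreads) {
        ExecutorService io = Executors.newSingleThreadExecutor();       // all disk access
        ExecutorService cpu = Executors.newFixedThreadPool(cpuThreads); // processing only
        ConcurrentLinkedQueue<String> written = new ConcurrentLinkedQueue<>();
        try {
            List<CompletableFuture<Void>> chains = new ArrayList<>();
            for (String input : inputs) {
                chains.add(CompletableFuture
                        .supplyAsync(() -> "read:" + input, io)          // placeholder for readFile
                        .thenApplyAsync(data -> data.toUpperCase(), cpu) // placeholder for processData
                        .thenAcceptAsync(written::add, io));             // placeholder for writeFile
            }
            // Blocks until every chain has completed; throws CompletionException if any failed
            CompletableFuture.allOf(chains.toArray(new CompletableFuture[0])).join();
        } finally {
            io.shutdown();
            cpu.shutdown();
        }
        return new ArrayList<>(written);
    }

    public static void main(String[] args) {
        List<String> inputs = new ArrayList<>();
        inputs.add("a");
        inputs.add("b");
        System.out.println(runPipeline(inputs, 4).size() + " items written");
    }
}
```

The two executors play the same roles as _fileReaderWriter and _fileProcessor, so the one-I/O-thread and bounded-CPU-thread restrictions still hold.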

The easiest (and probably one of the most reasonable) ways is to use a thread pool (take a look at the corresponding Executor classes). The main thread is responsible for crawling the directory. When it encounters a file, it creates a "Job" (a Runnable or Callable) and lets the Executor handle that job.

(This should be enough to get you started. I'd rather not give too much concrete code, because it should not be hard to work out once you have read up on Executor, Callable, etc.)
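For readers who want a starting point anyway, here is a minimal sketch of the idea, under several assumptions: CrawlJobs, collectJobs and runAll are hypothetical names, and each job merely returns the file's name where the real read/process/write work would go.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CrawlJobs {
    /** Main-thread crawl: adds one job per non-directory entry found under node. */
    static void collectJobs(File node, List<Callable<String>> jobs) {
        if (node.isDirectory()) {
            File[] children = node.listFiles(); // null if the directory cannot be read
            if (children == null) return;
            for (File child : children) collectJobs(child, jobs);
        } else {
            jobs.add(() -> node.getName()); // placeholder for read, process, write
        }
    }

    /** Runs every job on a fixed pool and returns the results in submission order. */
    static List<String> runAll(File root, int poolSize) {
        List<Callable<String>> jobs = new ArrayList<>();
        collectJobs(root, jobs);
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every job has finished
            for (Future<String> done : pool.invokeAll(jobs)) {
                results.add(done.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(new File("."), 8).size() + " files handled");
    }
}
```

Using Callable instead of Runnable lets each job hand a result (or an exception) back to the main thread through its Future.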
