Hashing (SHA-1) multiple files concurrently using threads

I have N big files (at least 250 MB each) to hash. Those files are on P physical drives.

I'd like to hash them concurrently with at most K active threads, but I cannot hash more than M files per physical drive at once because that slows down the whole process (I ran a test hashing 61 files, and 8 threads were slower than 1 thread; almost all of the files were on the same disk).

I am wondering what the best approach to this would be:

  • I could use an Executors.newFixedThreadPool(K)
  • then I would submit the tasks, using a per-drive counter to decide whether a new task may be added.

My code would be:

int K = 8;
int M = 1;
Queue<Path> queue = null; // get the files to hash
final ExecutorService newFixedThreadPool = Executors.newFixedThreadPool(K);
final ConcurrentMap<FileStore, Integer> counter = new ConcurrentHashMap<>();
final ConcurrentMap<FileStore, Integer> maxCounter = new ConcurrentHashMap<>();
for (FileStore store : FileSystems.getDefault().getFileStores()) {
  counter.put(store, 0);
  maxCounter.put(store, M);
}
List<Future<Result>> result = new ArrayList<>();
while (!queue.isEmpty()) {
  final Path current = queue.poll();
  final FileStore store = Files.getFileStore(current);
  if (counter.get(store) < maxCounter.get(store)) {
    result.add(newFixedThreadPool.submit(new Callable<Result>() {

      @Override
      public Result call() throws Exception {
        // Not atomic: the check above and these get/put pairs can race.
        counter.put(store, counter.get(store) + 1);
        String hash = null; // Hash the file
        counter.put(store, counter.get(store) - 1);
        return new Result(current, hash);
      }

    }));
  } else queue.offer(current); // drive is busy: re-queue the file and try again
}

Setting aside the operations that are not thread-safe (such as how I update counter), is there a better way to achieve my goal?

I also think the loop is wasteful: while every drive is at its limit, it keeps polling and re-queuing files, which is close to a busy-wait loop and may burn a lot of CPU.
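
For reference, the "// Hash the file" placeholder in the task could be filled in roughly as follows. This is only a sketch using the standard MessageDigest API; the class name Sha1Files, the method name sha1Hex and the 64 KiB buffer size are assumptions made for illustration and do not come from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha1Files {

  /** Streams the file through SHA-1 in 64 KiB chunks and returns the digest as lowercase hex. */
  static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("SHA-1");
    byte[] buffer = new byte[64 * 1024]; // hypothetical buffer size, tune as needed
    try (InputStream in = Files.newInputStream(file)) {
      int read;
      while ((read = in.read(buffer)) != -1) {
        digest.update(buffer, 0, read); // only hash the bytes actually read
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    // Small self-check on a temporary file containing the ASCII text "abc".
    Path tmp = Files.createTempFile("sha1-demo", ".txt");
    Files.write(tmp, "abc".getBytes(StandardCharsets.US_ASCII));
    System.out.println(sha1Hex(tmp)); // a9993e364706816aba3e25717850c26c9cd0d89d
    Files.delete(tmp);
  }
}
```

Because the file is streamed, memory use stays constant regardless of file size, which matters for files of several hundred megabytes.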

After much time, I found a solution that meets my need: instead of an integer counter, an AtomicInteger or the like, I use an ExecutorService, and each submitted task acquires a Semaphore that is shared by all the files on one drive.

Like this:

ConcurrentMap<FileStore, Semaphore> map = new ConcurrentHashMap<>();
ExecutorService es = Executors.newFixedThreadPool(10);
for (Path path : listFile()) {
  final FileStore store = Files.getFileStore(path);
  final Semaphore semaphore = map.computeIfAbsent(store, key -> new Semaphore(getAllocatedCredits(store)));
  final int cost = computeCost(path);
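  es.submit(() -> {
    semaphore.acquire(cost); // blocks until the drive has enough free credits
    try {
      ... some work ...
    } finally {
      semaphore.release(cost);
    }
    return null; // makes the lambda a Callable, so acquire() may throw InterruptedException
  });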
}


int getAllocatedCredits(FileStore store) {return 2;}
int computeCost(Path path) {return 1;}

Note how Java 8 helps here, in particular computeIfAbsent and the lambda passed to submit.
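
The same idea can be written as a small self-contained program. The sketch below is an assumption-laden illustration, not the code actually used: the class PerStoreLimiter, the method run and the peak counter are hypothetical names, reading the file size stands in for the real hashing work, and every task costs one permit. The peak counter is there only to show that no FileStore ever has more than perStore tasks inside the guarded section at once.

```java
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class PerStoreLimiter {

  /**
   * Processes every file with at most `threads` workers and at most `perStore`
   * concurrent tasks per FileStore. Returns the highest number of tasks that
   * were observed inside the guarded section of a single store at the same time.
   */
  static int run(List<Path> files, int threads, int perStore) throws Exception {
    ConcurrentMap<FileStore, Semaphore> permits = new ConcurrentHashMap<>();
    ConcurrentMap<FileStore, AtomicInteger> active = new ConcurrentHashMap<>();
    AtomicInteger peak = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Long>> futures = new ArrayList<>();
      for (Path file : files) {
        FileStore store = Files.getFileStore(file);
        Semaphore semaphore = permits.computeIfAbsent(store, s -> new Semaphore(perStore));
        AtomicInteger running = active.computeIfAbsent(store, s -> new AtomicInteger());
        futures.add(pool.submit(() -> {
          semaphore.acquire(); // wait until this store has a free permit
          try {
            int now = running.incrementAndGet();
            peak.accumulateAndGet(now, Math::max); // remember the highest count seen
            return Files.size(file); // stand-in for the real hashing work
          } finally {
            running.decrementAndGet();
            semaphore.release();
          }
        }));
      }
      for (Future<Long> f : futures) {
        f.get(); // propagate any failure from the tasks
      }
    } finally {
      pool.shutdown();
    }
    return peak.get();
  }

  public static void main(String[] args) throws Exception {
    List<Path> files = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      files.add(Files.createTempFile("limiter-demo", ".bin"));
    }
    System.out.println("peak per-store concurrency: " + run(files, 4, 2));
    for (Path f : files) {
      Files.delete(f);
    }
  }
}
```

One trade-off of this design is worth keeping in mind: a task that is waiting for a permit still occupies a pool thread, so if many files from one busy drive are submitted in a row, they can hold threads that files on idle drives could have used. Interleaving the submission order across drives may reduce that effect.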

If the drive hardware configuration is not known at compile time and may be changed or upgraded, it's tempting to use one thread pool per drive and make the thread counts user-configurable. I am not familiar with newFixedThreadPool: is the thread count a property that can be changed at run time to optimize performance?
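
On the run-time question: the ExecutorService interface returned by newFixedThreadPool has no method for changing the thread count, but ThreadPoolExecutor does (setCorePoolSize and setMaximumPoolSize). A minimal sketch, assuming the pool is built directly as a ThreadPoolExecutor with the same arguments newFixedThreadPool uses; the class ResizablePool and its methods are hypothetical names for illustration.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ResizablePool {

  /** Builds a pool with the same shape as Executors.newFixedThreadPool(n). */
  static ThreadPoolExecutor fixed(int n) {
    return new ThreadPoolExecutor(
        n, n, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
  }

  /** Changes the thread count of a running fixed-size pool to n. */
  static void resize(ThreadPoolExecutor pool, int n) {
    if (n > pool.getMaximumPoolSize()) {
      // Growing: raise the maximum first so core never exceeds it.
      pool.setMaximumPoolSize(n);
      pool.setCorePoolSize(n);
    } else {
      // Shrinking (or unchanged): lower the core size first so maximum never drops below it.
      pool.setCorePoolSize(n);
      pool.setMaximumPoolSize(n);
    }
  }

  public static void main(String[] args) {
    ThreadPoolExecutor pool = fixed(2);
    resize(pool, 5);
    System.out.println(pool.getCorePoolSize() + " / " + pool.getMaximumPoolSize());
    resize(pool, 1);
    System.out.println(pool.getCorePoolSize() + " / " + pool.getMaximumPoolSize());
    pool.shutdown();
  }
}
```

The order of the two setters matters because each one rejects a value that would leave the core size above the maximum. When shrinking, excess threads are not killed mid-task; they are retired as they become idle. With one such pool per drive, the per-drive thread count could be read from user configuration and adjusted without restarting.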
