I have N big files (each at least 250 MB) to hash. These files are spread across P physical drives.
I'd like to hash them concurrently with at most K active threads, but I cannot hash more than M files per physical drive at a time because that slows down the whole process (I ran a test hashing 61 files, and with 8 threads it was slower than with 1 thread; the files were almost all on the same disk).
I am wondering what the best approach to this would be.
My code would be:
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

int K = 8; // maximum number of active threads
int M = 1; // maximum number of files hashed at once per drive
Queue<Path> queue = null; // get the files to hash
final ExecutorService newFixedThreadPool = Executors.newFixedThreadPool(K);
final ConcurrentMap<FileStore, Integer> counter = new ConcurrentHashMap<>();
final ConcurrentMap<FileStore, Integer> maxCounter = new ConcurrentHashMap<>();
for (FileStore store : FileSystems.getDefault().getFileStores()) {
    counter.put(store, 0);
    maxCounter.put(store, M);
}
List<Future<Result>> result = new ArrayList<>();
while (!queue.isEmpty()) {
    final Path current = queue.poll();
    final FileStore store = Files.getFileStore(current);
    if (counter.get(store) < maxCounter.get(store)) {
        result.add(newFixedThreadPool.submit(new Callable<Result>() {
            @Override
            public Result call() throws Exception {
                // Not atomic: get followed by put can lose updates
                counter.put(store, counter.get(store) + 1);
                String hash = null; // Hash the file
                counter.put(store, counter.get(store) - 1);
                return new Result(current, hash);
            }
        }));
    } else queue.offer(current); // drive is busy: put the file back and retry
}
Setting aside the operations that are not thread-safe (such as how I update counter), is there a better way to achieve my goal?
I also think the loop is wasteful: when a drive is at its limit, the file is put back on the queue and polled again immediately, so the loop spins almost like an infinite loop and burns CPU.
After much time, I've found a solution that meets my need: instead of an integer counter, an AtomicInteger, or whatever, I've used an ExecutorService, and each submitted task uses a Semaphore shared across all the files of one drive.
Like:
ConcurrentMap<FileStore, Semaphore> map = new ConcurrentHashMap<>();
ExecutorService es = Executors.newFixedThreadPool(10);
for (Path path : listFile()) {
    final FileStore store = Files.getFileStore(path);
    // One semaphore per drive, created on first use
    final Semaphore semaphore = map.computeIfAbsent(store, key -> new Semaphore(getAllocatedCredits(store)));
    final int cost = computeCost(path);
    es.submit(() -> {
        semaphore.acquire(cost); // blocks until the drive has enough free credits
        try {
            ... some work ...
        } finally {
            semaphore.release(cost);
        }
        return null; // makes the lambda a Callable, so acquire() may throw InterruptedException
    });
}
int getAllocatedCredits(FileStore store) {return 2;}
int computeCost(Path path) {return 1;}
Notice the help of Java 8, especially in computeIfAbsent and submit.
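To make the semaphore approach easier to try out, here is a minimal, self-contained sketch. It is not the original code: the drive is represented by a plain String label instead of a FileStore, the hashing is replaced by a short sleep, and the class name PerDriveThrottle, the method run, and the peak map are all made up for illustration. It records the highest number of tasks seen running at once per drive, which should never exceed the number of permits.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerDriveThrottle {

    /**
     * Runs one simulated task per entry of driveOfEachFile on a pool of poolSize
     * threads, allowing at most permitsPerDrive tasks per drive at the same time.
     * Returns the highest concurrency observed for each drive.
     */
    static Map<String, Integer> run(List<String> driveOfEachFile, int poolSize, int permitsPerDrive)
            throws InterruptedException {
        Map<String, Semaphore> semaphores = new ConcurrentHashMap<>();
        Map<String, AtomicInteger> active = new ConcurrentHashMap<>();
        Map<String, Integer> peak = new ConcurrentHashMap<>();
        ExecutorService es = Executors.newFixedThreadPool(poolSize);

        for (String drive : driveOfEachFile) {
            // One semaphore and one activity counter per drive, created on first use.
            Semaphore semaphore = semaphores.computeIfAbsent(drive, k -> new Semaphore(permitsPerDrive));
            AtomicInteger counter = active.computeIfAbsent(drive, k -> new AtomicInteger());
            es.submit(() -> {
                semaphore.acquire(); // wait until this drive has a free permit
                try {
                    int now = counter.incrementAndGet();
                    peak.merge(drive, now, Math::max); // remember the highest value seen
                    Thread.sleep(20); // stand-in for hashing a file (assumption)
                } finally {
                    counter.decrementAndGet();
                    semaphore.release();
                }
                return null; // value-returning lambda: treated as a Callable
            });
        }

        es.shutdown();
        if (!es.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("tasks did not finish in time");
        }
        return peak;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> drives = Arrays.asList("C:", "C:", "C:", "C:", "D:", "D:", "E:");
        System.out.println(run(drives, 4, 2));
    }
}
```

One thing to keep in mind with this design: a task that is waiting on its semaphore still occupies a pool thread. If many files from the same drive are submitted in a row, several threads may sit blocked while other drives are idle, so interleaving the files by drive before submitting them may help.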
If the drive hardware configuration is not known at compile time, and may be changed or upgraded, it's tempting to use a thread pool per drive and make the thread counts user-configurable. I am not familiar with newFixedThreadPool: is the thread count a property that can be changed at run time to optimize performance?
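On the run-time question: Executors.newFixedThreadPool is declared to return a plain ExecutorService, which has no method for changing the size. The ThreadPoolExecutor class, however, does have setCorePoolSize and setMaximumPoolSize, so one option is to build the pool directly with the same settings a fixed pool uses. The sketch below assumes that approach; the class name ResizablePool and the helpers create and resize are hypothetical. The order of the two setters matters because the core size must not exceed the maximum size (newer Java versions reject such a call with an IllegalArgumentException).

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ResizablePool {

    /** Builds a pool with core size == maximum size and an unbounded queue, like a fixed pool. */
    static ThreadPoolExecutor create(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
    }

    /** Changes the number of threads while the pool is running. */
    static void resize(ThreadPoolExecutor pool, int threads) {
        if (threads > pool.getMaximumPoolSize()) {
            // Growing: raise the maximum first so that core <= maximum always holds.
            pool.setMaximumPoolSize(threads);
            pool.setCorePoolSize(threads);
        } else {
            // Shrinking (or same size): lower the core size first.
            pool.setCorePoolSize(threads);
            pool.setMaximumPoolSize(threads);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create(1); // e.g. one thread for a slow drive
        resize(pool, 4);                     // e.g. the drive was replaced by a faster one
        System.out.println(pool.getCorePoolSize() + " / " + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

With one such pool per drive (for example in a map keyed by FileStore), the per-drive limit becomes the pool size itself and can be tuned from user configuration. Shrinking does not interrupt tasks that are already running; surplus threads are only retired once they become idle.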