
Combination of parallel processing and dask arrays to process multiple image stacks

I have a directory containing n h5 files, each of which has m image stacks to filter. For each image, I will run the filtering (Gaussian and Laplacian) using dask parallel arrays in order to speed up the processing (ref to Dask). I will use the dask arrays through the apply_parallel() function in scikit-image.
I will run the processing on a small server with 20 CPUs.
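Concretely, the per-file filtering I have in mind looks roughly like this (the "stacks" group layout, the overlap depth and the sigma are just placeholders):

```python
# Rough sketch of the filtering for one h5 file; the "stacks" group name,
# sigma and the chunk overlap depth are placeholders.
import h5py
import numpy as np
from skimage.util import apply_parallel
from skimage.filters import gaussian, laplace

def filter_file(path):
    results = {}
    with h5py.File(path, "r") as f:
        for name, dset in f["stacks"].items():
            stack = np.asarray(dset)
            # apply_parallel splits the array into chunks and runs the filter
            # on them through dask; depth adds an overlap between chunks so
            # the filters do not create seams at chunk borders
            smoothed = apply_parallel(gaussian, stack, depth=8,
                                      extra_keywords={"sigma": 2})
            edges = apply_parallel(laplace, stack, depth=8)
            results[name] = (smoothed, edges)
    return results
```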

I would like some advice on which parallel strategy makes more sense to use:

1) Sequential processing of the h5 files, using all the CPUs for the dask processing
2) Parallel processing of the h5 files with x cores, using the remaining 20-x for the dask processing
3) Distribute the resources: process the h5 files in parallel, the images in each h5 file in parallel, and use the remaining resources for dask

Thanks for the help!

Use make for parallelization.

With make -j20 you can tell make to run 20 processes in parallel.

By using multiple processes, you avoid the cost of the "global interpreter lock". For independent tasks, it is more efficient to use multiple independent processes (benchmark it if you have doubts). Make is great for processing whole folders where you need to apply the same command to each file - it is traditionally used for compiling source code, but it can be used to run arbitrary commands.
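A sketch of the kind of per-file command this boils down to (process_one.py and the output naming are hypothetical); make -j20, or any other launcher, then runs one instance of it per .h5 file:

```python
# Hypothetical per-file driver (process_one.py). Each invocation handles one
# .h5 file, so running 20 of them at once gives per-file parallelism without
# touching the GIL.
import sys
import h5py
from skimage.filters import gaussian, laplace

def main(path):
    with h5py.File(path, "r") as src, h5py.File(path + ".filtered.h5", "w") as dst:
        for name, dset in src.items():
            stack = dset[...]
            dst[name + "/gaussian"] = gaussian(stack, sigma=2)  # placeholder sigma
            dst[name + "/laplace"] = laplace(stack)

if __name__ == "__main__":
    main(sys.argv[1])
```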

It is always best to parallelize in the simplest way possible. If you have several files and just want to run the same computation on each of them, then this is almost certainly the simplest approach. If this saturates your computational resources then you can stop here without diving into more sophisticated methods.

If this is indeed your situation then you can parallelize with dask, make, concurrent.futures, or any of a variety of other libraries.
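For example, with concurrent.futures the per-file version is only a few lines (filter_file stands in for whatever per-file routine you already have):

```python
# Simplest per-file parallelism: one worker process per h5 file, at most 20
# at a time. filter_file is a placeholder for the existing per-file routine.
import glob
from concurrent.futures import ProcessPoolExecutor

def filter_file(path):
    ...  # Gaussian/Laplacian filtering for a single file goes here

if __name__ == "__main__":
    files = sorted(glob.glob("*.h5"))
    with ProcessPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(filter_file, files))
```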

If there are other concerns, like trying to parallelize the operation itself or making sure you don't run out of memory, then you are forced into more sophisticated systems like dask, but this may not be the case.
