
How to parallelize the following code using multiprocessing in Python

I have a function generate(file_path) which returns an integer index and a numpy array. A simplified version of the generate function is as follows:

import numpy as np

def generate(file_path):
    temp = np.load(file_path)
    # get the index from the string file_path
    idx = int(file_path.split("_")[0])
    # do some mathematical operation on temp
    result = operate(temp)
    return idx, result

I need to glob through a directory and collect the results of generate(file_path) into an HDF5 file. My serial code is as follows:

for path in glob.glob(directory):
    idx, result = generate(path)
    hdf5_file["results"][idx, :] = result

hdf5_file.close()

I would like to write multi-threaded or multi-process code to speed up the code above. How could I modify it? Thanks in advance!

My attempt was to modify my generate function and my "main" as follows:

def generate(file_path):
    temp = np.load(file_path)
    # get the index from the string file_path
    idx = int(file_path.split("_")[0])
    # do some mathematical operation on temp
    result = operate(temp)

    hdf5_path = "./result.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file["results"][idx, :] = result

    hdf5_file.close()

import glob
import multiprocessing as mp
from multiprocessing import Pool

import h5py
import numpy as np

if __name__ == '__main__':
    ## construct the hdf5 file
    hdf5_path = "./output.hdf5"
    hdf5_file = h5py.File(hdf5_path, 'w')
    hdf5_file.create_dataset("results", [2000, 15000], np.uint8)

    hdf5_file.close()

    path_ = "./compute/*"
    p = Pool(mp.cpu_count())
    p.map(generate, glob.glob(path_))

    print("finished")

However, it does not work. It throws the following error:

KeyError: "Unable to open object (object 'results' doesn't exist)"

After examining your code, I found some errors in how the dataset is initialised:

You created the HDF5 file with the path "./result.hdf5" inside the generate function.

However, that file never gets a "results" dataset: the dataset was created in "./output.hdf5", and opening "./result.hdf5" with mode 'w' produces a fresh, empty file each time. That is what causes the "object doesn't exist" error.

Kindly reply if you still face the same issue or error message.
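For illustration, a minimal sketch of the fix this answer points at, assuming the dataset lives in "./output.hdf5" as created in the question's main block: open that same file in append mode ('a') instead of creating a fresh "./result.hdf5" with mode 'w'. This resolves the KeyError, although, as the next answer explains, writing to one HDF5 file from several worker processes is still not safe.

import h5py
import numpy as np

def generate(file_path):
    temp = np.load(file_path)
    idx = int(file_path.split("_")[0])
    result = operate(temp)  # operate() is the question's placeholder

    # Open the file that actually contains the "results" dataset, in
    # append mode so the existing contents are not truncated. This fixes
    # the KeyError, but concurrent writes from multiple processes to one
    # HDF5 file remain unsafe (see the next answer).
    with h5py.File("./output.hdf5", 'a') as hdf5_file:
        hdf5_file["results"][idx, :] = result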

You can use a thread or process pool to execute multiple function calls concurrently. Here is an example which uses a process pool:

from concurrent.futures import ProcessPoolExecutor
from time import sleep


def generate(file_path: str) -> int:
    sleep(1.0)
    return int(file_path.split("_")[1])


def main():
    file_paths = ["path_1", "path_2", "path_3"]
    
    with ProcessPoolExecutor() as pool:
        results = pool.map(generate, file_paths)
        
        for result in results:
            # Write to the HDF5 file
            print(result)
    

if __name__ == "__main__":
    main()

Note that you should not write to the same HDF5 file concurrently, i.e. the file writing should not happen inside the generate function.
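Putting both points together, here is a minimal sketch (not part of the original answer) of how the question's code could be restructured: the workers only compute and return (idx, result), and the main process performs every HDF5 write serially. The generate name, the dataset shape, and the "./compute/*" glob pattern are taken from the question; operate() is replaced by a stand-in stub.

import glob
from concurrent.futures import ProcessPoolExecutor

import h5py
import numpy as np

def operate(temp):
    # stand-in stub for the question's mathematical operation
    return temp

def generate(file_path):
    temp = np.load(file_path)
    idx = int(file_path.split("_")[0])
    result = operate(temp)
    return idx, result  # workers compute but never touch the file

def main():
    with h5py.File("./output.hdf5", 'w') as hdf5_file:
        dset = hdf5_file.create_dataset("results", (2000, 15000), dtype=np.uint8)

        with ProcessPoolExecutor() as pool:
            # the computation runs in parallel across processes ...
            for idx, result in pool.map(generate, glob.glob("./compute/*")):
                # ... but all HDF5 writes happen here, in the main process only
                dset[idx, :] = result

if __name__ == "__main__":
    main()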
