
How to use multiprocessing to construct data from a list of files in Python

I am interested in speeding up my file read times by using multiprocessing, but I am having trouble getting data back from each process. The order matters when all the data is put back together, and I am using Python 3.9.

import os
import time
from multiprocessing import Process

import numpy as np
from PIL import Image


# read the .tif files from the given file list and stack them into one array
def read_files(files, folder_path):
    raw_data = []
    # loop through all tif files in the given folder and parse the data
    for file in files:
        if file[-3:] == "tif":
            curr_frame = Image.open(os.path.join(folder_path, file))
            raw_data.append(np.array(curr_frame))
    return np.asarray(raw_data).astype(np.float64)


def run_processes(folder_path=None):
    if folder_path is None:
        global PATH
        folder_path = PATH
    files = os.listdir(folder_path)

    start = time.time()
    processes = []
    # split the file list into one roughly equal chunk per CPU core
    # (note: any remainder files beyond num_files_per * cpu_count() are dropped)
    num_files_per = int(len(files) / os.cpu_count())
    for i in range(os.cpu_count()):
        processes.append(Process(target=read_files,
                                 args=(files[(i * num_files_per):((i + 1) * num_files_per)], folder_path)))
    # start all workers, then wait for them to finish
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    end = time.time()
    print(f"Multi: {end - start}")

Any help is much appreciated!

To potentially increase the speed, generate a list of file paths, and write a worker function that takes a single path as its argument and returns its data. If you use that worker with a multiprocessing.Pool, it will take care of the details of returning the data for you, and Pool.map returns the results in the same order as the input paths, which matters in your case.
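As a minimal sketch of that approach (the names load_frame, load_folder and the "frames" folder are mine, not from your code):

import os
from multiprocessing import Pool

import numpy as np
from PIL import Image


def load_frame(path):
    """Worker: read a single .tif file and return it as a float64 array."""
    return np.array(Image.open(path)).astype(np.float64)


def load_folder(folder_path):
    # build the full list of .tif paths up front, in the order you need
    paths = [os.path.join(folder_path, f)
             for f in sorted(os.listdir(folder_path))
             if f.endswith(".tif")]
    # Pool.map hands each path to a worker process and returns the
    # resulting arrays in the same order as `paths`
    with Pool() as pool:
        frames = pool.map(load_frame, paths)
    return np.asarray(frames)


if __name__ == "__main__":
    data = load_folder("frames")  # hypothetical folder name
    print(data.shape)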

Keep in mind that you are trading the time to read a file for the overhead of returning the data to the parent process. It is not a given that this is a net improvement.

And then there is the issue of the file reads themselves. Since these files presumably live on the same device, you may simply hit the device's maximum read throughput.

In general, if the processing you have to do on the images only depends on a single image, it could be worth it to do that processing in the worker, because the per-image work then runs in parallel and only the results have to be sent back to the parent.
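For example, the worker could return the processed result instead of the raw frame; the mean-subtraction step below is purely illustrative, replace it with whatever only depends on a single frame:

def load_and_process(path):
    """Worker: read one .tif file and do the per-image work in the child process."""
    frame = np.array(Image.open(path)).astype(np.float64)
    # hypothetical per-image processing
    frame -= frame.mean()
    return frame

# used exactly like load_frame above:
#     frames = pool.map(load_and_process, paths)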
