
Python: Why is reading a file in multiple parallel processes slower than in a single process/thread?

What could be the reason that Python multiprocessing is slower than a single thread while reading binary files?

import multiprocessing
import os
import time


def getBinaryData(procnum, filename, pointer_from, pointer_to):
    binary_values = []

    start = time.time()

    with open(filename, 'rb') as fileobject:
        # read the file byte by byte, starting at this process's offset
        fileobject.seek(pointer_from)
        data = fileobject.read(1)
        pointer_position = fileobject.tell()

        # stop at end of file or at the end of this process's chunk
        while data != b'' and pointer_position < pointer_to:
            #binary_values.append(ord(data))
            data = fileobject.read(1)
            pointer_position = fileobject.tell()
    end = time.time()
    print("proc ", procnum, " finished in: ", end - start)
    return binary_values

def worker(procnum, last_proc_num, file_path, bytes_chunk, return_dict):
    """Worker: read this process's slice of the file."""
    print(str(procnum) + " represent!")
    if procnum == 0:
        # first process reads from the start of the file
        greyscale_data = getBinaryData(procnum, file_path, 0, bytes_chunk)
    elif procnum == last_proc_num:
        # last process reads through to the end of the file (picks up the remainder)
        greyscale_data = getBinaryData(procnum, file_path, procnum * bytes_chunk, os.stat(file_path).st_size)
    else:
        greyscale_data = getBinaryData(procnum, file_path, procnum * bytes_chunk, (procnum+1) * bytes_chunk)

    size = get_size(len(greyscale_data))  # get_size: helper from the original post, not shown here

    return_dict[procnum] = procnum


def main():
    cpu_cores = 10
    file_path = r"test_binary_file.exe"
    file_stats = os.stat(file_path)
    file_size = file_stats.st_size

    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(cpu_cores):
        p = multiprocessing.Process(target=worker, args=(i, cpu_cores-1, file_path, int(file_size/cpu_cores), return_dict))
        jobs.append(p)
        p.start()

    for proc in jobs:
        proc.join()
    print(return_dict.values())


# the guard keeps child processes from re-running the spawning code on import (required on Windows)
if __name__ == '__main__':
    main()

While a single-threaded process finishes reading a 10 MB file in roughly 30 seconds, the multiprocess solution takes considerably longer.
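(As an aside, the ~30 second figure is likely dominated by calling read(1) once per byte. Below is a minimal sketch, not from the original post, of reading the same byte range in larger blocks; the function name and the 64 KiB block size are illustrative choices.)

def getBinaryDataBlocked(filename, pointer_from, pointer_to, block_size=64 * 1024):
    # read the byte range [pointer_from, pointer_to) in large blocks
    chunks = []
    with open(filename, 'rb') as fileobject:
        fileobject.seek(pointer_from)
        remaining = pointer_to - pointer_from
        while remaining > 0:
            block = fileobject.read(min(block_size, remaining))
            if not block:  # hit end of file early
                break
            chunks.append(block)
            remaining -= len(block)
    return b''.join(chunks)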

Python log output (the original post shows two screenshots: one for 10 processes, one for 1 process).

Ruled-out issues:

  • IO bottleneck (NVMe SSD)
  • CPU/RAM bottleneck (16 cores @ 4.4 GHz / 64 GB 3200 MHz RAM)

Processes are heavyweight, and creating and tearing down a process takes a lot of time, so in my opinion reading the file itself is fast and most of the time goes into creating and terminating the processes.
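A rough way to see that overhead in isolation is to time an empty worker started as a process versus called directly (a sketch, not from the answer; the numbers will vary with OS and start method):

import time
import multiprocessing

def noop():
    # deliberately empty: any measured time is process start/stop overhead
    pass

if __name__ == '__main__':
    start = time.time()
    procs = [multiprocessing.Process(target=noop) for _ in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("10 processes doing nothing:", time.time() - start, "s")

    start = time.time()
    for _ in range(10):
        noop()
    print("10 direct calls:", time.time() - start, "s")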

To read a lot of file data it is enough to use multithreading, because threads are lightweight and the GIL is released during I/O operations, so threads behave like true parallelism for I/O-bound work.
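For example, the per-chunk reads from the question could be driven by a thread pool instead of processes (a sketch that reuses getBinaryData from the question; the pool size of 10 is illustrative):

import os
from concurrent.futures import ThreadPoolExecutor

def read_with_threads(file_path, n_workers=10):
    # split the file into n_workers byte ranges and read them in threads
    file_size = os.stat(file_path).st_size
    chunk = file_size // n_workers
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = []
        for i in range(n_workers):
            start = i * chunk
            end = file_size if i == n_workers - 1 else (i + 1) * chunk
            futures.append(pool.submit(getBinaryData, i, file_path, start, end))
        return [f.result() for f in futures]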

It's recommended to use multiprocessing when you need to execute CPU-heavy operations. (image in the original answer) Source of picture: https://youtu.be/kRy_UwUhBpo?t=763 (the speaker says the image is from fastpython.com).
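By contrast, here is a minimal sketch of the CPU-bound case where multiprocessing does pay off (the busy-work function is made up for illustration):

from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    # stand-in for real computation; holds the GIL for the whole loop
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    # each worker process has its own interpreter and its own GIL,
    # so the four calls can run on four cores at once
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(cpu_heavy, [10_000_000] * 4)))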
