
python multiprocessing set memory per process

I'm using Python to do some processing on text files and am running into MemoryError exceptions. Sometimes the file being processed is quite large, which means that too much RAM is used by a single multiprocessing Process.

Here is a snippet of my code:

import multiprocessing as mp
import os

def preprocess_file(file_path):
    with open(file_path, "r+") as f:
        file_contents = f.read()
        # modify the file_contents
        # ...
        # overwrite file
        f.seek(0)
        f.write(file_contents)
        f.truncate()

if __name__ == "main":
    with mp.Pool(mp.cpu_count()) as pool:
        pool_processes = []
        # for all files in dir
        for root, dirs, files in os.walk(some_path):
            for f in files:
                pool_processes.append(os.path.join(root, f))
        # start the processes
        pool.map(preprocess_file, pool_processes)

I have tried to use the resource package to set a limit on how much RAM each process can use, as shown below, but this hasn't fixed the issue and I still get MemoryError raised, which leads me to believe it's the pool.map which is causing issues. I was hoping to have each process deal with the exception individually so that the file could be skipped rather than crashing the whole program.

import resource

def preprocess_file(file_path):
    try:
        hard = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") # total bytes of RAM in machine
        soft = (hard - 512 * 1024 * 1024) // mp.cpu_count() # split between each cpu and save 512MB for the system
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard)) # apply limit
        with open(file_path, "r+") as f:
            ...  # same read/modify/overwrite logic as above
    except Exception as e:  # bad practice - should be more specific but just a placeholder
        ...  # log the file and skip it

How can I let an individual process run out of memory while letting the other processes continue unaffected? Ideally I want to catch the exception within the preprocess_file function so that I can log exactly which file caused the error.

Edit: The preprocess_file function does not share data with any other process, so there is no need for shared memory. The function also needs to read the entire file at once, as the reformatting cannot be done line by line.

Edit 2: The traceback from the program is below. As you can see, the error doesn't actually point to my code; it comes entirely from the multiprocessing package's own files.

Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 125, in worker
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 341, in put
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 39, in __init__
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 130, in worker
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 341, in put
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 39, in __init__
MemoryError

If MemoryError is raised, the worker process may or may not be able to recover from the situation. If it can, as @Thomas suggests, catch the MemoryError somewhere:

import multiprocessing as mp
from time import sleep


def initializer():
    # Probably set the memory limit here
    pass


def worker(i):
    sleep(1)

    try:
        if i % 2 == 0:
            raise MemoryError
    except MemoryError as ex:
        return str(ex)

    return i


if __name__ == '__main__':
    with mp.Pool(2, initializer=initializer) as pool:
        tasks = range(10)
        results = pool.map(worker, tasks)
        print(results)
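
For the initializer, one possibility (a rough sketch that reuses the resource.setrlimit figures from the question, not tuned values) is to apply the address-space limit once per worker as it starts, instead of inside the task function:

import multiprocessing as mp
import os
import resource


def initializer():
    # Sketch: cap each worker's address space (RLIMIT_AS) when it starts.
    # The 512 MB reserve and per-CPU split are carried over from the
    # question's snippet, not recommended values.
    hard = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")  # total RAM in bytes
    soft = (hard - 512 * 1024 * 1024) // mp.cpu_count()
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

With this in place, an allocation that would push a worker past the soft limit typically raises MemoryError inside that worker, where the try/except in the task function can catch it.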

If the worker cannot recover, the whole pool is unlikely to keep working. For example, change worker to force an exit:

def worker(i):
    sleep(1)

    try:
        if i % 2 == 0:
            raise MemoryError
        elif i == 5:
            import sys
            sys.exit()
    except MemoryError as ex:
        return str(ex)

    return i

Pool.map then never returns and blocks forever.
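
If you want to avoid blocking the parent forever in that case, one workaround (an illustrative sketch, not part of the original answer) is to submit the work with map_async and wait on the result with a timeout. The timeout does not rescue the dead worker's task; it only hands control back to the parent:

        # Inside the existing "with mp.Pool(...) as pool:" block,
        # replacing the plain pool.map(worker, tasks) call.
        async_result = pool.map_async(worker, tasks)
        try:
            results = async_result.get(timeout=60)  # 60 s is an arbitrary example
            print(results)
        except mp.TimeoutError:
            print("workers did not finish in time; one may have died")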
