
In Python's multiprocessing module, is it good practice to call a worker pool inside a for loop?

Is it a good practice to call pool.map inside a for loop to minimize memory usage?

For example, in my code, I'm trying to minimize memory usage by only processing one directory at a time:

import os
import multiprocessing

PATH = "/dir/files"

def readMedia(fname):
    """Do a CPU-intensive task."""
    pass

def init(queue):
    readMedia.queue = queue

def main():
    print("Starting the scanner in root " + PATH)

    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(processes=32, initializer=init, initargs=[queue])

    for dirpath, dirnames, filenames in os.walk(PATH):
        full_path_fnames = map(lambda fn: os.path.join(dirpath, fn),
                               filenames)
        pool.map(readMedia, full_path_fnames)

        result = queue.get()
        print(result)

When I test the above code, it eats up all my memory, even after the script is terminated.

There are probably a few issues here. First, you're using too many processes in your pool. Because you're doing a CPU-intensive task, you'll see diminishing returns if you start more than multiprocessing.cpu_count() workers; if you've got 32 workers doing CPU-intensive tasks but only 4 CPUs, 28 processes will always be sitting around doing no work while still consuming memory.

You're probably still seeing high memory usage after killing the script because one or more of the child processes is still running. Take a look at the process list after you kill the main script and make sure none of the children are left behind.

If you're still seeing memory usage growing too high over time, you could try setting the maxtasksperchild keyword argument when you create the pool, which will restart each child process once it has run the given number of tasks, releasing any memory that may have leaked.

As for memory usage gains by calling map in a for loop, you do get the advantage of not having to store the results of every single call to readMedia in one in-memory list, which definitely saves memory if there is a huge list of files being iterated over.
