
Python Recursive Multiprocessing - too many threads

Background:

Python 3.5.1, Windows 7

I have a network drive that holds a large number of files and directories. I'm trying to write a script that parses through all of them as quickly as possible, finds every file matching a RegEx, and copies those files to my local PC for review. There are about 3500 directories and subdirectories and a few million files. I'm trying to keep this as generic as possible (i.e., not writing code for this exact file structure) so I can reuse it for other network drives. My code works when run against a small network drive; the issue here seems to be scalability.

I've tried a few things using the multiprocessing library and can't get it to work reliably. My idea was to create a new job for each subdirectory so the parsing proceeds as quickly as possible. I have a recursive function that parses through all objects in a directory, calls itself for any subdirectories, and checks any files it finds against the RegEx.

Question: how can I limit the number of threads/processes without using Pools to achieve my goal?

What I've tried:

  • If I only use Process jobs, I get the error RuntimeError: can't start new thread after more than a few hundred threads start, and connections start dropping. I end up with about half the files found, since half of the directories error out (code for this below).
  • To limit the total number of threads, I tried the Pool methods, but I can't pass Pool objects to called methods according to this question, which makes the recursive implementation impossible.
  • To fix that, I tried to call Processes inside the Pool methods, but I get the error daemonic processes are not allowed to have children.
  • I think that if I can limit the number of concurrent threads, then my solution will work as designed (a rough sketch of the kind of limit I have in mind follows the code below).

Code:

import os
import re
import shutil
from multiprocessing import Process, Manager

CheckLocations = ['network drive location 1', 'network drive location 2']
SaveLocation = 'local PC location'
FileNameRegex = re.compile('RegEx here', flags = re.IGNORECASE)


# Loop through all items in folder, and call itself for subfolders.
def ParseFolderContents(path, DebugFileList):

    FolderList = []
    jobs = []
    TempList = []

    if not os.path.exists(path):
        return

    try:

        for item in os.scandir(path):

            try:

                if item.is_dir():
                    p = Process(target=ParseFolderContents, args=(item.path, DebugFileList))
                    jobs.append(p)
                    p.start()

                elif FileNameRegex.search(item.name) != None:
                    DebugFileList.append((path, item.name))

                else:
                    pass

            except Exception as ex:
                if hasattr(ex, 'message'):
                    print(ex.message)
                else:
                    print(ex)
                    # print('Error in file:\t' + item.path)

    except Exception as ex:
        if hasattr(ex, 'message'):
            print(ex.message)
        else:
            print('Error in path:\t' + path)
            print('\tToo many threads to restart directory.')

    for job in jobs:
        job.join()


# Save list of debug files.
def SaveDebugFiles(DebugFileList):

    for file in DebugFileList:
        try:
            shutil.copyfile(file[0] + '\\' + file[1], SaveLocation + file[1])
        except PermissionError:
            continue


if __name__ == '__main__':

    with Manager() as manager:

        # Iterate through all directories to make a list of all desired files.
        DebugFileList = manager.list()
        jobs = []

        for path in CheckLocations:
            p = Process(target=ParseFolderContents, args=(path, DebugFileList))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()

        print('\n' + str(len(DebugFileList)) + ' files found.\n')
        if len(DebugFileList) == 0:
            quit()

        # Iterate through all debug files and copy them to local PC.
        n = 25 # Number of files to grab for each parallel path.
        TempList = [DebugFileList[i:i + n] for i in range(0, len(DebugFileList), n)] # Split list into small chunks.
        jobs = []

        for item in TempList:
            p = Process(target=SaveDebugFiles, args=(item, ))
            jobs.append(p)
            p.start()

        for job in jobs:
            job.join()
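
To make the question concrete, here is roughly the kind of limit I have in mind — just an untested sketch, not working code; the MAX_CHILDREN value and the ParseLimited / ChildWorker names are placeholders I made up. The idea is a multiprocessing.BoundedSemaphore that caps how many child processes exist at once, falling back to recursing in the current process whenever the cap is reached:

import os
import re
from multiprocessing import Process, Manager, BoundedSemaphore

CheckLocations = ['network drive location 1', 'network drive location 2']
FileNameRegex = re.compile('RegEx here', flags=re.IGNORECASE)
MAX_CHILDREN = 16  # arbitrary cap on concurrently running child processes


def ParseLimited(path, DebugFileList, sem):
    # Same traversal as above (error handling trimmed for brevity), but a child
    # process is only spawned while the semaphore has a free slot; otherwise
    # the subdirectory is handled by recursing in the current process.
    jobs = []
    try:
        for item in os.scandir(path):
            if item.is_dir():
                if sem.acquire(block=False):  # a slot is free: spawn a child
                    p = Process(target=ChildWorker, args=(item.path, DebugFileList, sem))
                    jobs.append(p)
                    p.start()
                else:                         # no slot free: stay in this process
                    ParseLimited(item.path, DebugFileList, sem)
            elif FileNameRegex.search(item.name) is not None:
                DebugFileList.append((path, item.name))
    except OSError as ex:
        print('Error in path:\t' + path + '\t' + str(ex))
    for job in jobs:
        job.join()


def ChildWorker(path, DebugFileList, sem):
    try:
        ParseLimited(path, DebugFileList, sem)
    finally:
        sem.release()  # free the slot once this subtree is done


if __name__ == '__main__':
    with Manager() as manager:
        DebugFileList = manager.list()
        sem = BoundedSemaphore(MAX_CHILDREN)
        for path in CheckLocations:
            ParseLimited(path, DebugFileList, sem)
        print(str(len(DebugFileList)) + ' files found.')

The non-blocking acquire is what should keep this from deadlocking: a directory never waits for a free slot, it just gets handled in whatever process found it.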

Don't disdain the usefulness of pools, especially when you want to control the number of processes to create. They also take care of managing your workers (create/start/join/distribute chunks of work) and help you collect potential results.

As you have realized yourself, you create way too many processes, up to a point where you seem to exhaust so many system resources that you cannot create more processes.

Additionally, the creation of new processes in your code is controlled by outside factors, i.e. the number of folders in your file trees, which makes it very difficult to limit their number. Also, creating a new process comes with quite a bit of OS overhead, and you might even end up wasting that overhead on empty directories. Plus, context switches between processes are quite costly.

With the number of processes you create, given the number of folders you stated, your processes will basically just sit there and idle most of the time while they wait for a share of CPU time to actually do some work. There will be a lot of contention for that CPU time, unless you have a supercomputer with thousands of cores at your disposal. And even when a process gets some CPU time, it will likely spend quite a bit of it waiting for I/O.

That being said, you'll probably want to look into using threads for such a task. And you could do some optimization in your code. From your example, I don't see any reason why you would split identifying the files to copy and actually copying them into different tasks. Why not let your workers copy each file they found matching the RE right away?

I'd create a list of files in the directories in question using os.walk (which I consider reasonably fast) from the main thread, and then offload that list to a pool of workers that check these files for matches and copy the matching ones right away:

import os
import re
from multiprocessing.pool import ThreadPool

search_dirs = ["dir 1", "dir2"]
ptn = re.compile(r"your regex")
# your target dir definition

file_list = []

for topdir in search_dirs:
    for root, dirs, files in os.walk(topdir):
        for file in files:
            file_list.append(os.path.join(root, file))

def copier(path):
    if ptn.match(path):
        # do your shutil.copyfile with the try-except right here
        # obviously I did not want to start mindlessly copying around files on my box :)
        return path

with ThreadPool(processes=10) as pool:
    results = pool.map(copier, file_list)

# print all the processed files. For those that did not match, None is returned
print("\n".join([r for r in results if r]))

On a side note: don't concatenate your paths manually (file[0] + '\\' + file[1]); use os.path.join for this instead.
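
Applied to the SaveDebugFiles function from the question, that would look roughly like this (sketch only; SaveLocation is the same placeholder as in the question, and os.path.join handles it with or without a trailing separator):

import os
import shutil

SaveLocation = 'local PC location'  # as defined in the question

def SaveDebugFiles(DebugFileList):
    # Same copy loop as in the question, with os.path.join building the paths.
    for folder, name in DebugFileList:
        try:
            shutil.copyfile(os.path.join(folder, name),
                            os.path.join(SaveLocation, name))
        except PermissionError:
            continue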

I was unable to get this to work exactly as I wanted. os.walk was slow, and every other method I thought of was either similarly slow or crashed due to too many threads.

I ended up using a method similar to the one I posted above, but instead of starting the recursion at the top-level directory, it first descends one or two levels until there are several directories. It then starts the recursion at each of those directories in series, which limits the number of threads enough to finish successfully. Execution time ended up similar to the os.walk approach, so os.walk would probably make for a simpler and more readable implementation.
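
A rough sketch of that approach (untested; the SeedDirectories helper and the single-level descent are just illustrative — files sitting directly in the top-level directory would still need a separate check, which I omit here):

import os

def SeedDirectories(top):
    # Return the immediate subdirectories of `top` (or `top` itself if it
    # has none) to use as starting points for ParseFolderContents.
    try:
        subdirs = [entry.path for entry in os.scandir(top) if entry.is_dir()]
    except OSError:
        return [top]
    return subdirs or [top]

# Each seed directory is then parsed in series, so only the processes
# spawned for one subtree are alive at any time, e.g.:
#
#     for seed in SeedDirectories('network drive location 1'):
#         ParseFolderContents(seed, DebugFileList)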
