
Python Recursive Multiprocessing - too many threads

Background:

Python 3.5.1, Windows 7

I have a network drive that holds a large number of files and directories. I'm trying to write a script to parse through all of these as quickly as possible, find every file that matches a RegEx, and copy those files to my local PC for review. There are about 3500 directories and subdirectories, and a few million files. I'm trying to make this as generic as possible (i.e., not writing code for this exact file structure) so that I can reuse it for other network drives. My code works when run against a small network drive; the issue here seems to be scalability.

I've tried a few things using the multiprocessing library and can't seem to get it to work reliably. My idea was to create a new job to parse through each subdirectory so the whole thing finishes as quickly as possible. I have a recursive function that parses through all objects in a directory, calls itself for any subdirectories, and checks any files it finds against the RegEx.

Question: how can I limit the number of threads/processes, without using Pools, to achieve my goal?

What I've tried:

  • If I only use Process jobs, I get the error RuntimeError: can't start new thread after more than a few hundred threads start, and it starts dropping connections. I end up with about half of the files found, since half of the directories error out (code for this is below).
  • To limit the total number of threads, I tried to use the Pool methods, but according to this question I can't pass Pool objects to the called methods, which makes a recursive implementation impossible.
  • To work around that, I tried to call Processes inside the Pool methods, but I get the error daemonic processes are not allowed to have children.
  • I think that if I can limit the number of concurrent threads, my solution will work as designed (one possible way to do that is sketched right after this list).
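For illustration only (this sketch is not from the original post): one way to cap the process count without a Pool is to drop the per-directory recursion and instead have a fixed set of worker processes pull directories from a shared queue, pushing any subdirectories they find back onto it. The worker count, regex, and locations below are placeholders.

import os
import re
from multiprocessing import Process, JoinableQueue, Manager

FileNameRegex = re.compile('RegEx here', flags=re.IGNORECASE)  # placeholder
CheckLocations = ['network drive location 1']                  # placeholder
NUM_WORKERS = 8                                                # illustrative cap

def worker(dir_queue, matches):
    while True:
        path = dir_queue.get()
        if path is None:                      # sentinel: shut down
            dir_queue.task_done()
            break
        try:
            for item in os.scandir(path):
                if item.is_dir():
                    dir_queue.put(item.path)  # "recurse" by enqueueing
                elif FileNameRegex.search(item.name):
                    matches.append((path, item.name))
        except OSError as ex:
            print(ex)
        finally:
            dir_queue.task_done()

if __name__ == '__main__':
    with Manager() as manager:
        matches = manager.list()
        dir_queue = JoinableQueue()
        for location in CheckLocations:
            dir_queue.put(location)
        workers = [Process(target=worker, args=(dir_queue, matches))
                   for _ in range(NUM_WORKERS)]
        for w in workers:
            w.start()
        dir_queue.join()                      # blocks until every queued directory is processed
        for _ in workers:
            dir_queue.put(None)               # one sentinel per worker
        for w in workers:
            w.join()
        print(str(len(matches)) + ' files found.')

Only NUM_WORKERS processes ever exist, no matter how many directories there are, and the copying step could be folded into the worker as well.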

Code:

import os
import re
import shutil
from multiprocessing import Process, Manager

CheckLocations = ['network drive location 1', 'network drive location 2']
SaveLocation = 'local PC location'
FileNameRegex = re.compile('RegEx here', flags = re.IGNORECASE)


# Loop through all items in folder, and call itself for subfolders.
def ParseFolderContents(path, DebugFileList):

    FolderList = []
    jobs = []
    TempList = []

    if not os.path.exists(path):
        return

    try:

        for item in os.scandir(path):

            try:

                if item.is_dir():
                    p = Process(target=ParseFolderContents, args=(item.path, DebugFileList))
                    jobs.append(p)
                    p.start()

                elif FileNameRegex.search(item.name) != None:
                    DebugFileList.append((path, item.name))

                else:
                    pass

            except Exception as ex:
                if hasattr(ex, 'message'):
                    print(ex.message)
                else:
                    print(ex)
                    # print('Error in file:\t' + item.path)

    except Exception as ex:
        if hasattr(ex, 'message'):
            print(ex.message)
        else:
            print(ex)
        print('Error in path:\t' + path)
        print('\tToo many threads to restart directory.')

    for job in jobs:
        job.join()


# Save list of debug files.
def SaveDebugFiles(DebugFileList):

    for file in DebugFileList:
        try:
            shutil.copyfile(file[0] + '\\' + file[1], SaveLocation + file[1])
        except PermissionError:
            continue


if __name__ == '__main__':

    with Manager() as manager:

        # Iterate through all directories to make a list of all desired files.
        DebugFileList = manager.list()
        jobs = []

        for path in CheckLocations:
            p = Process(target=ParseFolderContents, args=(path, DebugFileList))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()

        print('\n' + str(len(DebugFileList)) + ' files found.\n')
        if len(DebugFileList) == 0:
            quit()

        # Iterate through all debug files and copy them to local PC.
        n = 25 # Number of files to grab for each parallel path.
        TempList = [DebugFileList[i:i + n] for i in range(0, len(DebugFileList), n)] # Split list into small chunks.
        jobs = []

        for item in TempList:
            p = Process(target=SaveDebugFiles, args=(item, ))
            jobs.append(p)
            p.start()

        for job in jobs:
            job.join()

Don't disdain the usefulness of pools, especially when you want to control the number of processes to create. They also take care of managing your workers (create/start/join/distribute chunks of work) and help you collect potential results.
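For illustration (this is not part of the original answer), a minimal process-Pool version of that idea with a placeholder worker function and inputs; the Pool caps the number of worker processes and gathers the return values for you:

from multiprocessing import Pool

def check_dir(path):
    # placeholder work: scan one directory and return whatever matched
    return path

if __name__ == '__main__':
    with Pool(processes=8) as pool:     # never more than 8 worker processes
        results = pool.map(check_dir, ['dir 1', 'dir 2', 'dir 3'])
    print(results)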

As you have realized yourself, you create way too many processes, up to a point where you seem to exhaust so many system resources that you cannot create more processes.

Additionally, the creation of new processes in your code is controlled by outside factors, i.e. the number of folders in your file tree, which makes it very difficult to limit the number of processes. Also, creating a new process comes with quite some overhead on the OS, and you might even end up wasting that overhead on empty directories. Plus, context switches between processes are quite costly.

With the number of processes you create, given the number of folders you stated, your processes will basically just sit there idle most of the time, waiting for a share of CPU time to actually do some work. There will be a lot of contention for that CPU time unless you have a supercomputer with thousands of cores at your disposal. And even when a process gets some CPU time, it will likely spend quite a bit of it waiting for I/O.

That being said, you'll probably want to look into using threads for such a task. And you could do some optimization in your code. From your example, I don't see any reason why you would split identifying the files to copy and actually copying them into different tasks. Why not let your workers copy each file they found matching the RE right away?

I'd create a list of the files in the directories in question using os.walk (which I consider reasonably fast) from the main thread, and then offload that list to a pool of workers that checks these files for matches and copies those right away:

import os
import re
from multiprocessing.pool import ThreadPool

search_dirs = ["dir 1", "dir2"]
ptn = re.compile(r"your regex")
# your target dir definition

file_list = []

for topdir in search_dirs:
    for root, dirs, files in os.walk(topdir):
        for file in files:
            file_list.append(os.path.join(root, file))

def copier(path):
    if ptn.match(path):
        # do your shutil.copyfile with the try-except right here
        # obviously I did not want to start mindlessly copying around files on my box :)
        return path

with ThreadPool(processes=10) as pool:
    results = pool.map(copier, file_list)

# print all the processed files. For those that did not match, None is returned
print("\n".join([r for r in results if r]))

On a side note: don't concatenate your paths manually (file[0] + '\\' + file[1]); rather, use os.path.join for this.
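For example (the paths here are illustrative):

import os

src = os.path.join(r'\\server\share\logs', 'debug.log')  # -> \\server\share\logs\debug.log on Windows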

I was unable to get this to work exactly as I desired. os.walk was slow, and every other method I thought of was either similarly slow or crashed due to too many threads.

I ended up using a method similar to the one I posted above, but instead of starting the recursion at the top-level directory, it goes down one or two levels until there are several directories. It then starts the recursion at each of those directories in series, which limits the number of threads enough to finish successfully. Execution time is similar to os.walk, which would probably make for a simpler and more readable implementation.
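For reference, a rough sketch of what that looks like (this is a reconstruction, not the poster's actual code; it fans out only one level deep and assumes it lives in the same module as ParseFolderContents, FileNameRegex, and CheckLocations from the question above):

if __name__ == '__main__':
    with Manager() as manager:
        DebugFileList = manager.list()
        for top in CheckLocations:
            for first in os.scandir(top):
                if first.is_dir():
                    # ParseFolderContents joins its child processes before returning,
                    # so only one first-level subtree's worth of processes is alive at a time.
                    ParseFolderContents(first.path, DebugFileList)
                elif FileNameRegex.search(first.name):
                    DebugFileList.append((top, first.name))
        print('\n' + str(len(DebugFileList)) + ' files found.\n')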
