Python recursive multiprocessing - too many threads

Question

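Background:

Python 3.5.1, Windows 7

I have a network drive that holds a large number of files and directories. I'm trying to write a script that parses through all of them as quickly as possible, finds every file that matches a RegEx, and copies those files to my local PC for review. There are about 3,500 directories and subdirectories, and a few million files. I'm trying to keep it as generic as possible (i.e. not writing code against this exact file structure) so that it can be reused for other network drives. My code works when run against a small network drive; the problem here seems to be scalability.

I've tried a few things with the multiprocessing library, but I can't get it to work reliably. My idea was to create a new job to parse each subdirectory so that it runs as quickly as possible. I have a recursive function that parses all objects in a directory, calls itself for any subdirectories, and checks any files it finds against the RegEx.

Question: how can I limit the number of threads/processes without using Pools to achieve my goal?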

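What I've tried

If I only use Process jobs, I get the error RuntimeError: can't start new thread once more than a few hundred threads have started, and connections start dropping. I end up with about half of the files, because half of the directories error out (code below).
To limit the total number of threads, I tried using the Pool methods, but according to this question I can't pass pool objects to called methods, which makes the recursive implementation impossible.
To work around that, I tried starting Processes inside the Pool methods, but I get the error daemonic processes are not allowed to have children.
I think that if I can limit the number of concurrent threads, my solution will work as designed.

Code: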

import os
import re
import shutil
from multiprocessing import Process, Manager

CheckLocations = ['network drive location 1', 'network drive location 2']
SaveLocation = 'local PC location'
FileNameRegex = re.compile('RegEx here', flags = re.IGNORECASE)


# Loop through all items in folder, and call itself for subfolders.
def ParseFolderContents(path, DebugFileList):

    FolderList = []
    jobs = []
    TempList = []

    if not os.path.exists(path):
        return

    try:

        for item in os.scandir(path):

            try:

                if item.is_dir():
                    p = Process(target=ParseFolderContents, args=(item.path, DebugFileList))
                    jobs.append(p)
                    p.start()

                elif FileNameRegex.search(item.name) != None:
                    DebugFileList.append((path, item.name))

                else:
                    pass

            except Exception as ex:
                if hasattr(ex, 'message'):
                    print(ex.message)
                else:
                    print(ex)
                    # print('Error in file:\t' + item.path)

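    except Exception as ex:
        if hasattr(ex, 'message'):
            print(ex.message)
        else:
            print('Error in path:\t' + path)
            # A second, dangling `else:` branch followed here and made the
            # script a syntax error; its message is kept as a comment.
            # print('\tToo many threads to restart directory.')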

    for job in jobs:
        job.join()


# Save list of debug files.
def SaveDebugFiles(DebugFileList):

    for file in DebugFileList:
        try:
            shutil.copyfile(file[0] + '\\' + file[1], SaveLocation + file[1])
        except PermissionError:
            continue


if __name__ == '__main__':

    with Manager() as manager:

        # Iterate through all directories to make a list of all desired files.
        DebugFileList = manager.list()
        jobs = []

        for path in CheckLocations:
            p = Process(target=ParseFolderContents, args=(path, DebugFileList))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()

        print('\n' + str(len(DebugFileList)) + ' files found.\n')
        if len(DebugFileList) == 0:
            quit()

        # Iterate through all debug files and copy them to local PC.
        n = 25 # Number of files to grab for each parallel path.
        TempList = [DebugFileList[i:i + n] for i in range(0, len(DebugFileList), n)] # Split list into small chunks.
        jobs = []

        for item in TempList:
            p = Process(target=SaveDebugFiles, args=(item, ))
            jobs.append(p)
            p.start()

        for job in jobs:
            job.join()

Answer 1

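Don't dismiss the usefulness of pools, especially when you want to control the number of processes that get created. They also take care of managing your workers (create/start/join/distribute chunks of work) and help you collect potential results.

As you have realized yourself, you create so many processes that you exhaust system resources to the point where no more processes can be created.

In addition, the creation of new processes in your code is controlled by outside factors, namely the number of folders in your file trees, which makes it very difficult to limit the number of processes. Creating a new process also comes with a fair amount of overhead on the OS, and you may even end up wasting that overhead on empty directories. On top of that, context switches between processes are quite costly.

With the number of processes you create, given the number of folders you mention, your processes will basically sit idle most of the time while they wait for a share of CPU time to actually do some work. There will be a lot of contention for that CPU time unless you have a supercomputer with thousands of cores at your disposal. And even when a process does get some CPU time, it will likely spend quite a bit of it waiting for I/O.

That being said, you will probably want to look into using threads for this kind of task. You could also do some optimization in your code. From your example, I don't see any reason to split identifying the files to copy and actually copying them into different tasks. Why not let your workers copy each file that matches the RE as soon as they find it?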
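To make the point about threads and a bounded worker count concrete, here is a minimal sketch of one way to do the traversal with a fixed number of threads and no pool. It is not code from the question or from this answer: the names find_matching_files, pending and num_workers are made up for illustration. Subdirectories are put on a queue instead of being handed to a new process, so the number of threads stays at num_workers no matter how many folders exist.

```python
import os
import queue
import re
import threading


def find_matching_files(top_dirs, pattern, num_workers=8):
    """Scan the trees under top_dirs with a fixed number of threads.

    Hypothetical helper: returns a list of (directory, file name) pairs
    whose file name matches the compiled regex `pattern`.
    """
    pending = queue.Queue()   # directories that still have to be scanned
    matches = []              # results shared by all workers
    lock = threading.Lock()   # guards `matches`

    def worker():
        while True:
            path = pending.get()
            if path is None:              # sentinel: no more work
                pending.task_done()
                return
            try:
                for entry in os.scandir(path):
                    if entry.is_dir(follow_symlinks=False):
                        pending.put(entry.path)   # queue it, do not spawn
                    elif pattern.search(entry.name):
                        with lock:
                            matches.append((path, entry.name))
            except OSError as ex:
                print('Error in path:\t' + path + '\t' + str(ex))
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for top in top_dirs:
        pending.put(top)

    pending.join()            # returns once every queued directory is done
    for _ in threads:
        pending.put(None)     # one sentinel per worker
    for t in threads:
        t.join()
    return matches
```

A call such as find_matching_files(CheckLocations, FileNameRegex) would stand in for the recursive Process calls in the question. Because a subdirectory is queued before its parent is marked done, pending.join() cannot return while work is still outstanding.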

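I would use os.walk from the main thread (which I assume is reasonably fast) to build a list of the files in the relevant directories, then hand that list to a pool of workers that checks each file for a match and copies the matching ones right away: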

import os
import re
from multiprocessing.pool import ThreadPool

search_dirs = ["dir 1", "dir2"]
ptn = re.compile(r"your regex")
# your target dir definition

file_list = []

for topdir in search_dirs:
    for root, dirs, files in os.walk(topdir):
        for file in files:
            file_list.append(os.path.join(root, file))

def copier(path):
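    # search the file name only, as the question does; match() on the
    # full path would anchor the pattern at the start of the directory part
    if ptn.search(os.path.basename(path)):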
        # do your shutil.copyfile with the try-except right here
        # obviously I did not want to start mindlessly copying around files on my box :)
        return path

with ThreadPool(processes=10) as pool:
    results = pool.map(copier, file_list)

# print all the processed files. For those that did not match, None is returned
print("\n".join([r for r in results if r]))
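The placeholder comment in copier could be filled in roughly as follows. This is a sketch under assumptions, not the answerer's code: copy_if_match and copy_matching are hypothetical names, the pattern is checked against the file name only (as in the question), and the destination folder is passed in explicitly.

```python
import functools
import os
import re
import shutil
from multiprocessing.pool import ThreadPool


def copy_if_match(path, pattern, target_dir):
    """Copy `path` into `target_dir` if its file name matches `pattern`.

    Returns the source path on success, None otherwise.
    """
    name = os.path.basename(path)
    if not pattern.search(name):
        return None
    try:
        shutil.copyfile(path, os.path.join(target_dir, name))
    except OSError as ex:   # PermissionError is a subclass of OSError
        print("Could not copy " + path + ": " + str(ex))
        return None
    return path


def copy_matching(file_list, pattern, target_dir, workers=10):
    """Run copy_if_match over file_list with a bounded number of threads."""
    job = functools.partial(copy_if_match, pattern=pattern,
                            target_dir=target_dir)
    with ThreadPool(processes=workers) as pool:
        return [p for p in pool.map(job, file_list) if p]
```

With the file_list built above, copy_matching(file_list, ptn, "local PC location") would return the paths that were copied. Note that files with the same name in different source directories overwrite each other in the target folder, exactly as they would in the question's code.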

As a side note: don't concatenate paths by hand ( file[0] + "\\" + file[1] ); use os.path.join instead.
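A small illustration of why this matters for the question's SaveLocation + file[1]. The folder and file names below are made up, and ntpath (the module behind os.path on Windows) is used directly so that the result is the same on any OS; in a real script os.path.join is the call to use.

```python
import ntpath  # what os.path resolves to on Windows

save_location = 'C:\\debug'   # hypothetical folder without a trailing separator
file_name = 'a.log'           # hypothetical file name

# Plain concatenation silently loses the separator.
concatenated = save_location + file_name

# join inserts the separator only where one is needed.
joined = ntpath.join(save_location, file_name)
joined_trailing = ntpath.join(save_location + '\\', file_name)

print(concatenated)      # C:\debuga.log
print(joined)            # C:\debug\a.log
print(joined_trailing)   # C:\debug\a.log
```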

Answer 2

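I was not able to get this to work exactly as I wanted. os.walk was slow, and every other method I thought of was either similarly slow or crashed because of too many threads.

I ended up using a method similar to the one I posted above, but instead of starting the recursion at the top-level directory, it goes down one or two levels until there are several directories. It then starts the recursion in each of these directories in series, which limits the number of threads enough for the run to finish successfully. Execution time is similar to os.walk, which would probably make for a simpler and more readable implementation.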
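No code was posted for this, so the following is only one possible reading of the description. All names (split_start_dirs, scan_in_series, walk_subtree, min_dirs, max_depth) are hypothetical. The sketch descends until a level holds several directories, checks the files in the directories it passed on the way down, and then processes the chosen subtrees one after another. The scan_subtree argument is where a recursive routine such as the question's would be plugged in; a plain os.walk version is the default here.

```python
import os
import re


def split_start_dirs(top, min_dirs=2, max_depth=2):
    """Descend from `top` until a level has at least `min_dirs` directories.

    Returns (start_dirs, passed_dirs): the subtrees to process in series,
    and the directories passed on the way down.
    """
    level, passed = [top], []
    for _ in range(max_depth):
        if len(level) >= min_dirs:
            break
        next_level = []
        for d in level:
            next_level.extend(e.path for e in os.scandir(d)
                              if e.is_dir(follow_symlinks=False))
        if not next_level:
            break                 # nothing below; treat `level` as subtrees
        passed.extend(level)
        level = next_level
    return level, passed


def walk_subtree(path, pattern):
    """Default per-subtree scan: a serial os.walk."""
    return [(root, f) for root, _, files in os.walk(path)
            for f in files if pattern.search(f)]


def scan_in_series(top, pattern, scan_subtree=walk_subtree):
    """Scan `top`, starting the recursion one subtree at a time."""
    start_dirs, passed = split_start_dirs(top)
    matches = []
    for d in passed:              # files sitting above the start level
        matches.extend((d, e.name) for e in os.scandir(d)
                       if e.is_file() and pattern.search(e.name))
    for d in start_dirs:          # one subtree at a time
        matches.extend(scan_subtree(d, pattern))
    return matches
```

Because only one subtree is active at any moment, the number of concurrent workers is bounded by the size of the largest subtree rather than by the whole drive.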

Solution 1
0 2018-10-10 20:00:46

Solution 2
0 Accepted 2018-10-25 16:26:54
