[英]Python multiprocessing - sharing large dataset
I'm trying to speed up a CPU-bound Python script (on Windows 11). Threads in Python don't seem to run on different CPUs (cores), so multiprocessing is my only option.
I have a large dictionary data structure (11GB memory footprint once loaded from file), and I need to check whether calculated values are in that dictionary. The input for the calculations also comes from a file (100GB in size). That input I can pool-map to the processes in chunks, no problem. But I can't copy the dictionary to all the processes, because there isn't enough memory for that. So I need to find a way for the processes to check whether a value (actually a string) is in the dictionary.
Any suggestions?
Pseudo program flow:
--main--
- load dictionary structure from file # 11GB memory footprint
- ...
- While not all chunks loaded
- Load chunk of calcdata from file # (10,000 lines per chunk)
- Distribute (map) calcdata-chunk to processes
- Wait for processes to complete all chunks
--process--
- for each element in subchunk
- perform calculation
- check if calculation in dictionary # here is my problem!
- store result in file
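A runnable sketch of the flow above, with an in-memory list standing in for the 100GB input file and a small set of precomputed keys standing in for the 11GB dictionary (all names here are illustrative, not the original code):

```python
from itertools import islice
from multiprocessing import Pool

lookup = {"4", "9", "16"}  # stands in for the 11GB dictionary

def calculations(x):
    # placeholder for the real per-element calculation
    return str(int(x) ** 2)

def check_chunk(chunk):
    # return the elements whose calculated value is in the lookup structure
    return [x for x in chunk if calculations(x) in lookup]

def chunks(iterable, size):
    # yield successive chunks of `size` elements, like the "load chunk" step
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def run(lines, chunksize=2, processes=2):
    hits = []
    with Pool(processes) as pool:
        # imap distributes one chunk per task and yields results in order
        for found in pool.imap(check_chunk, chunks(lines, chunksize)):
            hits.extend(found)
    return hits

if __name__ == "__main__":
    print(run(["1", "2", "3", "4", "5"]))  # → ['2', '3', '4']
```

This only works as-is because `lookup` is small enough to live in every worker; the question (and the answer below) is about what to do when it is not.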
Edit, after following up on the comments below, I'm now at:
# imports assumed by this snippet (not shown in the original post)
import os
import time
import datetime
import multiprocessing
from multiprocessing import Manager

def ReadDictFromFile():
    cnt = 0
    print("Reading dictionary from " + dictfilename)
    with open(dictfilename, encoding="utf-8", errors="replace") as f:
        next(f)  # skip first line (header)
        for line in f:
            s = line.rstrip("\n")
            (key, keyvalue) = s.split()
            shared_dict[str(key)] = keyvalue
            cnt = cnt + 1
            if (cnt % 1000000) == 0:  # log progress every 1,000,000 lines
                print(cnt)
                return  # temp to speed up testing, not loading the whole dictionary atm
    print("Done loading dictionary")
def checkqlist(qlist):
    print(str(os.getpid()) + "-" + str(len(qlist)))
    for li in qlist:
        try:
            checkvalue = calculations(li)
            (found, keyval) = InMem(checkvalue)
            if found:
                print("FOUND!!! " + checkvalue + " " + keyval)  # was `keyvalue`, which is undefined here
        except Exception as e:
            print("(" + str(os.getpid()) + ") Error log: %s" % repr(e))
            time.sleep(15)
def InMem(checkvalue):
    if checkvalue in shared_dict:
        return True, shared_dict[checkvalue]
    else:
        return False, ""
if __name__ == "__main__":
    start_time = time.time()
    global shared_dict
    manager = Manager()
    shared_dict = manager.dict()
    ReadDictFromFile()
    chunksize = 5
    nr_of_processes = 10
    with open(filetocheck, encoding="utf-8", errors="replace") as f:
        qlist = []
        for line in f:
            s = line.rstrip("\n")
            qlist.append(s)
            if len(qlist) >= (chunksize * nr_of_processes):
                chunked_list = [qlist[i:i + chunksize] for i in range(0, len(qlist), chunksize)]  # was `chunk_size`, which is undefined
                try:
                    with multiprocessing.Pool() as pool:
                        pool.map(checkqlist, chunked_list, nr_of_processes)  # problem: qlist is a single string, not a list of about 416 strings.
                except Exception as e:
                    print("error log: %s" % repr(e))
                    time.sleep(15)
                qlist = []  # reset for the next batch
    logit("Completed! " + datetime.datetime.now().strftime("%I:%M%p on %B %d, %Y"))
    print("--- %s seconds ---" % (time.time() - start_time))
You can use multiprocessing.Manager.dict for this. It is the fastest IPC you can use to do the membership check between processes in Python, and to cut the memory footprint you can just change all the values to None. On my PC it can do about 33k membership checks per second... which is roughly 400 times slower than a normal dict.
manager = Manager()
shared_dict = manager.dict()
shared_dict.update({x:None for x in main_dictionary})
shared_dict["new_element"] = None # to set another value
del shared_dict["new_element"] # to delete a certain value
You could also use a dedicated in-memory database for this, such as redis, which can handle polling by multiple processes at the same time.
@Sam Mason's suggestion to use WSL and fork might be better, but this one is the most portable.
Edit: to store it in the children's global scope, you have to pass it through the initializer.
def define_global(var):
global shared_dict
shared_dict = var
...
if __name__ == "__main__":
...
    with multiprocessing.Pool(initializer=define_global, initargs=(shared_dict,)) as pool:
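Put together, the initializer pattern looks like this (a self-contained sketch; the two-key dict stands in for the real 11GB one, and `run_demo` is an illustrative name, not part of the original answer):

```python
from multiprocessing import Manager, Pool

def define_global(var):
    # each worker stores the Manager proxy in its own global scope
    global shared_dict
    shared_dict = var

def InMem(checkvalue):
    # same membership check as in the question, now against the shared proxy
    if checkvalue in shared_dict:
        return True, shared_dict[checkvalue]
    return False, ""

def run_demo():
    manager = Manager()
    shared = manager.dict({"abc": None, "xyz": None})  # values set to None to save memory
    with Pool(2, initializer=define_global, initargs=(shared,)) as pool:
        return pool.map(InMem, ["abc", "missing"])

if __name__ == "__main__":
    print(run_demo())  # → [(True, None), (False, '')]
```

The initializer runs once per worker process, so every call to InMem inside the pool sees the same shared dictionary proxy; this also works on Windows, where child processes are spawned rather than forked and do not inherit the parent's globals.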
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you need to repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.