[英]Python multiprocessing - sharing large dataset
I'm trying to speed up a CPU-bound Python script (on Windows 11). Threads in Python don't seem to run on different CPUs (cores), so multiprocessing is my only option.
I have a large dictionary data structure (11GB memory footprint once loaded from file), and I need to check whether calculated values are in that dictionary. The input for the calculations also comes from a file (100GB in size). That input I can pool-map to the processes in chunks, no problem. But I can't copy the dictionary to all the processes, because there isn't enough memory for that. So I need to find a way for the processes to check whether a value (actually a string) is in the dictionary.
Any suggestions?
Pseudo program flow:
--main--
- load dictionary structure from file # 11GB memory footprint
- ...
- While not all chunks loaded
- Load chunk of calcdata from file # (10,000 lines per chunk)
- Distribute (map) calcdata-chunk to processes
- Wait for processes to complete all chunks
--process--
- for each element in subchunk
- perform calculation
- check if calculation in dictionary # here is my problem!
- store result in file
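A runnable sketch of the flow above, with an in-memory list standing in for the 100GB input file and a small set of precomputed keys standing in for the 11GB dictionary (all names here are illustrative, not the original code):

```python
from itertools import islice
from multiprocessing import Pool

lookup = {"4", "9", "16"}  # stands in for the 11GB dictionary

def calculations(x):
    # placeholder for the real per-element calculation
    return str(int(x) ** 2)

def check_chunk(chunk):
    # return the elements whose calculated value is in the lookup structure
    return [x for x in chunk if calculations(x) in lookup]

def chunks(iterable, size):
    # yield successive chunks of `size` elements, like the "load chunk" step
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def run(lines, chunksize=2, processes=2):
    hits = []
    with Pool(processes) as pool:
        # imap distributes one chunk per task and yields results in order
        for found in pool.imap(check_chunk, chunks(lines, chunksize)):
            hits.extend(found)
    return hits

if __name__ == "__main__":
    print(run(["1", "2", "3", "4", "5"]))  # → ['2', '3', '4']
```

This only works as-is because `lookup` is small enough to live in every worker; the question (and the answer below) is about what to do when it is not.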
Edit, after following up on the comments below, I'm now at:
# imports assumed by this snippet (not shown in the original post)
import os
import time
import datetime
import multiprocessing
from multiprocessing import Manager

def ReadDictFromFile():
    cnt = 0
    print("Reading dictionary from " + dictfilename)
    with open(dictfilename, encoding="utf-8", errors="replace") as f:
        next(f)  # skip first line (header)
        for line in f:
            s = line.rstrip("\n")
            (key, keyvalue) = s.split()
            shared_dict[str(key)] = keyvalue
            cnt = cnt + 1
            if (cnt % 1000000) == 0:  # log progress every 1,000,000 lines
                print(cnt)
                return  # temp to speed up testing, not loading the whole dictionary atm
    print("Done loading dictionary")
def checkqlist(qlist):
    print(str(os.getpid()) + "-" + str(len(qlist)))
    for li in qlist:
        try:
            checkvalue = calculations(li)
            (found, keyval) = InMem(checkvalue)
            if found:
                print("FOUND!!! " + checkvalue + " " + keyval)  # was `keyvalue`, which is undefined here
        except Exception as e:
            print("(" + str(os.getpid()) + ") Error log: %s" % repr(e))
            time.sleep(15)
def InMem(checkvalue):
    if checkvalue in shared_dict:
        return True, shared_dict[checkvalue]
    else:
        return False, ""
if __name__ == "__main__":
    start_time = time.time()
    global shared_dict
    manager = Manager()
    shared_dict = manager.dict()
    ReadDictFromFile()
    chunksize = 5
    nr_of_processes = 10
    with open(filetocheck, encoding="utf-8", errors="replace") as f:
        qlist = []
        for line in f:
            s = line.rstrip("\n")
            qlist.append(s)
            if len(qlist) >= (chunksize * nr_of_processes):
                chunked_list = [qlist[i:i + chunksize] for i in range(0, len(qlist), chunksize)]  # was `chunk_size`, which is undefined
                try:
                    with multiprocessing.Pool() as pool:
                        pool.map(checkqlist, chunked_list, nr_of_processes)  # problem: qlist is a single string, not a list of about 416 strings.
                except Exception as e:
                    print("error log: %s" % repr(e))
                    time.sleep(15)
                qlist = []  # reset for the next batch
    logit("Completed! " + datetime.datetime.now().strftime("%I:%M%p on %B %d, %Y"))
    print("--- %s seconds ---" % (time.time() - start_time))
You can use multiprocessing.Manager.dict for this. It is the fastest IPC you can use to do the membership check between processes in Python, and to cut the memory footprint you can just change all the values to None. On my PC it can do about 33k membership checks per second... which is roughly 400 times slower than a normal dict.
manager = Manager()
shared_dict = manager.dict()
shared_dict.update({x:None for x in main_dictionary})
shared_dict["new_element"] = None # to set another value
del shared_dict["new_element"] # to delete a certain value
You could also use a dedicated in-memory database for this, such as redis, which can handle polling by multiple processes at the same time.
@Sam Mason's suggestion to use WSL and fork might be better, but this one is the most portable.
Edit: to store it in the children's global scope, you have to pass it through the initializer.
def define_global(var):
global shared_dict
shared_dict = var
...
if __name__ == "__main__":
...
    with multiprocessing.Pool(initializer=define_global, initargs=(shared_dict,)) as pool:
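Put together, the initializer pattern looks like this (a self-contained sketch; the two-key dict stands in for the real 11GB one, and `run_demo` is an illustrative name, not part of the original answer):

```python
from multiprocessing import Manager, Pool

def define_global(var):
    # each worker stores the Manager proxy in its own global scope
    global shared_dict
    shared_dict = var

def InMem(checkvalue):
    # same membership check as in the question, now against the shared proxy
    if checkvalue in shared_dict:
        return True, shared_dict[checkvalue]
    return False, ""

def run_demo():
    manager = Manager()
    shared = manager.dict({"abc": None, "xyz": None})  # values set to None to save memory
    with Pool(2, initializer=define_global, initargs=(shared,)) as pool:
        return pool.map(InMem, ["abc", "missing"])

if __name__ == "__main__":
    print(run_demo())  # → [(True, None), (False, '')]
```

The initializer runs once per worker process, so every call to InMem inside the pool sees the same shared dictionary proxy; this also works on Windows, where child processes are spawned rather than forked and do not inherit the parent's globals.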
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you need to repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.