
Python - multiprocessing a nested dictionary

I have tried using this question to answer my problem, but I haven't had any success with it.

I am using Python 3.10.

My dictionary is structured like this (each list of strings is the reviews for a product):

{storeNameA : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string], 
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]},
 storeNameB : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string], 
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]}}

So I would access a single "review" like dictionary['storeNameA']['productB'][0] or dictionary['storeNameB']['productB'][2]. The products are the same for every store.

I am trying to run a process on every review in the whole dictionary. I can do this successfully in an iterative way with the following code:

def mapAllValues(nestedDict, func):
    return {storeName: {product: func(prodFile) for product, prodFile in storeDict.items()} for storeName, storeDict in nestedDict.items()}

new_dictionary = mapAllValues(dictionary, lambda reviews: reviews.apply(processFunction))
# processFunction takes a list of strings and returns a list of tuples.
# So I end up with a new dictionary where there is now a list of tuples where there was a list of strings.
# {storeName : {product : 0 [(str, str), (str, str), ..., (str, str)]    and so on...

It is a very long dictionary, and this takes about 606 seconds to complete.
So I have tried to implement a way to run it in parallel, but it is obviously not working as I expected, because it runs in about 2170 seconds. I do get the correct output, though.

My question is: what am I doing wrong in the code below? Can anyone suggest a way to fix this?

import multiprocessing

manager = multiprocessing.Manager()
d = manager.dict(dictionary)
container = manager.dict()
for key in d:
    container[key] = manager.dict()
for key in d['storeNameA']:
    container['storeNameA'][key] = manager.dict()
for key in d['storeNameB']:
    container['storeNameB'][key] = manager.dict()

with multiprocessing.Pool() as pool:
    pool.starmap(processFunction, [('storeNameA', product, d, container) for product in d['storeNameA']], chunksize=round(42739 / multiprocessing.cpu_count()))
    pool.starmap(processFunction, [('storeNameB', product, d, container) for product in d['storeNameB']], chunksize=round(198560 / multiprocessing.cpu_count()))

new_dictionary = dict(container)

I'm sure I'm misunderstanding how this actually works, but as I see it, it should chunk every product of each store and parallelize over them?

Anyway, I think I've explained it as best I can. Let me know if I need to clarify anything.
Thank you in advance!

First of all, while creating managers is relatively cheap, accessing them can become very expensive if you don't know how they work. Long story short, a manager spawns a separate process and allows other processes to run commands on any object stored inside that process. Those commands are read sequentially (execution may be somewhat parallel, since managers use threads internally).

So if two or more processes try to access a managed object (the dictionary in this case) at the same time, one of them will block until the other's request has been read. Managers are therefore not ideal (although very useful) with multiprocessing, and when the parallel processes need to access the managed object regularly, as I assume is the case with processFunction here, things definitely need rethinking.
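
To make that cost concrete, here is a minimal, self-contained sketch (not taken from your code) that times lookups on a manager.dict() against a plain dict; every access to the managed dictionary is a round trip to the manager process, which is where the overhead comes from:

import multiprocessing
import time

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    managed = manager.dict({i: i for i in range(1000)})
    plain = {i: i for i in range(1000)}

    t = time.time()
    for i in range(1000):
        _ = managed[i]  # each lookup is a message sent to the manager process
    print(f"manager.dict lookups took {time.time() - t:.4f}s")

    t = time.time()
    for i in range(1000):
        _ = plain[i]    # ordinary local memory access
    print(f"plain dict lookups took   {time.time() - t:.4f}s")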

That said, you don't even need a manager here. From the looks of it, processFunction is a localized function that does not care about the state of the whole dictionary. So rather than worrying about creating shared memory for the pool to access, you should only be concerned with stitching the pool's return values into the main dictionary from the main process itself (remember that the pool automatically passes each task's return value back to the main process once it finishes).

Here is one way you could do that, using an example dictionary and processFunction, along with a benchmark comparing the speed against doing the same task serially.

from multiprocessing import Pool
import string, random, time

def review_generator(size=10):
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def processFunc(product, prodFile):
    # Return a tuple of the product name and the altered value (a list of tuples)
    return product, [[(element, review_generator()) for element in review] for review in prodFile]


if __name__ == "__main__":

    # Generate example dictionary
    dictionary = {'storeNameA': {}, 'storeNameB': {}}
    for key, _ in dictionary.items():
        for prod_i in range(1000):
            prod = f'product{prod_i}'
            dictionary[key][prod] = [[review_generator() for _ in range(50)] for _ in range(5)]

    # Time the parallel approach
    t = time.time()
    with Pool() as pool:
        a = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()])
        b = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()])

    print(f"Parallel Approach took {time.time() - t}")

    # Time the serial approach
    t = time.time()

    a = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()]
    b = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()]

    print(f"Serial approach took {time.time() - t}")

Output

Parallel Approach took 1.5318272113800049
Serial approach took 5.765411615371704

Once you have the results for each store in a and b from the example processFunction, you can build the new dictionary in the main process itself:

new_dictionary = {'storeNameA': {}, 'storeNameB': {}}
for product, prodFile in a:
    new_dictionary['storeNameA'][product] = prodFile
for product, prodFile in b:
    new_dictionary['storeNameB'][product] = prodFile

I would also encourage you to try different ways of distributing tasks to the pool's workers (for example imap) to see whether they suit your use case better and are more efficient; a rough sketch follows below.
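
As an untested sketch of that idea (assuming the processFunc defined in the example above), the same per-product work could be fed to imap_unordered; it only accepts a single-argument callable, so the (product, prodFile) pairs are packed into one item, and results are collected as workers finish rather than in submission order:

from multiprocessing import Pool

def _process_item(item):
    # Unpack the (product, prodFile) pair so processFunc keeps its two-argument signature
    product, prodFile = item
    return processFunc(product, prodFile)

def run_store(store_dict, chunksize=16):
    # Process one store's sub-dictionary and return {product: processed reviews}
    with Pool() as pool:
        return dict(pool.imap_unordered(_process_item, store_dict.items(), chunksize=chunksize))

Called as run_store(dictionary['storeNameA']), this builds the per-store result dictionary directly; whether it beats starmap depends mostly on how uneven the products are in size.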

Many thanks to @Charchit and their answer; I have got this working. It now runs over my huge dataset in ~154 seconds, where it took ~606 seconds iteratively.

Here is the final code. It is very similar to @Charchit's answer above, with a few changes.

import multiprocessing

def processFunction(product, listOfReviews):
    # This function handles every review for each product
    toReturn = []
    for review in listOfReviews:
        X = ...  # Do something here...
        toReturn.append(X)
        # X is now a list of tuples [(str, str), (str, str), ...]

    # toReturn is now a list of lists
    return product, toReturn

if __name__ == "__main__":

original_dictionary = dict() 
# Where this would be the VERY large dictionary I have. See the structure in my original question.

    new_dictionary = dict()
    for key in original_dictionary:
        new_dictionary[key] = dict()
    for key in original_dictionary['storeNameA']:
        new_dictionary['storeNameA'][key] = list()
    for key in original_dictionary['storeNameB']:
        new_dictionary['storeNameB'][key] = list()

    with multiprocessing.Pool() as pool:
        a = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameA'].items()])
        b = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameB'].items()])

    for product, reviews in a:
        new_dictionary['storeNameA'][product] = reviews
    for product, reviews in b:
        new_dictionary['storeNameB'][product] = reviews

Thanks again, @Charchit!
