
Python - multiprocessing a nested dictionary

I have tried using this question to answer my problem, but I haven't had any success with it.

I am using Python 3.10.

My dictionary is structured like this (each list of strings is the reviews for a product):

{storeNameA : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string], 
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]},
 storeNameB : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string], 
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]}}

So I would access a single "review" like dictionary['storeNameA']['productB'][0] or dictionary['storeNameB']['productB'][2]. The products are the same for every store.

I am trying to run a process on every review in the whole dictionary. I can do this successfully in an iterative way with the following code:

def mapAllValues(nestedDict, func):
    return {storeName: {product: func(prodFile) for product, prodFile in storeDict.items()} for storeName, storeDict in nestedDict.items()}

new_dictionary = mapAllValues(dictionary, lambda reviews: reviews.apply(processFunction))
# processFunction takes a list of strings and returns a list of tuples.
# So I end up with a new dictionary where there is now a list of tuples where there was a list of strings.
# {storeName : {product : 0 [(str, str), (str, str), ..., (str, str)]    and so on...

It is a very long dictionary, and this takes about 606 seconds to complete.
So I have tried to implement a way to run it in parallel, but it is obviously not working as I expected, because it runs in about 2170 seconds. I do get the correct output, though.

My question is: what am I doing wrong in the code below? Can anyone suggest a way to fix this?

import multiprocessing

manager = multiprocessing.Manager()
d = manager.dict(dictionary)
container = manager.dict()
for key in d:
    container[key] = manager.dict()
for key in d['storeNameA']:
    container['storeNameA'][key] = manager.dict()
for key in d['storeNameB']:
    container['storeNameB'][key] = manager.dict()

with multiprocessing.Pool() as pool:
    pool.starmap(processFunction, [('storeNameA', product, d, container) for product in d['storeNameA']], chunksize=round(42739 / multiprocessing.cpu_count()))
    pool.starmap(processFunction, [('storeNameB', product, d, container) for product in d['storeNameB']], chunksize=round(198560 / multiprocessing.cpu_count()))

new_dictionary = dict(container)

I'm sure I'm misunderstanding how this actually works, but as I see it, it should chunk every product of each store and parallelize over them?

Anyway, I think I've explained it as best I can. Let me know if I need to clarify anything.
Thank you in advance!

First of all, while creating managers is relatively cheap, accessing them can become very expensive if you don't know how they work. Long story short, a manager spawns a separate process and allows other processes to run commands on any object stored inside that process. Those commands are read sequentially (execution may be somewhat parallel, since managers use threads internally).

So if two or more processes try to access a managed object (the dictionary in this case) at the same time, one of them will block until the other's request has been read. Managers are therefore not ideal (although very useful) with multiprocessing, and when the parallel processes need to access the managed object regularly, as I assume is the case with processFunction here, things definitely need rethinking.
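
To make that cost concrete, here is a minimal, self-contained sketch (not taken from your code) that times lookups on a manager.dict() against a plain dict; every access to the managed dictionary is a round trip to the manager process, which is where the overhead comes from:

import multiprocessing
import time

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    managed = manager.dict({i: i for i in range(1000)})
    plain = {i: i for i in range(1000)}

    t = time.time()
    for i in range(1000):
        _ = managed[i]  # each lookup is a message sent to the manager process
    print(f"manager.dict lookups took {time.time() - t:.4f}s")

    t = time.time()
    for i in range(1000):
        _ = plain[i]    # ordinary local memory access
    print(f"plain dict lookups took   {time.time() - t:.4f}s")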

That said, you don't even need a manager here. From the looks of it, processFunction is a localized function that does not care about the state of the whole dictionary. So rather than worrying about creating shared memory for the pool to access, you should only be concerned with stitching the pool's return values into the main dictionary from the main process itself (remember that the pool automatically passes each task's return value back to the main process once it finishes).

Here is one way you could do that, using an example dictionary and processFunction, along with a benchmark comparing the speed against doing the same task serially.

from multiprocessing import Pool
import string, random, time

def review_generator(size=10):
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def processFunc(product, prodFile):
    # Return a tuple of the product name and the altered value (a list of tuples)
    return product, [[(element, review_generator()) for element in review] for review in prodFile]


if __name__ == "__main__":

    # Generate example dictionary
    dictionary = {'storeNameA': {}, 'storeNameB': {}}
    for key, _ in dictionary.items():
        for prod_i in range(1000):
            prod = f'product{prod_i}'
            dictionary[key][prod] = [[review_generator() for _ in range(50)] for _ in range(5)]

    # Time the parallel approach
    t = time.time()
    with Pool() as pool:
        a = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()])
        b = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()])

    print(f"Parallel Approach took {time.time() - t}")

    # Time the serial approach
    t = time.time()

    a = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()]
    b = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()]

    print(f"Serial approach took {time.time() - t}")

Output

Parallel Approach took 1.5318272113800049
Serial approach took 5.765411615371704

Once you have the results for each store in a and b from the example processFunction, you can build the new dictionary in the main process itself:

new_dictionary = {'storeNameA': {}, 'storeNameB': {}}
for product, prodFile in a:
    new_dictionary['storeNameA'][product] = prodFile
for product, prodFile in b:
    new_dictionary['storeNameB'][product] = prodFile

I would also encourage you to try different ways of distributing tasks to the pool's workers (for example imap) to see whether they suit your use case better and are more efficient; a rough sketch follows below.
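
As an untested sketch of that idea (assuming the processFunc defined in the example above), the same per-product work could be fed to imap_unordered; it only accepts a single-argument callable, so the (product, prodFile) pairs are packed into one item, and results are collected as workers finish rather than in submission order:

from multiprocessing import Pool

def _process_item(item):
    # Unpack the (product, prodFile) pair so processFunc keeps its two-argument signature
    product, prodFile = item
    return processFunc(product, prodFile)

def run_store(store_dict, chunksize=16):
    # Process one store's sub-dictionary and return {product: processed reviews}
    with Pool() as pool:
        return dict(pool.imap_unordered(_process_item, store_dict.items(), chunksize=chunksize))

Called as run_store(dictionary['storeNameA']), this builds the per-store result dictionary directly; whether it beats starmap depends mostly on how uneven the products are in size.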

Many thanks to @Charchit and their answer; I have got this working. It now runs over my huge dataset in ~154 seconds, where it took ~606 seconds iteratively.

Here is the final code. It is very similar to @Charchit's answer above, with a few changes.

import multiprocessing

def processFunction(product, listOfReviews):
    # This function handles every review for each product
    toReturn = []
    for review in listOfReviews:
        X = ...  # Do something here...
        toReturn.append(X)
        # X is now a list of tuples [(str, str), (str, str), ...]

    # toReturn is now a list of lists
    return product, toReturn

if __name__ == "__main__":

original_dictionary = dict() 
# Where this would be the VERY large dictionary I have. See the structure in my original question.

    new_dictionary = dict()
    for key in original_dictionary:
        new_dictionary[key] = dict()
    for key in original_dictionary['storeNameA']:
        new_dictionary['storeNameA'][key] = list()
    for key in original_dictionary['storeNameB']:
        new_dictionary['storeNameB'][key] = list()

    with multiprocessing.Pool() as pool:
        a = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameA'].items()])
        b = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameB'].items()])

    for product, reviews in a:
        new_dictionary['storeNameA'][product] = reviews
    for product, reviews in b:
        new_dictionary['storeNameB'][product] = reviews

Thanks again, @Charchit!
