Python - multiprocessing a nested dictionary
I've tried using this question to answer my problem, but I haven't had any success.
I'm using Python 3.10.
My dictionary is structured like this (each list of strings is the reviews for a product):
{storeNameA : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]},
 storeNameB : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]}}
So I would access a "review" like dictionary['storeNameA']['productB'][0] or dictionary['storeNameB']['productB'][2]. The products are the same for every store.
I'm trying to perform a process on every review in the whole dictionary. I can do this successfully in an iterative manner with the following code:
def mapAllValues(nestedDict, func):
    return {storeName: {product: func(prodFile) for product, prodFile in storeDict.items()}
            for storeName, storeDict in nestedDict.items()}

new_dictionary = mapAllValues(dictionary, lambda reviews: reviews.apply(processFunction))
# processFunction takes a list of strings and returns a list of tuples,
# so I end up with a new dictionary containing a list of tuples wherever there was a list of strings:
# {storeName : {product : 0 [(str, str), (str, str), ..., (str, str)] and so on...
It's a very long dictionary and takes ~606 seconds to complete.
So I've tried to implement a way to run it in parallel, but it's obviously not working as I expected, because it runs in ~2170 seconds. I do get the correct output, though.
My question is: what am I doing wrong in the code below? Can anyone provide me with a way to solve this?
manager = multiprocessing.Manager()
d = manager.dict(dictionary)
container = manager.dict()
for key in d:
    container[key] = manager.dict()
for key in d['storeNameA']:
    container['storeNameA'][key] = manager.dict()
for key in d['storeNameB']:
    container['storeNameB'][key] = manager.dict()

with multiprocessing.Pool() as pool:
    pool.starmap(processFunction, [('storeNameA', product, d, container) for product in d['storeNameA']], chunksize=round(42739 / multiprocessing.cpu_count()))
    pool.starmap(processFunction, [('storeNameB', product, d, container) for product in d['storeNameB']], chunksize=round(198560 / multiprocessing.cpu_count()))
new_dictionary = dict(container)
I'm sure I've misunderstood how this actually works, but as I see it, it should chunk each store's products and parallelize over them?
Anyway, I think I've explained it as best I can. If I need to clarify anything, please let me know.
Thanks in advance!
Firstly, while creating managers is relatively cheap, accessing them can become quite expensive if you don't know how they work. In short, they spawn a separate process and allow other processes to execute commands on any object stored inside that process. Those commands are read sequentially (the execution itself can be somewhat parallel, since they use threads internally).
Therefore, if two or more processes attempt to access a managed object (the dictionary, in this case) at the same time, one will block until the other's request has been read. So managers are not ideal (although they are very useful) in multiprocessing when the parallel processes need to access the managed object regularly (which I assume is the case with processFunction here), and some rethinking is definitely required.
Having said that, you don't even need to use a manager here. From the looks of it, processFunction is a localized function that doesn't care about the state of the whole dictionary. Therefore, rather than worrying about creating shared memory for the pool to access, you should only concern yourself with concatenating the pool's return values into the main dictionary from the main process itself (remember that a pool automatically passes the return value of each task back to the main process once it completes).
Here's one way you could do that, using a sample dictionary and processFunction, along with a benchmark comparing the speed against doing the same task serially:
from multiprocessing import Pool
import string, random, time


def review_generator(size=10):
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))


def processFunc(product, prodFile):
    # Return a tuple of the product name and the altered value (a list of tuples)
    return product, [[(element, review_generator()) for element in review] for review in prodFile]


if __name__ == "__main__":
    # Generate example dictionary
    dictionary = {'storeNameA': {}, 'storeNameB': {}}
    for key, _ in dictionary.items():
        for prod_i in range(1000):
            prod = f'product{prod_i}'
            dictionary[key][prod] = [[review_generator() for _ in range(50)] for _ in range(5)]

    # Time the parallel approach
    t = time.time()
    with Pool() as pool:
        a = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()])
        b = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()])
    print(f"Parallel Approach took {time.time() - t}")

    # Time the serial approach
    t = time.time()
    a = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()]
    b = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()]
    print(f"Serial approach took {time.time() - t}")
Output
Parallel Approach took 1.5318272113800049
Serial approach took 5.765411615371704
Once you have the results for each store from the sample processFunction in a and b, you can then create the new dictionary in the main process itself:
new_dictionary = {'storeNameA': {}, 'storeNameB': {}}
for product, prodFile in a:
    new_dictionary['storeNameA'][product] = prodFile
for product, prodFile in b:
    new_dictionary['storeNameB'][product] = prodFile
I'd also encourage you to experiment with the different variants the pool offers for distributing tasks to its workers (e.g. imap), to see whether they suit your use case better and are more efficient.
Huge thanks to @Charchit for their answer; I've got this working. It now runs on my massive dataset in ~154 seconds, versus the ~606 seconds it took iteratively.
Here is the final code, very similar to @Charchit's answer above but with a few small changes:
import multiprocessing


def processFunction(product, listOfReviews):
    # This function handles every review for each product
    toReturn = []
    for review in listOfReviews:
        X = ...  # Do something here...
        toReturn.append(X)
    # X is now a list of tuples [(str, str), (str, str), ...]
    # toReturn is now a list of lists
    return product, toReturn


if __name__ == "__main__":
    original_dictionary = dict()
    # Where this would be the VERY large dictionary I have. See the structure in my original question.

    new_dictionary = dict()
    for key in original_dictionary:
        new_dictionary[key] = dict()
    for key in original_dictionary['storeNameA']:
        new_dictionary['storeNameA'][key] = list()
    for key in original_dictionary['storeNameB']:
        new_dictionary['storeNameB'][key] = list()

    with multiprocessing.Pool() as pool:
        a = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameA'].items()])
        b = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameB'].items()])

    for product, reviews in a:
        new_dictionary['storeNameA'][product] = reviews
    for product, reviews in b:
        new_dictionary['storeNameB'][product] = reviews
Thanks again, @Charchit!