
Python - multiprocessing a nested dictionary

I have tried using this question to answer my problem, but I haven't had any success.

I am using Python 3.10.

My dictionary is structured like this (each list of strings is a review of the product):

{storeNameA : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string], 
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]},
 storeNameB : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string], 
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]}}

So I would access a "review" like dictionary['storeNameA']['productB'][0] or dictionary['storeNameB']['productB'][2]. Every product is the same for every store.

I am trying to perform a process on every review in the whole dictionary. I can do this successfully in an iterative manner with the following code:

def mapAllValues(nestedDict, func):
    return {storeName: {product: func(prodFile) for product, prodFile in storeDict.items()} for storeName, storeDict in nestedDict.items()}

new_dictionary = mapAllValues(dictionary, lambda reviews: reviews.apply(processFunction))
# processFunction takes a list of string and returns a list of tuples.
# So I end up with a new dictionary where there is now a list of tuples, where there was a list of string.
# {storeName : {product : 0 [(str, str), (str, str), ..., (str, str)]    and so on...
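For reference, here is a minimal runnable sketch of that iterative approach, with a toy processFunction standing in for the real one (which isn't shown), applied per review rather than via .apply:

```python
def mapAllValues(nestedDict, func):
    # Apply func to every product's review collection in the two-level dictionary
    return {store: {product: func(prodFile) for product, prodFile in storeDict.items()}
            for store, storeDict in nestedDict.items()}

# Toy processFunction: takes a list of strings, returns a list of tuples
# (a hypothetical stand-in for the real processFunction)
def processFunction(review):
    return [(word, word.upper()) for word in review]

dictionary = {'storeNameA': {'productA': [['good', 'item'], ['bad']]},
              'storeNameB': {'productA': [['great']]}}

new_dictionary = mapAllValues(
    dictionary,
    lambda prodFile: [processFunction(review) for review in prodFile])
# new_dictionary['storeNameA']['productA'][0] == [('good', 'GOOD'), ('item', 'ITEM')]
```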

It's a really long dictionary and takes about 606 seconds to complete.
So I've tried to implement a way to run it in parallel, but it's obviously not working as I expected, since it runs in about 2170 seconds. I do get the correct output, though.

My question is: what am I doing wrong in the code below, please? Can anyone provide me with a solution to this problem?

manager = multiprocessing.Manager()
d = manager.dict(dictionary)
container = manager.dict()
for key in d:
    container[key] = manager.dict()
for key in d['storeNameA']:
    container['storeNameA'][key] = manager.dict()
for key in d['storeNameB']:
    container['storeNameB'][key] = manager.dict()

with multiprocessing.Pool() as pool:
    pool.starmap(processFunction, [('storeNameA', product, d, container) for product in d['storeNameA']], chunksize=round(42739 / multiprocessing.cpu_count()))
    pool.starmap(processFunction, [('storeNameB', product, d, container) for product in d['storeNameB']], chunksize=round(198560 / multiprocessing.cpu_count()))

new_dictionary = dict(container)

I'm sure I'm misunderstanding how this actually works, but as I see it, it should chunk up each product of each store and parallelize over them?

Anyway, I think I've explained it as well as I can. If I need to clarify anything, please let me know.
Thanks in advance!

Firstly, while creating managers is relatively cheap, accessing them can become very expensive if you don't know how they work. In short, a manager spawns a separate process and allows other processes to execute commands on any object stored inside that process. These commands are read sequentially (the execution can be somewhat parallel, since managers use threads internally).

Therefore, if two or more processes try to access a managed object (the dictionary in this case) at the same time, one will block until the other's request has been read. So managers, while very useful, are not ideal in multiprocessing when the parallel processes need to access the managed object regularly (which I assume is the case with processFunction here); that definitely calls for some rethinking.

That said, you don't even need a manager here. From the looks of it, processFunction is a localized function that doesn't care about the state of the whole dictionary. So rather than worrying about creating shared memory for the pool to access, you should only be concerned with stitching the pool's return values into the main dictionary from the main process itself (remember that a pool automatically passes the return value of each completed task back to the main process).

Here is one way you could do that, with a sample dictionary and processFunction, along with a benchmark comparing the speed if you were to perform the same task serially:

from multiprocessing import Pool
import string, random, time

def review_generator(size=10):
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def processFunc(product, prodFile):
    # Return a tuple of the product name and the altered value (a list of tuples)
    return product, [[(element, review_generator()) for element in review] for review in prodFile]


if __name__ == "__main__":

    # Generate example dictionary
    dictionary = {'storeNameA': {}, 'storeNameB': {}}
    for key, _ in dictionary.items():
        for prod_i in range(1000):
            prod = f'product{prod_i}'
            dictionary[key][prod] = [[review_generator() for _ in range(50)] for _ in range(5)]

    # Time the parallel approach
    t = time.time()
    with Pool() as pool:
        a = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()])
        b = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()])

    print(f"Parallel Approach took {time.time() - t}")

    # Time the serial approach
    t = time.time()

    a = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()]
    b = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()]

    print(f"Serial approach took {time.time() - t}")

Output

Parallel Approach took 1.5318272113800049
Serial approach took 5.765411615371704

Once you have the results for each store from the sample processFunction in a and b, you can create the new dictionary in the main process itself:

new_dictionary = {'storeNameA': {}, 'storeNameB': {}}
for product, prodFile in a:
    new_dictionary['storeNameA'][product] = prodFile
for product, prodFile in b:
    new_dictionary['storeNameB'][product] = prodFile

I would also encourage you to try different variants of distributing tasks to the pool's workers (e.g. imap), to see whether they suit your use case better and are more efficient.

Many thanks to @Charchit and their answer; I've got this working. It now runs on my huge dataset in ~154 seconds, whereas it took ~606 seconds iteratively.

Here is the final code, very similar to @Charchit's answer above but with a few changes.

def processFunction(product, listOfReviews):
    # This function handles every review for each product
    toReturn = []
    for review in listOfReviews:
        X = ...  # Do something here...
        toReturn.append(X)
        # X is now a list of tuples [(str, str), (str, str), ...]

    # toReturn is now a list of list
    return product, toReturn

if __name__ == "__main__":

    original_dictionary = dict()
    # Where this would be the VERY large dictionary I have. See the structure in my original question.

    new_dictionary = dict()
    for key in original_dictionary:
        new_dictionary[key] = dict()
    for key in original_dictionary['storeNameA']:
        new_dictionary['storeNameA'][key] = list()
    for key in original_dictionary['storeNameB']:
        new_dictionary['storeNameB'][key] = list()

    with multiprocessing.Pool() as pool:
        a = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameA'].items()])
        b = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameB'].items()])

    for product, reviews in a:
        new_dictionary['storeNameA'][product] = reviews
    for product, reviews in b:
        new_dictionary['storeNameB'][product] = reviews

Thanks again, @Charchit!
