
Python - multiprocessing a nested dictionary

I have tried using this question to answer my problem, but I haven't had any success.

I'm using Python 3.10.

My dictionary is structured like this (where each list of strings is a review of the product):

{storeNameA : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string], 
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]},
 storeNameB : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string], 
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]}}

So I would access a single 'review' like dictionary['storeNameA']['productB'][0] or dictionary['storeNameB']['productB'][2]. Each product is the same in each store.
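For concreteness, here is a hypothetical miniature instance of that structure (the 0…n index column in the sketch above, together with the .apply call below, suggests each product's reviews are held in a pandas Series of lists; the store, product, and review strings here are made up):

import pandas as pd

# Hypothetical miniature instance of the nested structure described above:
# store name -> product name -> Series of reviews, where each review is
# itself a list of strings.
dictionary = {
    'storeNameA': {
        'productA': pd.Series([['great', 'product'], ['arrived', 'late']]),
        'productB': pd.Series([['would', 'buy', 'again']]),
    },
    'storeNameB': {
        'productA': pd.Series([['not', 'as', 'described']]),
        'productB': pd.Series([['five', 'stars']]),
    },
}

print(dictionary['storeNameA']['productB'][0])  # ['would', 'buy', 'again']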

I am trying to perform a process on each review across the entire dictionary. I can perform this successfully in an iterative manner with this code:

def mapAllValues(nestedDict, func):
    # Apply func to every product's reviews, preserving the nesting.
    return {storeName: {product: func(prodFile) for product, prodFile in storeDict.items()}
            for storeName, storeDict in nestedDict.items()}

new_dictionary = mapAllValues(dictionary, lambda reviews: reviews.apply(processFunction))
# processFunction takes a list of strings and returns a list of tuples.
# So I end up with a new dictionary where there is a list of tuples wherever there was a list of strings.
# {storeName : {product : 0 [(str, str), (str, str), ..., (str, str)]    and so on...

It's a pretty long dictionary, and takes ~606 seconds to complete.
So, I have tried to implement a way to run this in parallel, but it's obviously not working as I expect it to, because that runs in ~2170 seconds. I do get the right output though.

My question is: what am I doing wrong in the following code? Can anyone provide a solution to this problem?

manager = multiprocessing.Manager()
d = manager.dict(dictionary)
container = manager.dict()
for key in d:
    container[key] = manager.dict()
for key in d['storeNameA']:
    container['storeNameA'][key] = manager.dict()
for key in d['storeNameB']:
    container['storeNameB'][key] = manager.dict()

with multiprocessing.Pool() as pool:
    pool.starmap(processFunction, [('storeNameA', product, d, container) for product in d['storeNameA']], chunksize=round(42739 / multiprocessing.cpu_count()))
    pool.starmap(processFunction, [('storeNameB', product, d, container) for product in d['storeNameB']], chunksize=round(198560 / multiprocessing.cpu_count()))

new_dictionary = dict(container)

I'm sure I'm misunderstanding how this actually works, but as I see it, it should be chunking each product from each store and parallelising those?

Anyway, I think I've explained it as well as I can. If I need to clarify anything, please let me know.
Thank you in advance!

First of all, while creating managers is relatively cheap, accessing them can become quite expensive if you don't know how they work. Long story short, they spawn a separate process and allow other processes to execute commands on any object stored inside that process. These commands are read sequentially (execution can be somewhat parallel, since they use threading internally).

Therefore, if two or more processes attempt to access a managed object (a dictionary in this case) at the same time, one will block until the other process's request has been read. This makes managers non-ideal for multiprocessing (although very useful nonetheless), and definitely something to reconsider when the parallel processes need to regularly access the managed object (which I assume is the case here with processFunction).
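To make that overhead concrete, here is a small, illustrative timing sketch (not part of the original answer): every read or write on a manager-backed dict is an IPC round trip to the manager process, so even single-process access is dramatically slower than a plain dict.

import time
from multiprocessing import Manager

if __name__ == "__main__":
    # Writes through a manager proxy: each one is an IPC round trip.
    with Manager() as manager:
        managed = manager.dict()
        t = time.time()
        for i in range(10_000):
            managed[i] = i
        print(f"manager.dict(): {time.time() - t:.3f}s")

    # The same writes on an ordinary in-process dict.
    plain = {}
    t = time.time()
    for i in range(10_000):
        plain[i] = i
    print(f"plain dict:     {time.time() - t:.3f}s")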

With that said, here you do not even need to use managers. From the looks of it, processFunction seems like a localized function which does not care about the state of the dictionary as a whole. Therefore, you should only concern yourself with merging the return values from the pool into your main dictionary from within the main process itself, rather than worrying about creating shared memory for the pool to access (remember that a pool automatically passes the return value of each task back to the main process upon completion).

Here's a way you can do that, with a sample dictionary and processFunction, along with a benchmark comparing against doing the same task serially.

from multiprocessing import Pool
import string, random, time

def review_generator(size=10):
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))

def processFunc(product, prodFile):
    # Return a tuple of the product name and the altered value (a list of tuples)
    return product, [[(element, review_generator()) for element in review] for review in prodFile]


if __name__ == "__main__":

    # Generate example dictionary
    dictionary = {'storeNameA': {}, 'storeNameB': {}}
    for key, _ in dictionary.items():
        for prod_i in range(1000):
            prod = f'product{prod_i}'
            dictionary[key][prod] = [[review_generator() for _ in range(50)] for _ in range(5)]

    # Time the parallel approach
    t = time.time()
    with Pool() as pool:
        a = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()])
        b = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()])

    print(f"Parallel Approach took {time.time() - t}")

    # Time the serial approach
    t = time.time()

    a = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()]
    b = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()]

    print(f"Serial approach took {time.time() - t}")

Output

Parallel Approach took 1.5318272113800049
Serial approach took 5.765411615371704

Once you have the results from the sample processFunction for each store inside a and b, you can then create your new dictionary in the main process itself:

new_dictionary = {'storeNameA': {}, 'storeNameB': {}}
for product, prodFile in a:
    new_dictionary['storeNameA'][product] = prodFile
for product, prodFile in b:
    new_dictionary['storeNameB'][product] = prodFile

I would also encourage you to try the different task-assignment variants a pool offers (like imap) to see if they fit your use-case better and are more efficient.
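For instance, here is a minimal sketch of the imap variant, reusing processFunc and dictionary from the benchmark script above; processItem is a hypothetical wrapper (imap passes each item as a single argument), and the chunksize value is an arbitrary guess, not a tuned figure.

# Hypothetical imap variant of the benchmark above. This would go inside
# the same "if __name__ == '__main__':" block, after the dictionary is built.
def processItem(item):
    # imap takes a one-argument function, so unpack the (product, prodFile)
    # pair inside the wrapper.
    product, prodFile = item
    return processFunc(product, prodFile)

with Pool() as pool:
    # Results arrive lazily and in order; wrap in list() to collect them.
    a = list(pool.imap(processItem, dictionary['storeNameA'].items(), chunksize=50))
    b = list(pool.imap(processItem, dictionary['storeNameB'].items(), chunksize=50))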

With massive thanks to @Charchit and their answer, I have got this working. It is now running my huge dataset in ~154 seconds, compared to the ~606 seconds it was taking iteratively.

Here's the final code, which is very similar to @Charchit's answer above, but with some small changes.

import multiprocessing

def processFunction(product, listOfReviews):
    # This function handles every review for each product
    toReturn = []
    for review in listOfReviews:
        X = ...  # Do something here...
        # X is now a list of tuples [(str, str), (str, str), ...]
        toReturn.append(X)

    # toReturn is now a list of lists
    return product, toReturn

if __name__ == "__main__":

    original_dictionary = dict()
    # Where this would be the VERY large dictionary I have. See the structure in my original question.

    new_dictionary = dict()
    for key in original_dictionary:
        new_dictionary[key] = dict()
    for key in original_dictionary['storeNameA']:
        new_dictionary['storeNameA'][key] = list()
    for key in original_dictionary['storeNameB']:
        new_dictionary['storeNameB'][key] = list()

    with multiprocessing.Pool() as pool:
        a = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameA'].items()])
        b = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameB'].items()])

    for product, reviews in a:
        new_dictionary['storeNameA'][product] = reviews
    for product, reviews in b:
        new_dictionary['storeNameB'][product] = reviews
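As an optional simplification (not from the original post): since each processFunction call returns a (product, reviews) pair, the two merge loops at the end are equivalent to passing the results straight to the dict constructor.

    new_dictionary = {'storeNameA': dict(a), 'storeNameB': dict(b)}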

Thanks again, @Charchit!
