Python - multiprocessing a nested dictionary
I've tried using this question to answer my problem, but I haven't had any success.
I'm using Python 3.10.
My dictionary is structured like this (each list of strings is the reviews for a product):
{storeNameA : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]},
 storeNameB : {productA : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               productB : 0 [string, string, ..., string]
                          1 [string, string, ..., string]
                          2 [string, string, ..., string]
                          ...
                          n [string, string, ..., string],
               ...,
               product_n : 0 [string, string, ..., string]
                           1 [string, string, ..., string]
                           2 [string, string, ..., string]
                           ...
                           n [string, string, ..., string]}}
So I would access a "review" like dictionary['storeNameA']['productB'][0] or dictionary['storeNameB']['productB'][2]. The products are the same for every store.
I'm trying to perform a process on every review in the whole dictionary. I can do this successfully in an iterative manner with the following code:
def mapAllValues(nestedDict, func):
    return {storeName: {product: func(prodFile) for product, prodFile in storeDict.items()}
            for storeName, storeDict in nestedDict.items()}

new_dictionary = mapAllValues(dictionary, lambda reviews: reviews.apply(processFunction))
# processFunction takes a list of strings and returns a list of tuples,
# so I end up with a new dictionary containing a list of tuples wherever there was a list of strings:
# {storeName : {product : 0 [(str, str), (str, str), ..., (str, str)] and so on...
It's a very long dictionary and takes ~606 seconds to complete.
So I've tried to implement a way to run it in parallel, but it's obviously not working as I expected, because it runs in ~2170 seconds. I do get the correct output, though.
My question is: what am I doing wrong in the code below? Can anyone provide me with a way to solve this?
manager = multiprocessing.Manager()
d = manager.dict(dictionary)
container = manager.dict()
for key in d:
    container[key] = manager.dict()
for key in d['storeNameA']:
    container['storeNameA'][key] = manager.dict()
for key in d['storeNameB']:
    container['storeNameB'][key] = manager.dict()

with multiprocessing.Pool() as pool:
    pool.starmap(processFunction, [('storeNameA', product, d, container) for product in d['storeNameA']], chunksize=round(42739 / multiprocessing.cpu_count()))
    pool.starmap(processFunction, [('storeNameB', product, d, container) for product in d['storeNameB']], chunksize=round(198560 / multiprocessing.cpu_count()))
new_dictionary = dict(container)
I'm sure I've misunderstood how this actually works, but as I see it, it should chunk each store's products and parallelize over them?
Anyway, I think I've explained it as best I can. If I need to clarify anything, please let me know.
Thanks in advance!
Firstly, while creating managers is relatively cheap, accessing them can become quite expensive if you don't know how they work. In short, they spawn a separate process and allow other processes to execute commands on any object stored inside that process. Those commands are read sequentially (the execution itself can be somewhat parallel, since they use threads internally).
Therefore, if two or more processes attempt to access a managed object (the dictionary, in this case) at the same time, one will block until the other's request has been read. So managers are not ideal (although they are very useful) in multiprocessing when the parallel processes need to access the managed object regularly (which I assume is the case with processFunction here), and some rethinking is definitely required.
Having said that, you don't even need to use a manager here. From the looks of it, processFunction is a localized function that doesn't care about the state of the whole dictionary. Therefore, rather than worrying about creating shared memory for the pool to access, you should only concern yourself with concatenating the pool's return values into the main dictionary from the main process itself (remember that a pool automatically passes the return value of each task back to the main process once it completes).
Here's one way you could do that, using a sample dictionary and processFunction, along with a benchmark comparing the speed against doing the same task serially:
from multiprocessing import Pool
import string, random, time


def review_generator(size=10):
    chars = string.ascii_uppercase + string.digits
    return ''.join(random.choice(chars) for _ in range(size))


def processFunc(product, prodFile):
    # Return a tuple of the product name and the altered value (a list of tuples)
    return product, [[(element, review_generator()) for element in review] for review in prodFile]


if __name__ == "__main__":
    # Generate example dictionary
    dictionary = {'storeNameA': {}, 'storeNameB': {}}
    for key, _ in dictionary.items():
        for prod_i in range(1000):
            prod = f'product{prod_i}'
            dictionary[key][prod] = [[review_generator() for _ in range(50)] for _ in range(5)]

    # Time the parallel approach
    t = time.time()
    with Pool() as pool:
        a = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()])
        b = pool.starmap(processFunc, [(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()])
    print(f"Parallel Approach took {time.time() - t}")

    # Time the serial approach
    t = time.time()
    a = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameA'].items()]
    b = [processFunc(product, prodFile) for product, prodFile in dictionary['storeNameB'].items()]
    print(f"Serial approach took {time.time() - t}")
Output
Parallel Approach took 1.5318272113800049
Serial approach took 5.765411615371704
Once you have the results for each store from the sample processFunction in a and b, you can then create the new dictionary in the main process itself:
new_dictionary = {'storeNameA': {}, 'storeNameB': {}}
for product, prodFile in a:
    new_dictionary['storeNameA'][product] = prodFile
for product, prodFile in b:
    new_dictionary['storeNameB'][product] = prodFile
I'd also encourage you to experiment with the different variants the pool offers for distributing tasks to its workers (e.g. imap), to see whether they suit your use case better and are more efficient.
Huge thanks to @Charchit for their answer; I've got this working. It now runs on my massive dataset in ~154 seconds, versus the ~606 seconds it took iteratively.
Here is the final code, very similar to @Charchit's answer above but with a few small changes:
import multiprocessing


def processFunction(product, listOfReviews):
    # This function handles every review for each product
    toReturn = []
    for review in listOfReviews:
        X = ...  # Do something here...
        toReturn.append(X)
    # X is now a list of tuples [(str, str), (str, str), ...]
    # toReturn is now a list of lists
    return product, toReturn


if __name__ == "__main__":
    original_dictionary = dict()
    # Where this would be the VERY large dictionary I have. See the structure in my original question.

    new_dictionary = dict()
    for key in original_dictionary:
        new_dictionary[key] = dict()
    for key in original_dictionary['storeNameA']:
        new_dictionary['storeNameA'][key] = list()
    for key in original_dictionary['storeNameB']:
        new_dictionary['storeNameB'][key] = list()

    with multiprocessing.Pool() as pool:
        a = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameA'].items()])
        b = pool.starmap(processFunction, [(product, reviews) for product, reviews in original_dictionary['storeNameB'].items()])

    for product, reviews in a:
        new_dictionary['storeNameA'][product] = reviews
    for product, reviews in b:
        new_dictionary['storeNameB'][product] = reviews
Thanks again, @Charchit!