Multiprocessing with return values
I have a problem using multiprocessing to speed up some processing of files stored on S3 that need to be checked. Since I'm new to multiprocessing, I'm not sure what exactly is wrong with my code; it runs without issues when I just use a for loop.
```
import multiprocessing as mp
from tqdm import tqdm

def read_json(file):
    file_key = file["Key"]
    file_key_split = file_key.split("/")
    document = get_json_details(file_key)
    type = file_key_split[2]
    return document, type

document_list = []
document_type_list = []

mgr = mp.Manager()
nodes = mgr.list()
pool_size = mp.cpu_count()
pool = mp.Pool(processes=pool_size)
# mp.freeze_support()

for file in tqdm(get_all_s3_objects(s3, Bucket=docbucket, Prefix=prefix)):
    document_list, document_type_list = zip(*pool.map(read_json, file))

pool.close()
pool.join()
```
The error I get is the following:
"""
Traceback (most recent call last):
File "C:\Users\tobia\AppData\Local\Programs\Python\Python38\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Users\tobia\AppData\Local\Programs\Python\Python38\lib\multiprocessing\pool.py", line 48, in mapstar
return list(map(*args))
File "c:\GIT\BMWJPSI-BI\03_Lambda_Functions\RegoOCRCheck.py", line 118, in read_json
file_key = file["Key"]
TypeError: string indices must be integers
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:/GIT/BMWJPSI-BI/03_Lambda_Functions/RegoOCRCheck.py", line 151, in <module>
document_list, document_type_list = zip(pool.map(read_json, file))
File "C:\Users\tobia\AppData\Local\Programs\Python\Python38\lib\multiprocessing\pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\tobia\AppData\Local\Programs\Python\Python38\lib\multiprocessing\pool.py", line 771, in get
raise self._value
TypeError: string indices must be integers```
Thanks for your help.
Sorry for the delayed response. I think the issue is that you're passing a single dictionary object into `pool.map`, which then iterates over the dictionary's keys instead of passing the dictionary object itself to `read_json`. Instead of iterating over each individual file and calling `pool.map` inside the loop, try passing the entire `get_all_s3_objects(s3, Bucket=docbucket, Prefix=prefix)` iterable into `pool.map`; it will be iterated in parallel and return a list of tuples, where each tuple is `(document, type)`:

```
document_list, document_type_list = zip(*pool.map(read_json, get_all_s3_objects(s3, Bucket=docbucket, Prefix=prefix)))
```
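As a minimal sketch of why the original call failed: iterating over a dict yields its keys, so `map` hands `read_json` strings rather than whole S3 object dicts. The `s3_object` below is a made-up stand-in for one entry of the kind `get_all_s3_objects` would return:

```python
def get_key(obj):
    # Same access pattern as the first line of read_json
    return obj["Key"]

# Hypothetical S3 listing entry, shaped like one item from get_all_s3_objects
s3_object = {"Key": "prefix/folder/doctype/file.json", "Size": 123}

# Iterating over the dict yields its keys ("Key", "Size"), so get_key
# receives a string and obj["Key"] raises the TypeError from the traceback:
try:
    list(map(get_key, s3_object))
except TypeError as e:
    print(e)

# Mapping over a list of dicts passes each whole dict instead:
print(list(map(get_key, [s3_object])))  # ['prefix/folder/doctype/file.json']
```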
Let me know if you still run into any issues.
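On the return values themselves: the `zip(*...)` in the corrected line is what splits the list of `(document, type)` tuples returned by `pool.map` into two parallel sequences. A small sketch with made-up results:

```python
# Stand-in for what pool.map(read_json, ...) would return:
# one (document, type) tuple per S3 object.
results = [("doc-a", "invoice"), ("doc-b", "receipt"), ("doc-c", "invoice")]

# zip(*results) transposes the list of tuples into two tuples.
document_list, document_type_list = zip(*results)
print(document_list)       # ('doc-a', 'doc-b', 'doc-c')
print(document_type_list)  # ('invoice', 'receipt', 'invoice')
```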