Using multiprocessing in a Python script

Question

I am trying to label multiple images by brand -> product -> each product image. Since labeling the images one at a time takes a while, I decided to use multiprocessing to speed up the job. Multiprocessing definitely speeds up the labeling, but the code doesn't work the way I intended.

Code:

import json
import multiprocessing


def multiprocessing_func(line):
    json_line = json.loads(line)
    product = json_line['groupid']
    active_urls = set(json_line['urls'])

    try:
        active_urls.remove(brand_dic[brand])
    except:
        pass

    if product in saved_product_dict and active_urls == saved_product_dict[product]:
        keep_products.append(product)
        print('True')
    else:
        with open(new_images_filename, 'a') as save_file:
            labels = label_product_images(line)
            save_file.write('{}\n'.format(json.dumps(labels)))
        print('False')


active_images_filename = 'data/input/image_urls.json'
new_images_filename = 'data/output/new_labeled_images.json'
saved_images_filename = 'data/output/saved_labeled_images.json'

brand_dic = {'a': 'https://www.a.com/imgs/ab/images/dp/m.jpg',
             'b': 'https://www.b.com/imgs/ab/images/wcm/m.jpg',
             'c': 'https://www.c.com/imgs/ab/images/dp/m.jpg',}

if __name__ == '__main__':
    brands = ['a', 'b', 'c']
    for brand in brands:
        active_images_filename = 'data/input/brands/' + brand + '/image_urls.json'
        new_images_filename = 'data/output/brands/' + brand + '/new_labeled_images.json'
        saved_images_filename = 'data/output/brands/' + brand + '/saved_labeled_images.json'

        print(new_images_filename)
        with open(new_images_filename, 'w'): pass

        saved_product_dict = {}
        with open(saved_images_filename) as in_file:
            for line in in_file:
                json_line = json.loads(line)
                saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]
                saved_product_dict[json_line['groupid']] = set(saved_urls)

        print(saved_product_dict)
        keep_products = []
        labels_list = []
        with open(active_images_filename, 'r') as in_file:
            processes = []
            for line in in_file:
                p = multiprocessing.Process(target=multiprocessing_func, args=(line,))
                processes.append(p)
                p.start()

        print('complete stage 1')

    for i in range(0,2):
        print('running stage 2')

Output:

data/output/brands/mg/new_labeled_images.json
{}
complete stage 1
running stage 2
running stage 2
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202010/0027/anchor-hope-and-protect-necklace-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202007/0003/patterned-folded-notecards-set-of-25-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202005/0003/patterned-folded-notecards-set-of-25-t.jpg
silo : https://a/mgimgs/rk/images/dp/wcm/202007/0002/patterned-folded-notecards-set-of-25-1-m.jpg
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0013.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0002.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0003.jpg
False
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0022.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202019/454.jpg
False
lifestyle - Lif1 : https://a.com/mgimgs/rk/images/dp/wcm/202025/0011.jpg
False
False

I noticed that the multiprocessing step runs last and skips code, and I'm not sure why. I'm also not sure why the first part didn't run: when I printed saved_product_dict, the dictionary came up empty.

The code I wrote before and after the multiprocessing step all runs before it. My question is: how do I force the multiprocessing step to run in the order in which I wrote my code? Any explanation of what's going on would be greatly appreciated. I'm new to multiprocessing and still learning how it works.

Answer 1

This line seems to be wrong. Try replacing it:

saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]

with:

saved_urls = [url for url in json_line['urls']]

This might be the solution to the first part of your question.
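To see why the nested comprehension can go wrong, here is a minimal sketch. It assumes two hypothetical record layouts, since the question does not show what saved_labeled_images.json actually contains; the records, URLs, and helper names below are made up for illustration. If 'urls' is a flat list of strings, the nested form iterates over each string and yields single characters instead of URLs.

```python
# Hypothetical records: the real layout of the saved file is not shown in the question.
flat_record = {"groupid": "p1",
               "urls": ["https://www.a.com/x.jpg", "https://www.a.com/y.jpg"]}
nested_record = {"groupid": "p2",
                 "urls": [["https://www.a.com/x.jpg"], ["https://www.a.com/y.jpg"]]}


def flatten_nested(record):
    """The comprehension from the question: correct only for a list of lists."""
    return {url for urls_list in record["urls"] for url in urls_list}


def take_flat(record):
    """The suggested replacement: correct only for a flat list of strings."""
    return {url for url in record["urls"]}


if __name__ == "__main__":
    print(flatten_nested(nested_record))  # set of whole URLs
    print(take_flat(flat_record))         # set of whole URLs
    print(flatten_nested(flat_record))    # set of single characters
```

Which form is right depends on the file. Note also that an empty dictionary is what you would get if the saved file contains no lines at all, so that is worth checking first.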

About the print order of the multiprocessing part and the main process of the program: the print order is not always a reliable indicator of when functions or scripts actually ran in an asynchronous environment (here, several processes exist). If you want your script to run in a defined order, you need to either implement a synchronization mechanism using semaphores and mutexes, or wait for all processes to exit before moving on to stage 2 (for example, by calling join() on each Process), which I assume was your main concern.
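The "wait for all processes to exit" option could look roughly like the sketch below. It is not the asker's code: fake_label, worker, and run_stage_1 are hypothetical stand-ins for the real labeling work. It also assumes results should travel back to the parent through a multiprocessing.Queue, because a list such as keep_products that is appended to inside a child process is not updated in the parent.

```python
import multiprocessing


def fake_label(line):
    """Hypothetical stand-in for the real labeling work."""
    return line.strip().upper()


def worker(line, results):
    # Runs in a child process; sends its result back to the parent.
    results.put(fake_label(line))


def run_stage_1(lines):
    results = multiprocessing.Queue()
    processes = []
    for line in lines:
        p = multiprocessing.Process(target=worker, args=(line, results))
        processes.append(p)
        p.start()
    # Drain one result per worker first, then join, so no child is left
    # blocked on a queue that still holds unread data.
    labeled = [results.get() for _ in processes]
    for p in processes:
        p.join()  # block until this child has exited
    return sorted(labeled)


if __name__ == "__main__":
    print(run_stage_1(["a\n", "b\n", "c\n"]))
    print("complete stage 1")
    print("running stage 2")
```

With the join() calls in place, "complete stage 1" is printed only after every child has finished. Starting one process per input line can get expensive for large files; a multiprocessing.Pool with a fixed number of workers is a common alternative.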

Answer 1: accepted, 2020-07-22 18:06:51
