
How to implement Producer Consumer with multiprocessing?

I have a program where I need to download files from some source and upload them, but I need to make sure that there are at most 10 files in the download location at any time. Is there a way to use Managers() as well?

It sounded like a typical producer-consumer problem. Below is my implementation:

from multiprocessing import Process, Queue, Lock
import requests
import json
import shutil
import os
import time
import random
import warnings
warnings.filterwarnings("ignore")  # silence the InsecureRequestWarning raised by verify=False

sha_list = [line.strip() for line in open("ShaList")]


def save_file_from_sofa(sha1):
    r = requests.get("https://DOWNLOAD_URL/{}".format(sha1), verify=False, stream=True)
    with open(sha1, 'wb') as handle:
        shutil.copyfileobj(r.raw, handle)


def mock_upload():
    time.sleep(random.randint(10, 16))  # simulate a slow upload


def producer(queue, lock):
    with lock:
        print("Starting Producer {}".format(os.getpid()))

    while sha_list:
        if not queue.full():
            sha1 = sha_list.pop()
            save_file_from_sofa(sha1)
            queue.put(sha1)


def consumer(queue, lock):
    with lock:
        print("Starting Consumer {}".format(os.getpid()))

    while True:
        sha1 = queue.get()
        mock_upload()
        with lock:
            print("{} GOT {}".format(os.getpid(), sha1))

if __name__ == "__main__":
    queue = Queue(5)
    lock = Lock()

    producers = [Process(target=producer, args=(queue, lock)) for _ in range(2)]
    consumers = []

    for _ in range(3):
        p = Process(target=consumer, args=(queue, lock))
        p.daemon = True  # daemonize consumers so they do not block program exit
        consumers.append(p)

    for p in producers:
        p.start()
    for c in consumers:
        c.start()

    for p in producers:
        p.join()

    print("DONE")

But it does not do what is expected, as you can see from the output below:

Starting Producer 623
Starting Producer 624
Starting Consumer 626
Starting Consumer 625
Starting Consumer 627
626 GOT 4ff551490d6b2eec7c6c0470f4b092fdc34cd521
625 GOT 83a53a3400fc83f2b02135ba0cc6c8625ecc7dc4
627 GOT 4ff551490d6b2eec7c6c0470f4b092fdc34cd521
626 GOT 83a53a3400fc83f2b02135ba0cc6c8625ecc7dc4
625 GOT 4e7132301ce9d61445db07910ff90a64474e6a88
626 GOT 0efbd413d733b3903e6dee777ace5ef47a2ec144
627 GOT 4e7132301ce9d61445db07910ff90a64474e6a88
625 GOT 0efbd413d733b3903e6dee777ace5ef47a2ec144
626 GOT 0a3fc4bdd56fa2bf52f5f43277f3b4ee0f040937
625 GOT eb9c07329a8b5cb66e47f0dd8e56894707a84d94
627 GOT 0a3fc4bdd56fa2bf52f5f43277f3b4ee0f040937
626 GOT eb9c07329a8b5cb66e47f0dd8e56894707a84d94
DONE

As you can see, consumers pick up the same SHA1s multiple times. So I need to make sure that every SHA1 put in the queue by a producer is picked up by exactly one consumer.

PS: I had also thought of making this work with a pool. For the producer that would be fine, since I already have the list of SHA1s to put in the queue; but in the consumer's case, what list would I map over, and how would I make sure the consumers actually stop?

Just use a pool from either multiprocessing.Pool or concurrent.futures. The pool allows you to set how many workers you want running at the same time, which means you will have at most max_workers files being downloaded at the same time.

As the download/upload is sequential (you cannot start an upload until the download is complete), you gain no value from running them in two separate threads/processes. Just join the two operations into a single job unit and then run multiple jobs concurrently.

Moreover, as long as you just need to download/upload files (IO-bound operations), you are better off using threads instead of processes, as they are more lightweight.

from concurrent.futures import ThreadPoolExecutor

list_of_sha1s = ['foobar', 'foobaz']

def worker(sha1):
    # one job unit: download the file, then upload that same file
    path = save_file_from_sofa(sha1)  # assumes the download function returns the saved file's path
    upload_file(path)                 # your real upload routine, in place of mock_upload()

    return sha1

with ThreadPoolExecutor(max_workers=10) as pool:
    # at most 10 jobs run at once, so at most 10 files are on disk at any time
    for sha1 in pool.map(worker, list_of_sha1s):
        print("Done SHA1: %s" % sha1)
