
How to use multiprocessing with requests module?

I am a new developer in Python. My code is below:

import warnings
import requests
import multiprocessing

from colorama import init
init(autoreset=True)

from requests.packages.urllib3.exceptions import InsecureRequestWarning
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter('ignore', InsecureRequestWarning)

from bs4 import BeautifulSoup as BS

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


class Worker(multiprocessing.Process):

    def run(self):
        with open('ips.txt', 'r') as urls:
            for url in urls.readlines():
                req = url.strip()
                try:
                    page = requests.get(req, headers=headers, verify=False, allow_redirects=False, stream=True,
                                        timeout=10)
                    soup = BS(page.text)
                    # string = string.encode('ascii', 'ignore')
                    print('\033[32m' + req + ' - Title: ', soup.title)
                except requests.RequestException as e:
                    print('\033[32m' + req + ' - TimeOut!')
        return


if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = Worker()
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()

I am trying to make the program read ips.txt and print out the title of each website.

It works flawlessly in a single thread. Now I want to make it faster by using multiprocessing.

But for some reason it just outputs the same lines 5 times. I am new to multiprocessing and my attempts so far have failed.

Screenshot showing the problem: [screenshot of the repeated output omitted]

I just want to run 5 workers to check ips.txt in parallel, to make it faster.

Any hint, clue, or help?

Issue

The primary issue in your code is that each Worker opens ips.txt from scratch and works on each URL found in ips.txt. Thus the five workers together open ips.txt five times and work on each URL five times.

Solution

The right way to solve this problem is to split the code into a master and workers. You already have most of the worker code implemented. Let us treat the main section (under if __name__ == '__main__':) as the master for now.

Now the master is supposed to launch five workers and send work to them via a queue (multiprocessing.Queue).

The multiprocessing.Queue class offers a way for multiple producers to put data into it and multiple consumers to read data from it without running into race conditions. It implements all the locking semantics necessary to exchange data safely in a multiprocessing context.
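
As a minimal illustration of the put/get semantics described above (this tiny sketch is only for demonstration and is not part of the fix; the URL is just a placeholder value), consider:

import multiprocessing

q = multiprocessing.Queue()
q.put('https://example.com/')   # producer side: enqueue an item
q.put(None)                     # a sentinel value can be enqueued the same way
print(q.get())                  # consumer side: blocks until an item is available
print(q.get())                  # prints None, the sentinel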

Fixed Code

Here is how your code could be rewritten as per what I have described above:

import warnings
import requests
import multiprocessing

from colorama import init
init(autoreset=True)

from requests.packages.urllib3.exceptions import InsecureRequestWarning
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter('ignore', InsecureRequestWarning)

from bs4 import BeautifulSoup as BS

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


class Worker(multiprocessing.Process):

    def __init__(self, job_queue):
        super().__init__()
        self._job_queue = job_queue

    def run(self):
        while True:
            url = self._job_queue.get()
            if url is None:
                break

            req = url.strip()

            try:
                page = requests.get(req, headers=headers, verify=False, allow_redirects=False, stream=True,
                                    timeout=10)
                soup = BS(page.text)
                # string = string.encode('ascii', 'ignore')
                print('\033[32m' + req + ' - Title: ', soup.title)
            except requests.RequestException as e:
                print('\033[32m' + req + ' - TimeOut!')


if __name__ == '__main__':
    jobs = []
    job_queue = multiprocessing.Queue()

    for i in range(5):
        p = Worker(job_queue)
        jobs.append(p)
        p.start()

    # This is the master code that feeds URLs into queue.
    with open('ips.txt', 'r') as urls:
        for url in urls.readlines():
            job_queue.put(url)

    # Send None for each worker to check and quit.
    for j in jobs:
        job_queue.put(None)

    for j in jobs:
        j.join()

We can see in the above code that the master opens ips.txt once, reads the URLs from it one by one, and puts them into the queue. Each worker waits for a URL to arrive on this queue. As soon as a URL arrives on the queue, one of the workers picks it up and gets busy. If there are more URLs in the queue, the next free worker picks up the next one, and so on.

Finally, we need some way for the workers to quit when all the work is done. There are several ways to achieve this. In this example, I have chosen the simple strategy of sending five sentinel values (five None values in this case) into the queue, one for each worker, so that each worker can pick one up and quit.

There is another strategy where the workers and the master share a multiprocessing.Event object, just like they share a multiprocessing.Queue object right now. The master invokes the set() method of this object whenever it wants the workers to quit. The workers periodically check whether this object is set (via its is_set() method) and quit when it is. However, this introduces some additional complexity into the code. I have discussed this below.

For the sake of completeness, and also for the sake of demonstrating minimal, complete, and verifiable examples, I am presenting two code examples below that show both stopping strategies.

Using a Sentinel Value to Stop Workers

This is pretty much what I have described above so far, except that the code example has been simplified a lot to remove dependencies on any libraries outside the Python standard library.

Another thing worth noting in the example below is that, instead of creating a worker class, we use a worker function and create a Process out of it. This type of code is often found in the Python documentation and it is quite idiomatic.

import multiprocessing
import time
import random


def worker(input_queue):
    while True:
        url = input_queue.get()

        if url is None:
            break

        print('Started working on:', url)

        # Random delay to simulate fake processing.
        time.sleep(random.randint(1, 3))

        print('Stopped working on:', url)


def master():
    urls = [
        'https://example.com/',
        'https://example.org/',
        'https://example.net/',
        'https://stackoverflow.com/',
        'https://www.python.org/',
        'https://github.com/',
        'https://susam.in/',
    ]

    input_queue = multiprocessing.Queue()
    workers = []

    # Create workers.
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(input_queue, ))
        workers.append(p)
        p.start()

    # Distribute work.
    for url in urls:
        input_queue.put(url)

    # Ask the workers to quit.
    for w in workers:
        input_queue.put(None)

    # Wait for workers to quit.
    for w in workers:
        w.join()

    print('Done')


if __name__ == '__main__':
    master()

Using an Event to Stop Workers

Using a multiprocessing.Event object to signal when the workers should quit introduces some complexity into the code. There are primarily four changes that have to be made:

  • In the master, we invoke the set() method on the Event object to signal that the workers should quit as soon as possible.
  • In the worker, we invoke the is_set() method of the Event object periodically to check whether it should quit.
  • In the master, we need to use multiprocessing.JoinableQueue instead of multiprocessing.Queue so that it can test whether the queue has been consumed completely by the workers before it asks the workers to quit.
  • In the worker, we need to invoke the task_done() method of the queue after every item from the queue is consumed. This is necessary for the master to be able to invoke the join() method of the queue to test whether it has been emptied.

All of these changes can be found in the code below:

import multiprocessing
import time
import random
import queue


def worker(input_queue, stop_event):
    while not stop_event.is_set():
        try:
            # Check if any URL has arrived in the input queue. If not,
            # loop back and try again.
            url = input_queue.get(True, 1)
            input_queue.task_done()
        except queue.Empty:
            continue

        print('Started working on:', url)

        # Random delay to simulate fake processing.
        time.sleep(random.randint(1, 3))

        print('Stopped working on:', url)


def master():
    urls = [
        'https://example.com/',
        'https://example.org/',
        'https://example.net/',
        'https://stackoverflow.com/',
        'https://www.python.org/',
        'https://github.com/',
        'https://susam.in/',
    ]

    input_queue = multiprocessing.JoinableQueue()
    stop_event = multiprocessing.Event()

    workers = []

    # Create workers.
    for i in range(5):
        p = multiprocessing.Process(target=worker,
                                    args=(input_queue, stop_event))
        workers.append(p)
        p.start()

    # Distribute work.
    for url in urls:
        input_queue.put(url)

    # Wait for the queue to be consumed.
    input_queue.join()

    # Ask the workers to quit.
    stop_event.set()

    # Wait for workers to quit.
    for w in workers:
        w.join()

    print('Done')


if __name__ == '__main__':
    master()
