Using multiprocessing to speed up website requests in Python

I am using the requests module to download the content of many websites, which looks something like this:

import requests
for i in range(1000):
    url = base_url + f"{i}.anything"
    r = requests.get(url)

Of course this is simplified, but basically the base URL is always the same; I only want to download an image, for example. This takes very long due to the number of iterations. The internet connection is not the problem, but rather the time it takes to start each request. So I was thinking about something like multiprocessing, because this task is basically always the same, and I imagine it would be a lot faster when parallelized.

Is this somehow doable? Thanks in advance!

I would suggest that in this case, lightweight threads would be better. When I ran the request on a certain URL 5 times, the result was:

Threads: Finished in 0.24 second(s)
MultiProcess: Finished in 0.77 second(s)

Your implementation can be something like this:

import concurrent.futures
import time

import requests
from bs4 import BeautifulSoup


def access_url(url, no):
    print(f"{no}:==> {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features='lxml')
    return "{} :  {}".format(no, str(soup.title)[7:50])  # strip the leading "<title>" tag


if __name__ == "__main__":
    test_url = "http://blabla.com/"  # placeholder - replace with a real URL
    base_url = test_url
    THREAD_MULTI_PROCESSING = True
    start = time.perf_counter()  # start the timer
    url_list = [base_url for i in range(5)]  # parameters as lists so map() can be used
    url_counter = [i for i in range(5)]
    if THREAD_MULTI_PROCESSING:
        with concurrent.futures.ThreadPoolExecutor() as executor:  # threads are better for I/O-bound work
            results = executor.map(access_url, url_list, url_counter)
        for result in results:
            print(result)
    end = time.perf_counter()  # stop the timer
    print(f'Threads: Finished in {round(end - start, 2)} second(s)')

    start = time.perf_counter()
    PROCESS_MULTI_PROCESSING = True
    if PROCESS_MULTI_PROCESSING:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
        for result in results:
            print(result)
    end = time.perf_counter()
    print(f'MultiProcess: Finished in {round(end - start, 2)} second(s)')

I think you will see better performance in your case.
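Applied back to the numbered-URL loop from the question, a threaded downloader might look like the sketch below. The base URL, the `.jpg` extension, and the worker count are assumptions for illustration, not taken from the question:

```python
import concurrent.futures

import requests

BASE_URL = "http://example.com/"  # assumption: substitute your real base URL


def make_url(base, i, ext="jpg"):
    """Build the numbered URL for item i, e.g. http://example.com/0.jpg."""
    return f"{base}{i}.{ext}"


def download(i):
    """Fetch one numbered file, save it to disk, and return the index and HTTP status."""
    url = make_url(BASE_URL, i)
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    with open(f"{i}.jpg", "wb") as f:
        f.write(r.content)  # r.content holds the raw bytes of the response body
    return i, r.status_code


def download_all(n, workers=16):
    """Download items 0..n-1 with a bounded pool of threads."""
    # max_workers caps the number of simultaneous connections to the server
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        for i, status in executor.map(download, range(n)):
            print(f"{i}: HTTP {status}")
```

Calling `download_all(1000)` then replaces the sequential loop; the threads overlap the per-request startup latency that dominates the original version, while `max_workers` keeps the server from being hit with a thousand connections at once.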
