
concurrent connections in urllib3

Using a loop to make multiple requests to various websites, how can this be done with a proxy in urllib3?

The code reads in a tuple of URLs and uses a for loop to connect to each site; however, it currently does not connect past the first URL in the tuple. There is a proxy in place as well.

from urllib3 import ProxyManager

url_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
for i in url_list:
    http = ProxyManager("PROXY-PROXY")
    http_get = http.request('GET', i, preload_content=False).read().decode()

I have removed the URLs and proxy information from the above code. The first URL in the tuple runs fine, but after that nothing else happens; it just waits. I have tried the clear() method to reset the connection on each pass through the loop.

Unfortunately, urllib3 is synchronous and blocking. You could use it with threads, but that is a hassle and usually leads to more problems. The main approach these days is to use an asynchronous networking library; Twisted and asyncio (perhaps with aiohttp) are the popular packages.
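For example, here is a minimal sketch with asyncio and aiohttp; the proxy URL and target URLs are placeholders as above, and it assumes aiohttp is installed (aiohttp accepts an HTTP proxy per request via its proxy argument):

import asyncio
import aiohttp

urls = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']

async def fetch(session, url):
    # per-request HTTP proxy via aiohttp's proxy= argument
    async with session.get(url, proxy='http://PROXY-PROXY') as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        # schedule all requests concurrently and collect the bodies
        return await asyncio.gather(*(fetch(session, u) for u in urls))

responses = asyncio.run(main())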

I'll provide an example using the trio framework and asks:

import asks
import trio
asks.init('trio')

path_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']

results = []

async def grabber(path):
    r = await s.get(path)
    results.append(r)

async def main(path_list):
    async with trio.open_nursery() as n:
        for path in path_list:
            # pass the async function and its argument separately;
            # start_soon is the current name for the old nursery.spawn
            n.start_soon(grabber, path)

s = asks.Session()
trio.run(main, path_list)

Using threads is not really that much of a hassle since 3.2, when concurrent.futures was added:

from urllib3 import ProxyManager
from concurrent.futures import ThreadPoolExecutor, wait

url_list: list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
thread_pool: ThreadPoolExecutor = ThreadPoolExecutor(max_workers=min(len(url_list), 20))
tasks = []

def send_request(this_url: str) -> str:
    # a ProxyManager per request works, but urllib3's managers are
    # thread-safe, so a single shared instance could be hoisted out here
    http: ProxyManager = ProxyManager("PROXY-PROXY")
    return http.request('GET', this_url, preload_content=False).read().decode()

for url in url_list:
    # pass url as an argument so each task binds its own value instead of
    # closing over the loop variable, which keeps changing
    tasks.append(thread_pool.submit(send_request, url))

wait(tasks)
all_responses: list = [task.result() for task in tasks]

Later versions offer an event loop via asyncio. Issues I've had with asyncio are usually related to the portability of libraries (e.g. aiohttp via pydantic), most of which are not pure Python and have external libc dependencies. This can be an issue if you have to support a lot of docker apps, which might use musl-libc (alpine) or glibc (everyone else).
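If that portability matters, a pure-stdlib alternative is to keep urllib3 for the requests and let asyncio dispatch the blocking calls to threads. A minimal sketch, assuming Python 3.9+ for asyncio.to_thread (proxy and URLs are placeholders as above):

import asyncio
from urllib3 import ProxyManager

urls = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
http = ProxyManager("PROXY-PROXY")  # shared instance; urllib3 managers are thread-safe

def fetch(url: str) -> str:
    # plain blocking urllib3 call, unchanged from the synchronous version
    return http.request('GET', url, preload_content=False).read().decode()

async def main():
    # asyncio.to_thread runs each blocking call in the default thread
    # pool, so no third-party packages are involved
    return await asyncio.gather(*(asyncio.to_thread(fetch, u) for u in urls))

responses = asyncio.run(main())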
