
concurrent connections in urllib3

How can I make multiple requests to various websites in a loop, through a proxy, using urllib3?

The code reads in a list of URLs and uses a for loop to connect to each site. However, it currently does not get past the first URL in the list. There is a proxy in place as well.

url_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
for url in url_list:
    http = ProxyManager("PROXY-PROXY")
    http_get = http.request('GET', url, preload_content=False).read().decode()

I have removed the URLs and proxy information from the code above. The first URL in the list runs fine, but after that nothing else happens; it just waits. I have tried the clear() method to reset the connection on each pass through the loop.

Unfortunately, urllib3 is synchronous and blocks. You could use it with threads, but that is a hassle and usually leads to more problems. The main approach these days is to use an asynchronous networking library; Twisted and asyncio (with aiohttp, perhaps) are the popular choices.

I'll provide an example using the trio framework and asks:

import asks
import trio
asks.init('trio')

path_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']

results = []
s = asks.Session()

async def grabber(path):
    r = await s.get(path)
    results.append(r)

async def main(path_list):
    async with trio.open_nursery() as n:
        for path in path_list:
            # start_soon takes the async function and its arguments separately,
            # rather than an already-created coroutine object
            n.start_soon(grabber, path)

trio.run(main, path_list)

Using threads is not really that much of a hassle since Python 3.2, when concurrent.futures was added:

from urllib3 import ProxyManager
from concurrent.futures import ThreadPoolExecutor, wait

url_list: list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']
thread_pool: ThreadPoolExecutor = ThreadPoolExecutor(max_workers=min(len(url_list), 20))
tasks = []

for url in url_list:
    def send_request(this_url: str = url) -> str:
        # bind the current url as a default argument so each task keeps its
        # own copy; a plain closure would see the loop variable's final value
        # could this assignment be removed from the loop?
        # I'd have to read the docs for ProxyManager but probably
        http: ProxyManager = ProxyManager("PROXY-PROXY")
        return http.request('GET', this_url, preload_content=False).read().decode()
    tasks.append(thread_pool.submit(send_request))

wait(tasks)
all_responses: list = [task.result() for task in tasks]
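As a usage note, the same fan-out can be written more compactly with Executor.map, which passes each URL as an argument and so avoids the closure-binding subtlety entirely. A minimal sketch, reusing the placeholder URLs and a placeholder proxy address (the real proxy URL needs an http:// or https:// scheme):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib3 import ProxyManager

url_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']  # placeholders
http = ProxyManager("http://PROXY-PROXY")  # placeholder; ProxyManager is thread-safe and can be shared

def send_request(url: str) -> str:
    # .data reads the whole body, equivalent to the read()/decode() above
    return http.request('GET', url).data.decode()

def fetch_all(urls: list) -> list:
    # map() submits one task per item and yields results in input order
    with ThreadPoolExecutor(max_workers=min(len(urls), 20)) as pool:
        return list(pool.map(send_request, urls))

# all_responses = fetch_all(url_list)
```

Sharing one ProxyManager also means one connection pool per host instead of a fresh pool per request.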

Later Python versions offer an event loop via asyncio. Issues I've had with asyncio are usually related to the portability of libraries (e.g. aiohttp, pydantic), many of which are not pure Python and have external libc dependencies. This can be an issue if you have to support a lot of Docker apps, which might use musl libc (Alpine) or glibc (everything else).
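If you want the asyncio event loop without taking on aiohttp as a dependency, one hedged sketch is to keep the blocking urllib3 calls but push them onto worker threads with asyncio.to_thread (Python 3.9+). The URLs and proxy address below are placeholders, as in the snippets above:

```python
import asyncio
from urllib3 import ProxyManager

url_list = ['https://URL1.com', 'http://URL2.com', 'http://URL3.com']  # placeholders
http = ProxyManager("http://PROXY-PROXY")  # placeholder; safe to share across threads

def fetch(url: str) -> str:
    # blocking urllib3 call; runs on a worker thread so the loop stays responsive
    return http.request('GET', url).data.decode()

async def gather_all(urls: list) -> list:
    # gather() preserves input order even though the requests run concurrently
    return await asyncio.gather(*(asyncio.to_thread(fetch, u) for u in urls))

# all_responses = asyncio.run(gather_all(url_list))
```

This is still thread-based under the hood, but it composes with other asyncio code and avoids any non-pure-Python HTTP stack.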

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address. Any question please contact: yoyou2525@163.com.
