
threading: function seems to run as a blocking loop although I am using threading

I am trying to speed up web scraping by running my http requests in a ThreadPoolExecutor from the concurrent.futures library.

Here is the code:

import concurrent.futures
import requests
from bs4 import BeautifulSoup


urls = [
        'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=ibfxcfd&showcategories=CFD',
        'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=chix_ca',
        'https://www.interactivebrokers.eu/en/index.php?f=41634&exch=tase',
        'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=chixen-be&showcategories=STK',
        'https://www.interactivebrokers.eu/en/index.php?f=41295&exch=bvme&showcategories=STK'
        ]

def get_url(url):
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    a = soup.select_one('a')
    print(a)


with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    results = {executor.submit( get_url(url)) : url for url in urls}

    for future in concurrent.futures.as_completed(results):
        try:
            future.result()  # re-raises any exception from the worker
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)

However, looking at how the script prints in the CLI, it seems that the requests are sent in a blocking loop.

Additionally, if I run the code using the plain loop below, I can see that it takes roughly the same time.

for u in urls:
    get_url(u)
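To quantify the comparison, both versions can be timed. A minimal sketch, with `get_url` replaced by a `time.sleep` stand-in (an assumption, just to make the timing reproducible without network access):

```python
import concurrent.futures
import time

def fake_get_url(url):
    # stand-in for get_url: simulates a slow HTTP request
    time.sleep(0.2)

fake_urls = list(range(5))

# sequential version: the sleeps run back-to-back
start = time.perf_counter()
for u in fake_urls:
    fake_get_url(u)
sequential = time.perf_counter() - start

# threaded version: the sleeps overlap
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    list(executor.map(fake_get_url, fake_urls))
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s  threaded: {threaded:.2f}s")
```

If the executor is being used correctly, the threaded time should be close to one sleep rather than the sum of all of them.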

I have had some success implementing concurrency with this library before, and I am at a loss as to what is going wrong here.

I am aware of the asyncio library as an alternative, but I would prefer to stick with threading.

You're not actually running your get_url calls as tasks; you call them in the main thread and pass their results to executor.submit, which is the concurrent.futures analog of the same mistake made with raw threading.Thread usage. Change:

results = {executor.submit( get_url(url)) : url for url in urls}

to:

results = {executor.submit(get_url, url) : url for url in urls}

so you pass the function to call and its arguments to the submit call (which then runs them in threads for you) and it should parallelize your code.
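Putting the fix together, here is a minimal runnable sketch of the corrected pattern. The network call is replaced by a `time.sleep` stand-in (the 0.5-second delay and placeholder URLs are assumptions for illustration) so it runs without network access:

```python
import concurrent.futures
import time

def get_url(url):
    # stand-in for the real scraper: pretend the request takes 0.5s
    time.sleep(0.5)
    return url

urls = ['url-1', 'url-2', 'url-3', 'url-4']

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    # pass the callable and its argument separately --
    # submit() then invokes get_url(url) in a worker thread
    results = {executor.submit(get_url, url): url for url in urls}

    for future in concurrent.futures.as_completed(results):
        try:
            print(future.result())  # re-raises any exception from the worker
        except Exception as exc:
            print('ERROR for symbol:', results[future])
            print(exc)
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s")  # ~0.5s rather than ~2s, since the four calls overlap
```

Note that exceptions raised inside a worker are stored on the Future and only surface when you call `future.result()`, which is why the except block now has something to catch.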

