
Running multiple functions that make HTTP requests in parallel

I'm working on a script that autonomously scrapes historical data from several websites and saves it to the same Excel file for each past date within a specified date range. Each individual function accesses several webpages on a different website, formats the data, and writes it to the file on separate sheets. Because I am continuously making requests to these sites, I make sure to add ample sleep time between requests. Instead of running these functions one after another, is there a way I could run them together?

I want to make one request with Function 1, then make one request with Function 2, and so on until all functions have made one request. After all functions have made a request, I would like it to loop back and complete the second request within each function (and so on) until all requests for a given date are complete. Doing this would allow the same amount of sleep time between requests on each website while decreasing the time the code would take to run by a large amount. One point to note is that each function makes a slightly different number of HTTP requests. For instance, Function 1 may make 10 requests on a given date while Function 2 makes 8 requests, Function 3 makes 8, Function 4 makes 7, and Function 5 makes 10.

I've read up on this topic, including multithreading, but I am unsure how to apply it to my specific scenario. If there is no way to do this, I could run each function as its own script and launch them at the same time, but then I would have to concatenate five different Excel files for each date, which is why I am trying to do it this way.

import random
import time

import pandas as pd

start_date = 'YYYY-MM-DD'
end_date = 'YYYY-MM-DD'
idx = pd.date_range(start_date, end_date)
date_range = [d.strftime('%Y-%m-%d') for d in idx]
max_retries_min_sleeptime = 300
max_retries_max_sleeptime = 600
min_sleeptime = 150
max_sleeptime = 250
for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    Function1()
    Function2()
    Function3()
    Function4()
    Function5()
    writer.save()
    print('Date Complete: ' + date)
    time.sleep(random.randrange(min_sleeptime,max_sleeptime,1))
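To make the ordering concrete, the interleaving I have in mind for a single date looks roughly like this, with each function's requests modelled as a list of zero-argument callables (this is just an illustrative sketch, not my actual code):

from itertools import zip_longest
import random
import time

def interleave_requests(per_function_requests, min_sleep, max_sleep):
    # Round-robin: take one pending request from each function, then loop
    # back until every function has finished its requests for the date.
    for round_of_requests in zip_longest(*per_function_requests):
        for make_request in round_of_requests:
            if make_request is None:
                continue  # this function has no requests left for this date
            make_request()  # perform a single HTTP request and handle the data
        # one pause per round keeps a roughly constant gap between
        # consecutive requests to the same site
        time.sleep(random.uniform(min_sleep, max_sleep))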

Using Python 3.6.

Here is a minimal example of concurrent requests with aiohttp to get you started (docs). This example runs three downloaders at the same time, appending each rsp to responses. I believe you will be able to adapt this idea to your problem.

import asyncio

from aiohttp.client import ClientSession


async def downloader(session, iter_url, responses):
    # Each downloader pulls the next URL from the shared iterator, so no URL
    # is fetched twice, and stops once the iterator is exhausted.
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if rsp.status != 200:
            continue  # <- or raise an error
        responses.append(rsp)


async def run(urls, responses):
    async with ClientSession() as session:
        iter_url = iter(urls)
        # Three downloaders share one iterator and run concurrently
        await asyncio.gather(*[downloader(session, iter_url, responses) for _ in range(3)])


urls = [
    'https://stackoverflow.com/questions/tagged/python',
    'https://aiohttp.readthedocs.io/en/stable/',
    'https://docs.python.org/3/library/asyncio.html'
]

responses = []

loop = asyncio.get_event_loop()
loop.run_until_complete(run(urls, responses))

Result:

>>> responses
[<ClientResponse(https://docs.python.org/3/library/asyncio.html) [200 OK]>
<CIMultiDictProxy('Server': 'nginx', 'Content-Type': 'text/html', 'Last-Modified': 'Sun, 28 Jan 2018 05:08:54 GMT', 'ETag': '"5a6d5ae6-6eae"', 'X-Clacks-Overhead': 'GNU Terry Pratchett', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains; preload', 'Via': '1.1 varnish', 'Fastly-Debug-Digest': '79eb68156ce083411371cd4dbd0cb190201edfeb12e5d1a8a1e273cc2c8d0e41', 'Content-Length': '28334', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '66775', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2140-IAD, cache-mel6520-MEL', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 1', 'X-Timer': 'S1517183297.337465,VS0,VE1')>
, <ClientResponse(https://stackoverflow.com/questions/tagged/python) [200 OK]>
<CIMultiDictProxy('Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'SAMEORIGIN', 'X-Request-Guid': '3fb98f74-2a89-497d-8d43-322f9a202775', 'Strict-Transport-Security': 'max-age=15552000', 'Content-Length': '23775', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '0', 'Connection': 'keep-alive', 'X-Served-By': 'cache-mel6520-MEL', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1517183297.107658,VS0,VE265', 'Vary': 'Accept-Encoding,Fastly-SSL', 'X-DNS-Prefetch-Control': 'off', 'Set-Cookie': 'prov=8edb36d8-8c63-bdd5-8d56-19bf14916c93; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly', 'Cache-Control': 'private')>
, <ClientResponse(https://aiohttp.readthedocs.io/en/stable/) [200 OK]>
<CIMultiDictProxy('Server': 'nginx/1.10.3 (Ubuntu)', 'Date': 'Sun, 28 Jan 2018 23:48:18 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Wed, 17 Jan 2018 08:45:22 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'ETag': 'W/"5a5f0d22-578a"', 'X-Subdomain-TryFiles': 'True', 'X-Served': 'Nginx', 'X-Deity': 'web01', 'Content-Encoding': 'gzip')>
]
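If you also need to keep a delay between consecutive requests to the same site, as in your current script, each downloader can simply await asyncio.sleep between its own requests; while one coroutine sleeps, the others keep working. A minimal variation of the downloader above (the sleep bounds are placeholders):

import asyncio
import random

async def polite_downloader(session, iter_url, responses, min_sleep, max_sleep):
    # Same as downloader above, but pauses between its own requests.
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if rsp.status == 200:
            responses.append(rsp)
        # Only this coroutine waits; the other downloaders keep running.
        await asyncio.sleep(random.uniform(min_sleep, max_sleep))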

Here is a minimal example demonstrating how to use concurrent.futures for parallel processing. It does not include the actual scraping logic (you can add that yourself, if needed), but it shows the pattern to follow:

from concurrent import futures
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def scrape_func(*args, **kwargs):
    """ Stub function to use with futures - your scraping logic """
    print("Do something in parallel")
    return "result scraped"

def main():
    start_date = 'YYYY-MM-DD'
    end_date = 'YYYY-MM-DD'
    idx = pd.date_range(start_date,end_date)
    date_range = [d.strftime('%Y-%m-%d') for d in idx]
    max_retries_min_sleeptime = 300
    max_retries_max_sleeptime = 600
    min_sleeptime = 150
    max_sleeptime = 250

    # The important part - concurrent futures 
    # - set number of workers as the number of jobs to process

    with ThreadPoolExecutor(len(date_range)) as executor:
        # Use list jobs for concurrent futures
        # Use list scraped_results for results
        jobs = []
        scraped_results = []

        for date in date_range:
            # Pass some keyword arguments if needed - per job    
            kw = {"some_param": "value"}

            # Here we iterate 'number of dates' times, could be different
            # We're adding scrape_func, could be different function per call
            jobs.append(executor.submit(scrape_func, **kw))

        # Once parallel processing is complete, iterate over results
        for job in futures.as_completed(jobs):
            # Read result from future
            scraped_result = job.result()
            # Append to the list of results
            scraped_results.append(scraped_result)

        # Iterate over results scraped and do whatever is needed
        for result in scraped_results:
            print("Do something with me {}".format(result))


if __name__=="__main__":
    main()

As mentioned, this is just to demonstrate the pattern to follow; the rest should be straightforward.
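Applied to the structure in the question, a sketch of how the per-date step might look (using the question's placeholder names Function1..Function5, pd and writer; note that if all five functions write to the same workbook you may need to serialise the actual writes, since the underlying Excel writers are generally not thread-safe):

def scrape_date(date):
    # One worker per site; leaving the 'with' block waits for all of them,
    # so writer.save() only runs once every function has finished.
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    scrapers = [Function1, Function2, Function3, Function4, Function5]
    with ThreadPoolExecutor(len(scrapers)) as executor:
        jobs = [executor.submit(func) for func in scrapers]
        for job in futures.as_completed(jobs):
            job.result()  # re-raise any exception from a worker thread
    writer.save()
    print('Date Complete: ' + date)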

Thanks for the responses, guys! As it turns out, a pretty simple block of code from this other question (Make 2 functions run at the same time) seems to do what I want.

from threading import Thread

def func1():
    print('Working')

def func2():
    print('Working')

if __name__ == '__main__':
    Thread(target=func1).start()
    Thread(target=func2).start()
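In my case the plan is to start one thread per function for each date and join them all before saving, so the workbook is only written once every site has been scraped. Roughly (same placeholder names as in my snippet above):

for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    threads = [Thread(target=func) for func in
               (Function1, Function2, Function3, Function4, Function5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every site's scraper before saving
    writer.save()
    print('Date Complete: ' + date)
    time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))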
