Running multiple functions that make HTTP requests in parallel

Question

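I am working with a script that automatically scrapes historical data from several websites and, for each past date in a specified date range, saves it to the same Excel file. Each function accesses several web pages from a different website, formats the data, and writes it to the file on a separate sheet. Because I am continuously making requests to these sites, I make sure to add ample sleep time between requests. Instead of running these functions one after another, is there a way I could run them together?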

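I want to make one request with Function 1, then one request with Function 2, and so on, until every function has made one request. After all functions have made a request, I would like it to loop back and complete the second request within each function (and so on) until all requests for a given date are complete. Doing this would allow the same amount of sleep time between requests to each website while cutting the code's run time by a large amount. One point to note is that each function makes a slightly different number of HTTP requests. For example, Function 1 may make 10 requests on a given date, while Function 2 makes 8, Function 3 makes 8, Function 4 makes 7, and Function 5 makes 10.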
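To make the intended schedule concrete, here is a minimal sketch of that round-robin order using generators. It is only an illustration, not the actual script: `site_requests` is a hypothetical stand-in that records a tuple where a real function would issue its HTTP request, and the request counts are the example numbers from the paragraph above.

```python
import time


def site_requests(name, n_requests, log):
    """Hypothetical stand-in for one scraping function: yields after each request."""
    for i in range(1, n_requests + 1):
        log.append((name, i))  # a real function would issue its HTTP request here
        yield


def round_robin(generators, pause=0.0):
    """Advance every generator by one request per pass until all are exhausted."""
    active = list(generators)
    while active:
        still_running = []
        for gen in active:
            try:
                next(gen)
                still_running.append(gen)
            except StopIteration:
                pass  # this function has no requests left for the date
        active = still_running
        if active:
            time.sleep(pause)  # one pause per pass, shared by every site


log = []
counts = {"Function1": 10, "Function2": 8, "Function3": 8, "Function4": 7, "Function5": 10}
round_robin([site_requests(name, n, log) for name, n in counts.items()])
```

Functions with fewer requests simply drop out of the rotation early, so the differing request counts need no special handling.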

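I have read up on this topic and read about multithreading, but I am unsure how to apply it to my particular scenario. If there is no way to do this, I could run each function as its own script and run them all at the same time, but then I would have to concatenate five different Excel files for each date, which is why I am trying to do it this way.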

import random
import time

import pandas as pd

start_date = 'YYYY-MM-DD'
end_date = 'YYYY-MM-DD'
idx = pd.date_range(start_date,end_date)
date_range = [d.strftime('%Y-%m-%d') for d in idx]
max_retries_min_sleeptime = 300
max_retries_max_sleeptime = 600
min_sleeptime = 150
max_sleeptime = 250
for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    # Each function scrapes one website and writes to its own sheet via writer
    Function1()
    Function2()
    Function3()
    Function4()
    Function5()
    writer.close()  # writer.save() was removed in pandas 2.0
    print('Date Complete: ' + date)
    time.sleep(random.randrange(min_sleeptime,max_sleeptime,1))

Answer 1

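Using Python 3.6

Here is a minimal example of making concurrent requests with aiohttp to get you started (see the aiohttp docs). This example runs 3 downloaders concurrently and appends each rsp to responses. I believe you will be able to adapt this idea to your problem.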

import asyncio

from aiohttp.client import ClientSession


async def downloader(session, iter_url, responses):
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if rsp.status != 200:
            continue  # <- Or raise an error
        responses.append(rsp)


async def run(urls, responses):
    async with ClientSession() as session:  # the session is an async context manager
        iter_url = iter(urls)
        await asyncio.gather(*[downloader(session, iter_url, responses) for _ in range(3)])


urls = [
    'https://stackoverflow.com/questions/tagged/python',
    'https://aiohttp.readthedocs.io/en/stable/',
    'https://docs.python.org/3/library/asyncio.html'
]

responses = []

loop = asyncio.get_event_loop()
loop.run_until_complete(run(urls, responses))

Result:

>>> responses
[<ClientResponse(https://docs.python.org/3/library/asyncio.html) [200 OK]>
<CIMultiDictProxy('Server': 'nginx', 'Content-Type': 'text/html', 'Last-Modified': 'Sun, 28 Jan 2018 05:08:54 GMT', 'ETag': '"5a6d5ae6-6eae"', 'X-Clacks-Overhead': 'GNU Terry Pratchett', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains; preload', 'Via': '1.1 varnish', 'Fastly-Debug-Digest': '79eb68156ce083411371cd4dbd0cb190201edfeb12e5d1a8a1e273cc2c8d0e41', 'Content-Length': '28334', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '66775', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2140-IAD, cache-mel6520-MEL', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 1', 'X-Timer': 'S1517183297.337465,VS0,VE1')>
, <ClientResponse(https://stackoverflow.com/questions/tagged/python) [200 OK]>
<CIMultiDictProxy('Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'SAMEORIGIN', 'X-Request-Guid': '3fb98f74-2a89-497d-8d43-322f9a202775', 'Strict-Transport-Security': 'max-age=15552000', 'Content-Length': '23775', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '0', 'Connection': 'keep-alive', 'X-Served-By': 'cache-mel6520-MEL', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1517183297.107658,VS0,VE265', 'Vary': 'Accept-Encoding,Fastly-SSL', 'X-DNS-Prefetch-Control': 'off', 'Set-Cookie': 'prov=8edb36d8-8c63-bdd5-8d56-19bf14916c93; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly', 'Cache-Control': 'private')>
, <ClientResponse(https://aiohttp.readthedocs.io/en/stable/) [200 OK]>
<CIMultiDictProxy('Server': 'nginx/1.10.3 (Ubuntu)', 'Date': 'Sun, 28 Jan 2018 23:48:18 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Wed, 17 Jan 2018 08:45:22 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'ETag': 'W/"5a5f0d22-578a"', 'X-Subdomain-TryFiles': 'True', 'X-Served': 'Nginx', 'X-Deity': 'web01', 'Content-Encoding': 'gzip')>
]
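One possible way to adapt this idea to the question is to give each website its own coroutine that pauses between its own requests, and to run all of them with asyncio.gather. The sketch below is only an illustration under assumptions: `fetch` is a placeholder for a real call such as `await session.get(url)`, the site names and example URLs are made up, and `asyncio.run` requires Python 3.7 or later.

```python
import asyncio


async def fetch(url):
    """Placeholder for a real request such as `await session.get(url)`."""
    await asyncio.sleep(0)  # hands control to the event loop, as real I/O would
    return "response for " + url


async def scrape_site(name, urls, delay, results):
    """Request one site's pages in order, pausing between requests to that site."""
    pages = []
    for i, url in enumerate(urls):
        if i:
            await asyncio.sleep(delay)  # non-blocking: other sites keep working
        pages.append(await fetch(url))
    results[name] = pages


async def scrape_all(sites, delay):
    """Run one scrape_site coroutine per website concurrently."""
    results = {}
    await asyncio.gather(
        *(scrape_site(name, urls, delay, results) for name, urls in sites.items())
    )
    return results


sites = {
    "site1": ["https://site1.example/a", "https://site1.example/b"],
    "site2": ["https://site2.example/a"],
}
results = asyncio.run(scrape_all(sites, delay=0.01))
```

Because `asyncio.sleep` does not block the event loop, each site still gets its full pause between requests while the other sites proceed during that time.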

Answer 2

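Here is a small example showing how to use concurrent.futures for parallel processing. It does not include the actual scraping logic, since you can add that yourself as needed, but it demonstrates the pattern to follow: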

from concurrent import futures
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def scrape_func(*args, **kwargs):
    """ Stub function to use with futures - your scraping logic """
    print("Do something in parallel")
    return "result scraped"

def main():
    start_date = 'YYYY-MM-DD'
    end_date = 'YYYY-MM-DD'
    idx = pd.date_range(start_date,end_date)
    date_range = [d.strftime('%Y-%m-%d') for d in idx]
    max_retries_min_sleeptime = 300
    max_retries_max_sleeptime = 600
    min_sleeptime = 150
    max_sleeptime = 250

    # The important part - concurrent futures 
    # - set number of workers as the number of jobs to process

    with ThreadPoolExecutor(len(date_range)) as executor:
        # Use list jobs for concurrent futures
        # Use list scraped_results for results
        jobs = []
        scraped_results = []

        for date in date_range:
            # Pass some keyword arguments if needed - per job    
            kw = {"some_param": "value"}

            # Here we iterate 'number of dates' times, could be different
            # We're adding scrape_func, could be different function per call
            jobs.append(executor.submit(scrape_func, **kw))

        # Collect results as each parallel job completes
        for job in futures.as_completed(jobs):
            # Read result from future
            scraped_result = job.result()
            # Append to the list of results
            scraped_results.append(scraped_result)

        # Iterate over results scraped and do whatever is needed
        for result in scraped_results:
            print("Do something with me {}".format(result))


if __name__=="__main__":
    main()

As mentioned, this is only meant to demonstrate the pattern to follow; the rest should be straightforward.
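As a rough sketch of how the same pattern might map onto the question's five functions, the example below submits one job per website for a single date and collects the rows by sheet name, so that only the main thread needs to write the Excel file. The `make_scraper` helper, the sheet names, and the row tuples are hypothetical stand-ins for the real scraping functions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_scraper(sheet_name, n_requests):
    """Build a hypothetical stand-in for one of the five scraping functions."""
    def scraper(date):
        # A real function would request n_requests pages here, sleeping between them.
        rows = [(date, sheet_name, i) for i in range(n_requests)]
        return sheet_name, rows
    return scraper


scrapers = [
    make_scraper("Sheet{}".format(i), n)
    for i, n in enumerate([10, 8, 8, 7, 10], start=1)
]


def scrape_date(date):
    """Run every scraper for one date in its own thread and gather rows by sheet."""
    sheets = {}
    with ThreadPoolExecutor(max_workers=len(scrapers)) as executor:
        jobs = [executor.submit(scraper, date) for scraper in scrapers]
        for job in as_completed(jobs):
            sheet_name, rows = job.result()  # re-raises any exception from the worker
            sheets[sheet_name] = rows
    return sheets


sheets = scrape_date("2018-01-28")
# Only the main thread touches the output file from here on.
```

Returning data from the workers and writing it afterwards avoids having several threads share one writer object.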

Answer 3

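Thank you all for the responses! It turns out that a very simple block of code from another question (making 2 functions run at the same time) seems to do what I want.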

from threading import Thread

def func1():
    print('Working')

def func2():
    print('Working')

if __name__ == '__main__':
    # Start both functions; each runs in its own thread
    Thread(target = func1).start()
    Thread(target = func2).start()
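One caveat with this pattern: `start()` returns immediately, so any code that follows (for example, saving the workbook) may run before the functions have finished. A minimal sketch of waiting for the threads with `join()` is shown below; the `results` dictionary is an assumption added for illustration, standing in for whatever each function produces.

```python
from threading import Thread

results = {}


def func1():
    results["func1"] = "Working"  # stand-in for the first scraping function


def func2():
    results["func2"] = "Working"  # stand-in for the second scraping function


threads = [Thread(target=func1), Thread(target=func2)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until that thread has finished

# Safe point to save the file: both functions are done.
```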
