Running multiple functions that make HTTP requests in parallel
I am working with a script that automatically scrapes historical data from several websites and saves it to the same Excel file, one file per past date within a specified date range. Each function visits multiple web pages on a different website, formats the data, and writes it to the file on its own sheet. Because I am making repeated requests to these sites, I make sure to add plenty of sleep time between requests. Instead of running these functions one after another, is there a way I could run them together?

I would like to make one request with Function 1, then one request with Function 2, and so on until every function has made one request. Once all the functions have made a request, I would like it to loop back and complete the second request within each function (and so on) until all requests for a given date are complete. Doing this would keep the same amount of sleep time between requests on any one website while greatly reducing the overall runtime. One thing to note is that each function makes a slightly different number of HTTP requests. For instance, Function 1 might make 10 requests on a given date while Function 2 makes 8, Function 3 makes 8, Function 4 makes 7, and Function 5 makes 10.

I have read up on this topic and on multithreading, but I am unsure how to apply it to my specific scenario. If there is no way to do it, I could run each function as its own script simultaneously, but then I would have to concatenate five different Excel files for each date, which is why I am trying to do it this way.
```python
import random
import time

import pandas as pd

start_date = 'YYYY-MM-DD'
end_date = 'YYYY-MM-DD'
idx = pd.date_range(start_date, end_date)
date_range = [d.strftime('%Y-%m-%d') for d in idx]

max_retries_min_sleeptime = 300
max_retries_max_sleeptime = 600
min_sleeptime = 150
max_sleeptime = 250

for date in date_range:
    writer = pd.ExcelWriter('Daily Data -' + date + '.xlsx')
    Function1()
    Function2()
    Function3()
    Function4()
    Function5()
    writer.save()
    print('Date Complete: ' + date)
    time.sleep(random.randrange(min_sleeptime, max_sleeptime, 1))
```
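The interleaving described above (one request per site, then loop back, with sites making different numbers of requests) can be sketched as a plain round-robin scheduler. Here `request_lists` and its zero-argument callables are hypothetical stand-ins for each function's HTTP requests:

```python
import random
import time

def round_robin(request_lists, min_sleep, max_sleep):
    """Interleave requests round-robin across sites.

    request_lists: one list of zero-argument callables per site
    (hypothetical stand-ins for the real HTTP requests). Sites may
    contribute different numbers of requests, as in the question.
    """
    results = []
    iters = [iter(reqs) for reqs in request_lists]
    while iters:
        still_active = []
        for it in iters:
            try:
                req = next(it)
            except StopIteration:
                continue  # this site has no requests left for this date
            results.append(req())
            still_active.append(it)
            # Sleep between consecutive requests, as in the original loop
            time.sleep(random.uniform(min_sleep, max_sleep))
        iters = still_active
    return results
```

This keeps the per-site request spacing (each site is visited at most once per pass) without any threads, though the passes still run sequentially rather than truly in parallel.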
Using Python 3.6.

Here is a minimal example of concurrent requests with aiohttp to get you started (docs). This example runs 3 `downloader`s at the same time, appending each `rsp` to `responses`. I am sure you will be able to adapt this idea to your problem.
```python
import asyncio
from aiohttp.client import ClientSession

async def downloader(session, iter_url, responses):
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        rsp = await session.get(url)
        if not rsp.status == 200:
            continue  # <- Or raise error
        responses.append(rsp)

async def run(urls, responses):
    async with ClientSession() as session:
        iter_url = iter(urls)
        await asyncio.gather(*[downloader(session, iter_url, responses) for _ in range(3)])

urls = [
    'https://stackoverflow.com/questions/tagged/python',
    'https://aiohttp.readthedocs.io/en/stable/',
    'https://docs.python.org/3/library/asyncio.html',
]
responses = []
loop = asyncio.get_event_loop()
loop.run_until_complete(run(urls, responses))
```
Result:
```python
>>> responses
[<ClientResponse(https://docs.python.org/3/library/asyncio.html) [200 OK]>
<CIMultiDictProxy('Server': 'nginx', 'Content-Type': 'text/html', 'Last-Modified': 'Sun, 28 Jan 2018 05:08:54 GMT', 'ETag': '"5a6d5ae6-6eae"', 'X-Clacks-Overhead': 'GNU Terry Pratchett', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains; preload', 'Via': '1.1 varnish', 'Fastly-Debug-Digest': '79eb68156ce083411371cd4dbd0cb190201edfeb12e5d1a8a1e273cc2c8d0e41', 'Content-Length': '28334', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '66775', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2140-IAD, cache-mel6520-MEL', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 1', 'X-Timer': 'S1517183297.337465,VS0,VE1')>
, <ClientResponse(https://stackoverflow.com/questions/tagged/python) [200 OK]>
<CIMultiDictProxy('Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'SAMEORIGIN', 'X-Request-Guid': '3fb98f74-2a89-497d-8d43-322f9a202775', 'Strict-Transport-Security': 'max-age=15552000', 'Content-Length': '23775', 'Accept-Ranges': 'bytes', 'Date': 'Sun, 28 Jan 2018 23:48:17 GMT', 'Via': '1.1 varnish', 'Age': '0', 'Connection': 'keep-alive', 'X-Served-By': 'cache-mel6520-MEL', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1517183297.107658,VS0,VE265', 'Vary': 'Accept-Encoding,Fastly-SSL', 'X-DNS-Prefetch-Control': 'off', 'Set-Cookie': 'prov=8edb36d8-8c63-bdd5-8d56-19bf14916c93; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly', 'Cache-Control': 'private')>
, <ClientResponse(https://aiohttp.readthedocs.io/en/stable/) [200 OK]>
<CIMultiDictProxy('Server': 'nginx/1.10.3 (Ubuntu)', 'Date': 'Sun, 28 Jan 2018 23:48:18 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Wed, 17 Jan 2018 08:45:22 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'ETag': 'W/"5a5f0d22-578a"', 'X-Subdomain-TryFiles': 'True', 'X-Served': 'Nginx', 'X-Deity': 'web01', 'Content-Encoding': 'gzip')>
]
```
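Since the aiohttp snippet makes live HTTP requests, here is a network-free sketch of the same pattern (several workers pulling URLs from one shared iterator), where `fake_fetch` is a hypothetical stand-in for `session.get`:

```python
import asyncio

async def fake_fetch(url):
    # Hypothetical stand-in for session.get(url): yield to the
    # event loop once, then return a dummy body.
    await asyncio.sleep(0)
    return 'body of ' + url

async def worker(iter_url, responses):
    # Same structure as downloader() above: every worker pulls from
    # the one shared iterator until it is exhausted.
    while True:
        try:
            url = next(iter_url)
        except StopIteration:
            return
        responses.append(await fake_fetch(url))

async def run(urls, responses, n_workers=3):
    iter_url = iter(urls)
    await asyncio.gather(*[worker(iter_url, responses) for _ in range(n_workers)])

urls = ['u1', 'u2', 'u3', 'u4', 'u5']
responses = []
asyncio.run(run(urls, responses))
```

Sharing one iterator caps concurrency at `n_workers` in-flight requests at a time, which is also where per-site politeness delays (`await asyncio.sleep(...)` inside the worker) would go.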
Here is a small example to illustrate how `concurrent.futures` can be used for parallel processing. It does not include the actual scraping logic, since you can add that yourself as needed, but it demonstrates the pattern to follow:
```python
from concurrent import futures
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def scrape_func(*args, **kwargs):
    """Stub function to use with futures - your scraping logic"""
    print("Do something in parallel")
    return "result scraped"

def main():
    start_date = 'YYYY-MM-DD'
    end_date = 'YYYY-MM-DD'
    idx = pd.date_range(start_date, end_date)
    date_range = [d.strftime('%Y-%m-%d') for d in idx]
    max_retries_min_sleeptime = 300
    max_retries_max_sleeptime = 600
    min_sleeptime = 150
    max_sleeptime = 250

    # The important part - concurrent futures
    # - set number of workers as the number of jobs to process
    with ThreadPoolExecutor(len(date_range)) as executor:
        # Use list jobs for concurrent futures
        # Use list scraped_results for results
        jobs = []
        scraped_results = []
        for date in date_range:
            # Pass some keyword arguments if needed - per job
            kw = {"some_param": "value"}
            # Here we iterate 'number of dates' times, could be different
            # We're adding scrape_func, could be a different function per call
            jobs.append(executor.submit(scrape_func, **kw))
        # Once parallel processing is complete, iterate over results
        for job in futures.as_completed(jobs):
            # Read result from future
            scraped_result = job.result()
            # Append to the list of results
            scraped_results.append(scraped_result)
        # Iterate over the scraped results and do whatever is needed
        for result in scraped_results:
            print("Do something with me {}".format(result))

if __name__ == "__main__":
    main()
```
As mentioned, this is just meant to demonstrate the pattern to follow; the rest should be straightforward.
Thanks for the responses! As it turns out, a very simple block of code from another question (Make 2 functions run at the same time) seems to do what I am after.
```python
from threading import Thread

def func1():
    print('Working')

def func2():
    print('Working')

if __name__ == '__main__':
    Thread(target=func1).start()
    Thread(target=func2).start()
```
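One caveat with that snippet: `start()` returns immediately, so if it is dropped into the original loop, `writer.save()` can run before the scrapers finish. Calling `join()` on each thread first avoids that. In this sketch, `func1`/`func2` are hypothetical stand-ins for the per-site functions:

```python
import threading

def func1(results):
    results.append('site 1 scraped')  # hypothetical per-site scraping work

def func2(results):
    results.append('site 2 scraped')

results = []
threads = [
    threading.Thread(target=func1, args=(results,)),
    threading.Thread(target=func2, args=(results,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all scrapers before saving the workbook
# writer.save() would be safe to call here
```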