[英]aiohttp: rate limiting requests-per-second by domain
I am writing a web crawler that is running parallel fetches for many different domains.我正在编写一个网络爬虫,它为许多不同的域运行并行提取。 I want to limit the number of requests-per-second that are made to each individual domain , but I do not care about the total number of connections that are open, or the total requests per second that are made across all domains.
我想限制对每个域发出的每秒请求数,但我不关心打开的连接总数,或者所有域中发出的每秒请求总数。 I want to maximize the number of open connections and requests-per-second overall, while limiting the number of requests-per-second made to individual domains.
我想最大限度地增加打开连接数和每秒请求数,同时限制对单个域的每秒请求数。
All of the currently existing examples I can find either (1) limit the number of open connections or (2) limit the total number of requests-per-second made in the fetch loop.我可以找到的所有当前存在的示例(1)限制打开连接的数量或(2)限制在获取循环中每秒发出的请求总数。 Examples include:
例子包括:
Neither of them do what I am requesting which is to limit requests-per-second on a per domain basis.他们都没有做我所要求的,即在每个域的基础上限制每秒请求数。 The first question only answers how to limit requests-per-second overall.
第一个问题只回答如何限制每秒请求数。 The second one doesn't even have answers to the actual question (the OP asks about requests per second and the answers all talk about limiting # of connections).
第二个甚至没有对实际问题的答案(OP 询问每秒请求数,答案都在谈论限制连接数)。
Here is the code that I tried, using a simple rate limiter I made for a synchronous version, which doesn't work when the DomainTimer code is run in an async event loop:这是我尝试的代码,使用我为同步版本制作的简单速率限制器,当 DomainTimer 代码在异步事件循环中运行时,它不起作用:
from collections import defaultdict
from datetime import datetime, timedelta
import asyncio
import async_timeout
import aiohttp
from urllib.parse import urlparse
from queue import Queue, Empty
from HTMLProcessing import processHTML
import URLFilters
SEED_URLS = ['http://www.bbc.co.uk', 'http://www.news.google.com']
url_queue = Queue()
for u in SEED_URLS:
url_queue.put(u)
# number of pages to download per run of crawlConcurrent()
BATCH_SIZE = 100
DELAY = timedelta(seconds = 1.0) # delay between requests from single domain, in seconds
HTTP_HEADERS = {'Referer': 'http://www.google.com',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'}
class DomainTimer():
def __init__(self):
self.timer = None
def resetTimer(self):
self.timer = datetime.now()
def delayExceeded(self, delay):
if not self.timer: #We haven't fetched this before
return True
if (datetime.now() - self.timer) >= delay:
return True
else:
return False
crawl_history = defaultdict(dict) # given a URL, when is last time crawled?
domain_timers = defaultdict(DomainTimer)
async def fetch(session, url):
domain = urlparse(url).netloc
print('here fetching ' + url + "\n")
dt = domain_timers[domain]
if dt.delayExceeded(DELAY) or not dt:
with async_timeout.timeout(10):
try:
dt.resetTimer() # reset domain timer
async with session.get(url, headers=HTTP_HEADERS) as response:
if response.status == 200:
crawl_history[url] = datetime.now()
html = await response.text()
return {'url': url, 'html': html}
else:
# log HTTP response, put into crawl_history so
# we don't attempt to fetch again
print(url + " failed with response: " + str(response.status) + "\n")
return {'url': url, 'http_status': response.status}
except aiohttp.ClientConnectionError as e:
print("Connection failed " + str(e))
except aiohttp.ClientPayloadError as e:
print("Recieved bad data from server @ " + url + "\n")
else: # Delay hasn't passed yet: skip for now & put @ end of q
url_queue.put(url);
return None
async def fetch_all(urls):
"""Launch requests for all web pages."""
tasks = []
async with aiohttp.ClientSession() as session:
for url in urls:
task = asyncio.ensure_future(fetch(session, url))
tasks.append(task) # create list of tasks
return await asyncio.gather(*tasks) # gather task responses
def batch_crawl():
"""Launch requests for all web pages."""
start_time = datetime.now()
# Here we build the list of URLs to crawl for this batch
urls = []
for i in range(BATCH_SIZE):
try:
next_url = url_queue.get_nowait() # get next URL from queue
urls.append(next_url)
except Empty:
print("Processed all items in URL queue.\n")
break;
loop = asyncio.get_event_loop()
asyncio.set_event_loop(loop)
pages = loop.run_until_complete(fetch_all(urls))
crawl_time = (datetime.now() - start_time).seconds
print("Crawl completed. Fetched " + str(len(pages)) + " pages in " + str(crawl_time) + " seconds.\n")
return pages
def parse_html(pages):
""" Parse the HTML for each page downloaded in this batch"""
start_time = datetime.now()
results = {}
for p in pages:
if not p or not p['html']:
print("Received empty page")
continue
else:
url, html = p['url'], p['html']
results[url] = processHTML(html)
processing_time = (datetime.now() - start_time).seconds
print("HTML processing finished. Processed " + str(len(results)) + " pages in " + str(processing_time) + " seconds.\n")
return results
def extract_new_links(results):
"""Extract links from """
# later we could track where links were from here, anchor text, etc,
# and weight queue priority based on that
links = []
for k in results.keys():
new_urls = [l['href'] for l in results[k]['links']]
for u in new_urls:
if u not in crawl_history.keys():
links.append(u)
return links
def filterURLs(urls):
urls = URLFilters.filterDuplicates(urls)
urls = URLFilters.filterBlacklistedDomains(urls)
return urls
def run_batch():
pages = batch_crawl()
results = parse_html(pages)
links = extract_new_links(results)
for l in filterURLs(links):
url_queue.put(l)
return results
There are no errors or exceptions thrown, and the rate-limiting code works fine in for synchronous fetches, but the DomainTimer has no apparent effect when run in async loop.没有抛出错误或异常,限速代码在同步提取中工作正常,但 DomainTimer 在异步循环中运行时没有明显影响。 The delay of one request-per-second per domain is not upheld...
不支持每个域每秒一个请求的延迟......
How would I modify this synchronous rate limiting code to work within the async event loop?我将如何修改此同步速率限制代码以在异步事件循环中工作? Thanks!
谢谢!
It's hard to debug your code since it contains many unrelated stuff, it's easier to show idea on a new simple example. 调试代码很难,因为它包含许多不相关的东西,更容易在一个新的简单示例上显示想法。
Main idea: 大意:
Semaphore
-like class using __aenter__
, __aexit__
that accepts url (domain) __aenter__
, __aexit__
编写__aenter__
Semaphore
的类 Lock
to prevent multiple requests to the same domain Lock
来防止对同一域的多个请求 Code: 码:
import asyncio
import aiohttp
from urllib.parse import urlparse
from collections import defaultdict
class Limiter:
# domain -> req/sec:
_limits = {
'httpbin.org': 4,
'eu.httpbin.org': 1,
}
# domain -> it's lock:
_locks = defaultdict(lambda: asyncio.Lock())
# domain -> it's last request time
_times = defaultdict(lambda: 0)
def __init__(self, url):
self._host = urlparse(url).hostname
async def __aenter__(self):
await self._lock
to_wait = self._to_wait_before_request()
print(f'Wait {to_wait} sec before next request to {self._host}')
await asyncio.sleep(to_wait)
async def __aexit__(self, *args):
print(f'Request to {self._host} just finished')
self._update_request_time()
self._lock.release()
@property
def _lock(self):
"""Lock that prevents multiple requests to same host."""
return self._locks[self._host]
def _to_wait_before_request(self):
"""What time we need to wait before request to host."""
request_time = self._times[self._host]
request_delay = 1 / self._limits[self._host]
now = asyncio.get_event_loop().time()
to_wait = request_time + request_delay - now
to_wait = max(0, to_wait)
return to_wait
def _update_request_time(self):
now = asyncio.get_event_loop().time()
self._times[self._host] = now
# request that uses Limiter instead of Semaphore:
async def get(url):
async with Limiter(url):
async with aiohttp.ClientSession() as session: # TODO reuse session for different requests.
async with session.get(url) as resp:
return await resp.text()
# main:
async def main():
coros = [
get('http://httpbin.org/get'),
get('http://httpbin.org/get'),
get('http://httpbin.org/get'),
get('http://httpbin.org/get'),
get('http://httpbin.org/get'),
get('http://eu.httpbin.org/get'),
get('http://eu.httpbin.org/get'),
get('http://eu.httpbin.org/get'),
get('http://eu.httpbin.org/get'),
get('http://eu.httpbin.org/get'),
]
await asyncio.gather(*coros)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
try:
loop.run_until_complete(main())
finally:
loop.run_until_complete(loop.shutdown_asyncgens())
loop.close()
I developed a library named octopus-api ( https://pypi.org/project/octopus-api/ ), that enables you to rate limit and set the number of connections to the endpoint using aiohttp under the hood.我开发了一个名为 octopus-api ( https://pypi.org/project/octopus-api/ ) 的库,它使您能够在后台使用 aiohttp 限制速率并设置到端点的连接数。 The goal of it is to simplify all the aiohttp setup needed.
它的目标是简化所有需要的 aiohttp 设置。
Here is an example of how to use it, where the get_ethereum is the user-defined request function.下面是一个如何使用它的示例,其中get_ethereum是用户定义的请求函数。 It could have also been a web crawler function request or what ever fits:
它也可能是网络爬虫功能请求或适合的内容:
from octopus_api import TentacleSession, OctopusApi
from typing import Dict, List
if __name__ == '__main__':
async def get_ethereum(session: TentacleSession, request: Dict):
async with session.get(url=request["url"], params=request["params"]) as response:
body = await response.json()
return body
client = OctopusApi(rate=50, resolution="sec", connections=6)
result: List = client.execute(requests_list=[{
"url": "https://api.pro.coinbase.com/products/ETH-EUR/candles?granularity=900&start=2021-12-04T00:00:00Z&end=2021-12-04T00:00:00Z",
"params": {}}] * 1000, func=get_ethereum)
print(result)
The TentacleSession works the same as how you write POST, GET, PUT and PATCH for aiohttp.ClientSession. TentacleSession 的工作方式与您为 aiohttp.ClientSession 编写 POST、GET、PUT 和 PATCH 的方式相同。
Let me know if it helps your issue related to rate limits and connection for crawling.让我知道它是否有助于您解决与速率限制和爬行连接相关的问题。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.