Limiting/throttling the rate of HTTP requests in GRequests

I'm writing a small script in Python 2.7.3 with GRequests and lxml that will allow me to gather collectible card prices from various websites and compare them. The problem is that one of the websites limits the number of requests and sends back HTTP error 429 if I exceed it.

Is there a way to throttle the number of requests in GRequests so that I don't exceed the number of requests per second I specify? Also, how can I make GRequests retry after some time if an HTTP 429 occurs?

On a side note, their limit is ridiculously low: something like 8 requests per 15 seconds. I breached it with my browser on multiple occasions just by refreshing the page while waiting for price changes.

I'm going to answer my own question since I had to figure this out by myself and there seems to be very little info on this going around.

The idea is as follows. Every request object used with GRequests can take a session object as a parameter when created. Session objects, in turn, can have HTTP adapters mounted that are used when making requests. By creating our own adapter we can intercept requests and rate-limit them in whatever way we find best for our application. In my case I ended up with the code below.

Object used for throttling:

import datetime

DEFAULT_BURST_WINDOW = datetime.timedelta(seconds=5)
DEFAULT_WAIT_WINDOW = datetime.timedelta(seconds=15)


class BurstThrottle(object):
    """Allow up to max_hits requests within burst_window, then make callers
    wait until wait_window has also passed before starting a new burst."""
    max_hits = None
    hits = None
    burst_window = None
    total_window = None
    timestamp = None

    def __init__(self, max_hits, burst_window, wait_window):
        self.max_hits = max_hits
        self.hits = 0
        self.burst_window = burst_window
        self.total_window = burst_window + wait_window
        self.timestamp = datetime.datetime.min

    def throttle(self):
        """Return how long the caller should wait before sending a request."""
        now = datetime.datetime.utcnow()
        if now < self.timestamp + self.total_window:
            if (now < self.timestamp + self.burst_window) and (self.hits < self.max_hits):
                # Still inside the burst window with hits to spare: no wait.
                self.hits += 1
                return datetime.timedelta(0)
            else:
                # Burst exhausted: wait until the whole window has elapsed.
                return self.timestamp + self.total_window - now
        else:
            # The previous window is over: start a new burst.
            self.timestamp = now
            self.hits = 1
            return datetime.timedelta(0)
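
As a quick sanity check of the throttle on its own (this is just a sketch; the parameter values below are arbitrary, not the ones from my setup):

import datetime
import time

throttle = BurstThrottle(max_hits=8,
                         burst_window=datetime.timedelta(seconds=5),
                         wait_window=datetime.timedelta(seconds=10))

for i in range(12):
    wait = throttle.throttle()
    while wait > datetime.timedelta(0):
        time.sleep(wait.total_seconds())  # plain sleep here; the adapter below uses gevent.sleep
        wait = throttle.throttle()
    print 'request %d cleared the throttle' % i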

HTTP adapter:

import datetime

import gevent
import requests.adapters


class MyHttpAdapter(requests.adapters.HTTPAdapter):
    throttle = None

    def __init__(self, pool_connections=requests.adapters.DEFAULT_POOLSIZE,
                 pool_maxsize=requests.adapters.DEFAULT_POOLSIZE, max_retries=requests.adapters.DEFAULT_RETRIES,
                 pool_block=requests.adapters.DEFAULT_POOLBLOCK, burst_window=DEFAULT_BURST_WINDOW,
                 wait_window=DEFAULT_WAIT_WINDOW):
        self.throttle = BurstThrottle(pool_maxsize, burst_window, wait_window)
        super(MyHttpAdapter, self).__init__(pool_connections=pool_connections, pool_maxsize=pool_maxsize,
                                            max_retries=max_retries, pool_block=pool_block)

    def send(self, request, stream=False, timeout=None, verify=True, cert=None, proxies=None):
        request_successful = False
        response = None
        # Keep retrying until we get something other than HTTP 429,
        # sleeping whenever the throttle says to wait.
        while not request_successful:
            wait_time = self.throttle.throttle()
            while wait_time > datetime.timedelta(0):
                gevent.sleep(wait_time.total_seconds(), ref=True)
                wait_time = self.throttle.throttle()

            response = super(MyHttpAdapter, self).send(request, stream=stream, timeout=timeout,
                                                       verify=verify, cert=cert, proxies=proxies)

            if response.status_code != 429:
                request_successful = True

        return response

Setup:

# __CONCURRENT_LIMIT__ is a placeholder for the number of concurrent requests you
# allow; MyHttpAdapter is the adapter class defined above (here it lives in an
# 'adapter' module).
requests_adapter = adapter.MyHttpAdapter(
    pool_connections=__CONCURRENT_LIMIT__,
    pool_maxsize=__CONCURRENT_LIMIT__,
    max_retries=0,
    pool_block=False,
    burst_window=datetime.timedelta(seconds=5),
    wait_window=datetime.timedelta(seconds=20))

requests_session = requests.session()
requests_session.mount('http://', requests_adapter)
requests_session.mount('https://', requests_adapter)

unsent_requests = (grequests.get(url,
                                 hooks={'response': handle_response},
                                 session=requests_session) for url in urls)
grequests.map(unsent_requests, size=__CONCURRENT_LIMIT__)
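
For completeness: handle_response is just a normal requests response hook and isn't shown above. A hypothetical version might look like this:

def handle_response(response, *args, **kwargs):
    # Illustration only; do whatever parsing you need with the response here.
    if response.status_code == 200:
        print '%s -> %d bytes' % (response.url, len(response.content))
    else:
        print '%s -> HTTP %d' % (response.url, response.status_code)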

Take a look at RequestsThrottler for automatic request throttling: https://pypi.python.org/pypi/RequestsThrottler/0.2.2

You can set either a fixed delay between each request or a number of requests to send in a fixed amount of seconds (which is basically the same thing):

import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
with BaseThrottler(name='base-throttler', delay=1.5) as bt:
    throttled_requests = bt.multi_submit(reqs)

where the function multi_submit returns a list of ThrottledRequest objects (see the documentation link at the end).

You can then access the responses:

for tr in throttled_requests:
    print tr.response

Alternatively, you can achieve the same thing by specifying the number of requests to send in a fixed amount of time (e.g. 15 requests every 60 seconds):

import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
with BaseThrottler(name='base-throttler', reqs_over_time=(15, 60)) as bt:
    throttled_requests = bt.multi_submit(reqs)

Both solutions can be implemented without using the with statement:

import requests
from requests_throttler import BaseThrottler

request = requests.Request(method='GET', url='http://www.google.com')
reqs = [request for i in range(0, 5)]  # An example list of requests
bt = BaseThrottler(name='base-throttler', delay=1.5)
bt.start()
throttled_requests = bt.multi_submit(reqs)
bt.shutdown()

For more details: http://pythonhosted.org/RequestsThrottler/index.html

It doesn't look like there's any simple mechanism for handling this built in to the requests or grequests code. The only hook that seems to be around is for responses.

Here's a super hacky work-around, at least to prove it's possible: I modified grequests to keep a list of the times at which requests were issued and to sleep during the creation of the AsyncRequest until the requests per second were below the maximum.

# Patched AsyncRequest.__init__ in a local copy of grequests. The throttling
# prelude below was added at the top of __init__; `q` is a module-level list
# (q = []) added for this hack, and `time` and `gevent` need to be importable
# in that module.
class AsyncRequest(object):
    def __init__(self, method, url, **kwargs):
        print self, 'init'
        waiting = True
        while waiting:
            # Allow at most 8 requests in any 15-second window.
            if len([x for x in q if x > time.time() - 15]) < 8:
                q.append(time.time())
                waiting = False
            else:
                print self, 'snoozing'
                gevent.sleep(1)
        # ... the rest of the original __init__ continues unchanged

You can use grequests.imap() to watch this interactively:

import time
import rg  # the locally modified copy of grequests described above

urls = [
        'http://www.heroku.com',
        'http://python-tablib.org',
        'http://httpbin.org',
        'http://python-requests.org',
        'http://kennethreitz.com',
        'http://www.cnn.com',
]

def print_url(r, *args, **kwargs):
    print r.url, time.time()

hook_dict = dict(response=print_url)
rs = (rg.get(u, hooks=hook_dict) for u in urls)
for r in rg.imap(rs):
    print r

I wish there was a more elegant solution, but so far I can't find one. I've looked around in sessions and adapters. Maybe the poolmanager could be augmented instead?

Also, I wouldn't put this code in production: the 'q' list never gets trimmed and would eventually get pretty big (one possible way to trim it is sketched below). Plus, I don't know if it's actually working as advertised; it just looks like it is from the console output.

Ugh. Just looking at this code I can tell it's 3am. Time to go to bed.
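
For anyone who wants to reuse the hack, here is a rough sketch of how the 'q' list could be trimmed on each check so it doesn't grow without bound. The wait_for_slot helper is made up for this sketch and uses the same 8-requests-per-15-seconds numbers as above:

import time

import gevent

q = []  # module-level list of request timestamps, as in the hack above

def wait_for_slot(max_hits=8, window=15.0):
    """Block (via gevent.sleep) until fewer than max_hits requests were
    issued inside the last `window` seconds, dropping stale timestamps."""
    while True:
        now = time.time()
        # Trim timestamps that have fallen out of the window so q stays small.
        q[:] = [t for t in q if t > now - window]
        if len(q) < max_hits:
            q.append(now)
            return
        gevent.sleep(1)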

I had a similar problem. Here's my solution. In your case, I would do:

def worker():
    with rate_limit('slow.domain.com', 2):
        response = requests.get('https://slow.domain.com/path')
        text = response.text
    # Use `text`

Assuming you have multiple domains you're pulling from, I would set up a dictionary mapping each domain to its delay so you don't hit your rate limits (a small sketch of that mapping follows the code below).

This code assumes you're going to use gevent and monkey patching.

from contextlib import contextmanager

import gevent
from gevent.event import Event
from gevent.queue import Queue
from time import time


def rate_limit(resource, delay, _queues={}):
    """Delay use of `resource` until after `delay` seconds have passed.

    Example usage:

    def worker():
        with rate_limit('foo.bar.com', 1):
            response = requests.get('https://foo.bar.com/path')
            text = response.text
        # use `text`

    This will serialize and delay requests from multiple workers for resource
    'foo.bar.com' by 1 second.

    """

    # `_queues` (a mutable default argument) caches one watcher queue per resource.
    if resource not in _queues:
        queue = Queue()
        gevent.spawn(_watch, queue)
        _queues[resource] = queue

    return _resource_manager(_queues[resource], delay)


def _watch(queue):
    "Watch `queue` and wake event listeners after delay."

    last = 0

    while True:
        event, delay = queue.get()

        now = time()

        if (now - last) < delay:
            gevent.sleep(delay - (now - last))

        event.set()   # Wake worker but keep control.
        event.clear()
        event.wait()  # Yield control until woken.

        last = time()


@contextmanager
def _resource_manager(queue, delay):
    "`with` statement support for `rate_limit`."

    event = Event()
    queue.put((event, delay))

    event.wait() # Wait for queue watcher to wake us.

    yield

    event.set()  # Wake queue watcher.
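
As a small illustration of the per-domain delay mapping mentioned above (the domains, delays, and the fetch helper here are made up for illustration):

import requests

DOMAIN_DELAYS = {
    'slow.domain.com': 2,    # seconds between requests to this host
    'fast.domain.com': 0.5,
}

def fetch(url, domain):
    # Look up the per-domain delay, defaulting to 1 second for unknown hosts.
    with rate_limit(domain, DOMAIN_DELAYS.get(domain, 1)):
        return requests.get(url).text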
