
How can I schedule or queue API calls to stay within a rate limit?

I am trying to continuously crawl a large amount of information from a site using the REST API they provide. I have the following constraints:

  1. Stay within the API limit (5 calls/sec)
  2. Utilise the full limit (make exactly 5 calls per second, i.e. 5*60 = 300 calls per minute)
  3. Each call will be with different parameters (params will be fetched from db or in-memory cache)
  4. Calls will be made from AWS EC2 (or GAE) and processed data will be stored in AWS RDS/DynamoDB

For now I am just using a scheduled task that runs a Python script every minute; the script makes 10-20 API calls, processes the responses, and stores the data in the DB. I want to scale this procedure (to 5*60 = 300 calls per minute) and make it manageable via code (pushing new tasks, pausing/resuming them easily, monitoring failures, changing the call frequency).

My question is: what are the best available tools to achieve this? Any suggestions, guidance, or links are appreciated.

I do know the names of some task queuing frameworks like Celery/RabbitMQ/Redis, but I do not know much about them. However, I am willing to learn one or all of them if they are the best tools to solve my problem; I want to hear from SO veterans before jumping in ☺
Also, please let me know if there is any other AWS service I should look at (SQS or AWS Data Pipeline?) to make any step easier.

You needn't add an external dependency just for rate-limiting, as your use case is rather straightforward.

I can think of two options:

  • Modify the script (that currently wakes up every minute and makes 10-20 API calls) to wake up every second and make 5 calls (sequentially or in parallel).
    • In your current design, your API calls might not be evenly distributed across the minute, i.e. you might be making all 10-20 calls in the first, say, 20 seconds.
    • If you change that script to run every second, your API call rate will be more balanced.
  • Change your Python script to a long-running daemon and use a rate-limiter library. You can configure the limiter to make 1 call per x seconds (x = 0.2 for 5 calls/sec).
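
The second option can be sketched without any third-party library. The snippet below is only a minimal illustration, assuming a single-process daemon: `IntervalRateLimiter` and `run_calls` are hypothetical names (not from any existing package), and `call_api` stands in for whatever function actually performs the request. It spaces calls evenly (one every 0.2 s for 5 calls/sec) instead of bursting 5 calls at the top of each second.

```python
import time
from typing import Callable, Iterable, List


class IntervalRateLimiter:
    """Hypothetical helper: spaces calls evenly, at most `rate` calls per `per` seconds."""

    def __init__(self, rate: int = 5, per: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = per / rate      # e.g. 1.0 / 5 = 0.2 s between calls
        self._clock = clock             # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_slot = clock()       # earliest time the next call may start

    def wait(self) -> None:
        """Block until the next call slot is available."""
        now = self._clock()
        delay = self._next_slot - now
        if delay > 0:
            self._sleep(delay)
        # Schedule the following slot from whichever is later, so a stall
        # (slow response, paused process) is not followed by a burst.
        self._next_slot = max(now, self._next_slot) + self.interval


def run_calls(params: Iterable[dict],
              call_api: Callable[[dict], object],
              limiter: IntervalRateLimiter) -> List[object]:
    """Make one API call per parameter set, never faster than the limiter allows."""
    results = []
    for p in params:
        limiter.wait()
        results.append(call_api(p))
    return results
```

In a daemon, `params` could be a generator that keeps pulling parameter sets from the DB or cache, e.g. `run_calls(param_source, call_api, IntervalRateLimiter(rate=5, per=1.0))`. Because calls here are sequential, the actual rate drops below 5/sec whenever a response takes longer than 0.2 s; reaching the full limit in that case would require issuing the calls in parallel (threads or async) behind the same limiter.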

