How can I write a never-ending job in Rails (web scraping)?

Goal: I want to build a web scraper in a Rails app that runs indefinitely and can be scaled.

Current stack the app is running on: Ruby on Rails, Heroku, Redis, Postgres.

Idea: I was thinking of running a Sidekiq job every n minutes that checks whether any proxies are available to scrape with (these will be stored in a table with a status of sleeping or scraping).

If a proxy is available, the job then checks (using the Sidekiq API) whether any workers are free to start another job that scrapes with that proxy.

This means I could scale the scraper by increasing the number of workers and the number of available proxies. If a scraping job fails for any reason, the job that looks for available proxies will simply start it again.
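The dispatch step described above could be sketched roughly as follows. This is a minimal illustration in plain Ruby, not a tested production design: `Proxy`, `free_worker_slots`, and `dispatch` are hypothetical names, and the in-memory struct stands in for the proxies table. In a real app the process list would presumably come from `Sidekiq::ProcessSet.new`, whose entries expose `"concurrency"` and `"busy"` counts, and the block would enqueue a job with something like `perform_async`.

```ruby
# Hypothetical stand-in for a row in the proxies table.
Proxy = Struct.new(:address, :status)

# Count free worker threads across all Sidekiq processes.
# `processes` is any enumerable whose items answer ["concurrency"] and ["busy"];
# assumption: in a real app this would be Sidekiq::ProcessSet.new.
def free_worker_slots(processes)
  processes.sum { |process| process["concurrency"].to_i - process["busy"].to_i }
end

# Mark sleeping proxies as scraping and hand each one to the block,
# stopping once the free capacity is used up.
def dispatch(proxies, free_slots)
  sleeping = proxies.select { |proxy| proxy.status == "sleeping" }
  sleeping.first([free_slots, 0].max).each do |proxy|
    proxy.status = "scraping"
    yield proxy # e.g. a hypothetical ScrapeJob.perform_async(proxy.address)
  end
end
```

With a real database, claiming a proxy would need to be atomic (for example a conditional update) so that two dispatcher runs cannot hand out the same proxy.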

Questions: Is this the best solution for my goal? Is using long-running Sidekiq jobs a good idea, or could this blow up?

Sidekiq is designed to run individual jobs, each of which is a "unit of work" for your organization.

You can build your own loop and, inside that loop, create a job for each page to scrape, but the loop itself should not be a job.
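As a rough sketch of that split, the loop below runs in its own long-lived process and only enqueues work, so each job stays short. Everything here is illustrative: `run_scrape_loop`, `next_pages`, and `enqueue` are hypothetical names, and `max_iterations` exists only so the loop can be stopped in a test or a console session.

```ruby
# Runs in its own long-lived process, not inside a Sidekiq job.
# `next_pages` returns the URLs that are due; `enqueue` creates one job per URL.
# Both are hypothetical callables supplied by the application.
def run_scrape_loop(next_pages:, enqueue:, interval: 60, max_iterations: nil)
  iterations = 0
  loop do
    # One small job per page, so a failure affects a single page only.
    next_pages.call.each { |url| enqueue.call(url) }
    iterations += 1
    break if max_iterations && iterations >= max_iterations
    sleep interval
  end
  iterations
end
```

In an app using Sidekiq, `enqueue` might be a lambda such as `->(url) { ScrapePageJob.perform_async(url) }`, where `ScrapePageJob` is an assumed job class that scrapes a single page.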

If you want a job to run every n minutes, you can schedule it.

And since you're using Heroku, there is an add-on for that: https://devcenter.heroku.com/articles/scheduler

Another solution would be to set up cron jobs and schedule them with the whenever gem.


 